
Getting data into Spark

  1. Next, load the KDD cup data into PySpark using sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")

  2. In the following command, we can see that the raw data is now in the raw_data variable:
raw_data

This output is as demonstrated in the following code snippet:

./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

If we enter the raw_data variable, it shows that raw_data is a MapPartitionsRDD backed by the kddcup.data.gz file; it is a reference to where the underlying data lives, not the data itself, because Spark evaluates RDDs lazily and will only read the file when an action is triggered.
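Because the gzipped file is plain line-oriented text, sc.textFile can read it directly, producing one record per line. A minimal sketch of that behavior without a Spark cluster, using Python's gzip module and a small hypothetical sample in the KDD Cup comma-separated format (the field values here are illustrative, not taken from the real dataset):

```python
import gzip
import os
import tempfile

# Hypothetical sample: two KDD-Cup-style comma-separated records,
# written to a gzipped file to mimic the shape of kddcup.data.gz.
sample = "0,tcp,http,SF,215,45076,normal.\n0,tcp,http,SF,162,4528,normal.\n"
path = os.path.join(tempfile.mkdtemp(), "sample.data.gz")
with gzip.open(path, "wt") as f:
    f.write(sample)

# sc.textFile(path) would yield these same strings, one element per line;
# here we read them eagerly to show what each record looks like.
with gzip.open(path, "rt") as f:
    lines = f.read().splitlines()

print(len(lines))   # 2
print(lines[0])     # 0,tcp,http,SF,215,45076,normal.
```

In Spark itself, the equivalent check is an action such as raw_data.take(2), which forces the lazy RDD to read the first two lines from the file.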

Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.
