Hands-On Big Data Analytics with PySpark
Rudy Lai, Bartłomiej Potaczek
Getting data into Spark
Next, load the KDD Cup data into PySpark using sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")
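This assumes that sc, the SparkContext, already exists, which it does inside the PySpark shell. If you are instead running the code as a standalone script, a minimal sketch for creating the context yourself might look like the following (the application name and the local[*] master URL are illustrative choices, not from the book):

from pyspark import SparkConf, SparkContext

# Hypothetical standalone setup: run locally on all available cores.
# Inside the PySpark shell, sc is already created for you, so skip this.
conf = SparkConf().setMaster("local[*]").setAppName("kddcup-exploration")
sc = SparkContext(conf=conf)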
In the following command, we can see that the raw data is now held in the raw_data variable:
raw_data
This produces the output shown in the following snippet:
./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0
Entering the raw_data variable gives us details about the underlying file, kddcup.data.gz, and shows that the result is a MapPartitionsRDD, the type of RDD that textFile creates.
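Because textFile is lazy, no data has actually been read from disk at this point; an action is needed to trigger the computation. As a quick sanity check, we could run the standard RDD actions count and take (a sketch, not from the book):

# Actions force Spark to actually read the file.
print(raw_data.count())        # total number of records (full pass over the file)
for line in raw_data.take(2):  # fetch only the first two raw CSV lines
    print(line)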
Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.