
Getting the data from the repository to Spark

We can follow these steps to download the dataset and load it in PySpark:

  1. Click on Data Folder.
  2. You will be redirected to a folder that has various files as follows:

You can see that there's kddcup.data.gz, and there is also a 10% sample of that data available in kddcup.data_10_percent.gz. We will be working with the full dataset. To work with the full dataset, right-click on kddcup.data.gz, select Copy link address, and then go back to the PySpark console and import the data.

Let's take a look at how this works using the following steps:

  1. After launching PySpark, the first thing we need to do is import urllib, which is a library that allows us to interact with resources on the internet, as follows:
import urllib.request
  2. The next thing to do is use this request library to pull some resources from the internet, as shown in the following code:
f = urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data.gz", "kddcup.data.gz")

This command will take some time to process. Once the file has been downloaded, the Python prompt returns and the console becomes active again.
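If you want to confirm that the download finished before moving on, a quick check like the following can help. This is just a sketch, assuming the file was saved as kddcup.data.gz in the current working directory:

import os

# Check that the compressed archive is on disk and print its size in megabytes
print(os.path.exists("kddcup.data.gz"))
print(round(os.path.getsize("kddcup.data.gz") / 1e6, 1), "MB")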

  3. Next, load this file using SparkContext. SparkContext is materialized, or objectified, in Python as the sc variable, as follows:
sc

The output is demonstrated in the following code snippet:

SparkContext
Spark UI

Version: v2.3.3
Master: local[*]
AppName: PySparkShell
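With the SparkContext available as sc, the downloaded archive can be read straight into an RDD; textFile handles gzip-compressed files transparently. The following is a minimal sketch, assuming the file sits in the current working directory and using raw_data as an illustrative variable name:

raw_data = sc.textFile("./kddcup.data.gz")

# Quick sanity checks: count the records and look at the first line
print(raw_data.count())
print(raw_data.take(1))

Counting the full dataset takes a moment because Spark has to scan every record; pointing textFile at kddcup.data_10_percent.gz instead is a quick way to iterate while exploring.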