
Using Spark with MongoDB (NoSQL database)

In this section, we will use Spark with one of the most popular NoSQL databases: MongoDB. MongoDB is a distributed document database that stores data in a JSON-like format. Unlike the rigid schemas of relational databases, the data structure in MongoDB is far more flexible, and the stored documents can have arbitrary fields. This flexibility, combined with its high availability and scalability features, makes it a good choice for storing data in many applications. It is also free and open-source software.

If you do not have MongoDB already installed and available, then you can download it from https://www.mongodb.org/downloads. Follow the installation instructions for your specific OS to install the database.

The New York City schools directory dataset for this example has been taken from the New York City Open Data website and can be downloaded from https://nycplatform.socrata.com/data?browseSearch=&scope=&agency=&cat=education&type=datasets.

After you have your MongoDB database server up and running, fire up the MongoDB shell. In the following steps, we will create a new database, define a collection, and insert the New York City schools data using the MongoDB import utility from the command line.

When you start the MongoDB shell, you should see its startup messages (the shell and server versions, and the connection details), followed by the > prompt.

Next, execute the use <DATABASE> command to select an existing database or create a new one, if it does not exist.

If you make a mistake while creating a new collection, you can use the db.dropDatabase() and/or db.collection.drop() commands to delete the database and/or the collection, respectively, and then recreate it with the required changes.
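For example, the following hypothetical session drops the schools collection (created further below) and then the database itself; the exact output can vary with your MongoDB version:

> db.schools.drop()
true
> db.dropDatabase()
{ "dropped" : "nycschoolsDB", "ok" : 1 }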
> use nycschoolsDB
switched to db nycschoolsDB

The mongoimport utility needs to be executed from the command prompt (and not from the MongoDB shell):


mongoimport --host localhost --port 27017 --username <your user name here> --password "<your password here>" --collection schools --db nycschoolsDB --file <your download file name here>

You can list the imported collection and print a record to validate the import operation, as follows:

> show collections
schools
> db.schools.findOne()

You can download the mongo-spark-connector jar for Spark 2.2 (mongo-spark-connector_2.11-2.2.0-assembly.jar) from http://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/.

Next, start the Spark shell with the mongo-spark-connector_2.11-2.2.0-assembly.jar file specified on the command line:

./bin/spark-shell --jars /Users/aurobindosarkar/Downloads/mongo-spark-connector_2.11-2.2.0-assembly.jar
scala> import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.{SparkConf, SparkContext}
scala> import com.mongodb.spark.MongoSpark
scala> import com.mongodb.spark.config.{ReadConfig, WriteConfig}
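Note that, instead of passing the assembly jar explicitly, you can also let Spark resolve the connector from Maven Central at startup using the --packages option with the coordinates corresponding to the jar above:

./bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0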

Next, we define the URIs for read and write operations from Spark:

scala> val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.schools?readPreference=primaryPreferred"))

scala> val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.outCollection"))

Define a case class for the school record, as follows:
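A minimal sketch of such a case class is shown below; the field names (dbn, school_name, boro) are assumptions based on typical columns in the NYC schools directory dataset, and they should match the keys present in the imported documents:

scala> case class School(dbn: String, school_name: String, boro: String)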

Next, we create a DataFrame from our collection and display a record from it:

scala> val schoolsDF = MongoSpark.load(sc, readConfig).toDF[School]

scala> schoolsDF.take(1).foreach(println)
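
You can also register the DataFrame as a temporary view to query it with Spark SQL, and persist results back to MongoDB using the writeConfig defined earlier. The following is a minimal sketch, assuming the schoolsDF DataFrame created above:

scala> schoolsDF.createOrReplaceTempView("schools")
scala> spark.sql("SELECT count(*) FROM schools").show()
scala> MongoSpark.save(schoolsDF, writeConfig)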
Note: The following sections will be updated with the latest versions of the connector packages later.

In the next several sections, we describe using Spark with several popular big data file formats.
