
Using Spark with MongoDB (NoSQL database)

In this section, we will use Spark with one of the most popular NoSQL databases: MongoDB. MongoDB is a distributed document database that stores data in a JSON-like format. Unlike the rigid schemas of relational databases, the data structure in MongoDB is far more flexible, and the stored documents can have arbitrary fields. This flexibility, combined with its high availability and scalability features, makes it a good choice for storing data in many applications. It is also free and open-source software.

If you do not have MongoDB already installed and available, then you can download it from https://www.mongodb.org/downloads. Follow the installation instructions for your specific OS to install the database.

The New York City schools directory dataset for this example has been taken from the New York City Open Data website and can be downloaded from https://nycplatform.socrata.com/data?browseSearch=&scope=&agency=&cat=education&type=datasets.

After you have your MongoDB database server up and running, fire up the MongoDB shell. In the following steps, we will create a new database, define a collection, and insert the New York City schools data using the MongoDB import utility from the command line.

When you start the MongoDB shell, you should see its startup messages (the shell and server versions, and the connection details), followed by the > prompt.

Next, execute the use <DATABASE> command to select an existing database or create a new one, if it does not exist.

If you make a mistake while creating a new collection, you can use the db.dropDatabase() and/or db.collection.drop() commands to delete the database and/or the collection, respectively, and then recreate it with the required changes.
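For example, the following hypothetical session drops the schools collection (created further below) and then the database itself; the exact output can vary with your MongoDB version:

> db.schools.drop()
true
> db.dropDatabase()
{ "dropped" : "nycschoolsDB", "ok" : 1 }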
> use nycschoolsDB
switched to db nycschoolsDB

The mongoimport utility needs to be executed from the command prompt (and not from the MongoDB shell):


mongoimport --host localhost --port 27017 --username <your user name here> --password "<your password here>" --collection schools --db nycschoolsDB --file <your download file name here>

You can list the imported collection and print a record to validate the import operation, as follows:

> show collections
schools
> db.schools.findOne()

You can download the mongo-spark-connector jar for Spark 2.2 (mongo-spark-connector_2.11-2.2.0-assembly.jar) from http://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/.

Next, start the Spark shell with the mongo-spark-connector_2.11-2.2.0-assembly.jar file specified on the command line:

./bin/spark-shell --jars /Users/aurobindosarkar/Downloads/mongo-spark-connector_2.11-2.2.0-assembly.jar
scala> import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.{SparkConf, SparkContext}
scala> import com.mongodb.spark.MongoSpark
scala> import com.mongodb.spark.config.{ReadConfig, WriteConfig}
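Note that, instead of passing the assembly jar explicitly, you can also let Spark resolve the connector from Maven Central at startup using the --packages option with the coordinates corresponding to the jar above:

./bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0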

Next, we define the URIs for read and write operations from Spark:

scala> val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.schools?readPreference=primaryPreferred"))

scala> val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.outCollection"))

Define a case class for the school record, as follows:
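A minimal sketch of such a case class is shown below; the field names (dbn, school_name, boro) are assumptions based on typical columns in the NYC schools directory dataset, and they should match the keys present in the imported documents:

scala> case class School(dbn: String, school_name: String, boro: String)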

Next, we create a DataFrame from our collection and display a record from it:

scala> val schoolsDF = MongoSpark.load(sc, readConfig).toDF[School]

scala> schoolsDF.take(1).foreach(println)
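
You can also register the DataFrame as a temporary view to query it with Spark SQL, and persist results back to MongoDB using the writeConfig defined earlier. The following is a minimal sketch, assuming the schoolsDF DataFrame created above:

scala> schoolsDF.createOrReplaceTempView("schools")
scala> spark.sql("SELECT count(*) FROM schools").show()
scala> MongoSpark.save(schoolsDF, writeConfig)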
Note: The following sections will be updated with the latest versions of the connector packages later.

In the next several sections, we describe using Spark with several popular big data file formats.
