Learning Spark SQL
Aurobindo Sarkar
Using Spark with MongoDB (NoSQL database)
In this section, we will use Spark with one of the most popular NoSQL databases - MongoDB. MongoDB is a distributed document database that stores data in a JSON-like format. Unlike the rigid schemas of relational databases, the data structures in MongoDB are far more flexible, and stored documents can have arbitrary fields. This flexibility, combined with high availability and scalability features, makes it a good choice for storing data in many applications. It is also free and open-source software.
The New York City schools directory dataset for this example has been taken from the New York City Open Data website and can be downloaded from https://nycplatform.socrata.com/data?browseSearch=&scope=&agency=&cat=education&type=datasets.
After you have your MongoDB database server up and running, fire up the MongoDB shell. In the following steps, we will create a new database, define a collection, and insert the New York City schools data using the MongoDB import utility from the command line.
When you start the MongoDB shell, it displays the server and shell version information and presents a > prompt.

Next, execute the use <DATABASE> command to select an existing database or create a new one, if it does not exist.
>use nycschoolsDB
switched to db nycschoolsDB
The mongoimport utility needs to be executed from the command prompt (not from within the MongoDB shell):
mongoimport --host localhost --port 27017 --username <your user name here> --password "<your password here>" --collection schools --db nycschoolsDB --file <your download file name here>
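For example, if the file you downloaded from the Open Data site is a CSV export (the file name below is purely illustrative), you would also pass the --type csv and --headerline options so that the column headers become field names:
mongoimport --host localhost --port 27017 --username <your user name here> --password "<your password here>" --collection schools --db nycschoolsDB --type csv --headerline --file nyc_school_directory.csv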
You can list the imported collection and print a record to validate the import operation, as follows:
>show collections
schools
>db.schools.findOne()

You can download the mongo-spark-connector jar for Spark 2.2 (mongo-spark-connector_2.11-2.2.0-assembly.jar) from http://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/.
Next, start the Spark shell with the mongo-spark-connector_2.11-2.2.0-assembly.jar file specified on the command line:
./bin/spark-shell --jars /Users/aurobindosarkar/Downloads/mongo-spark-connector_2.11-2.2.0-assembly.jar
scala> import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.{SparkConf, SparkContext}
scala> import com.mongodb.spark.MongoSpark
scala> import com.mongodb.spark.config.{ReadConfig, WriteConfig}
Next, we define the URIs for read and write operations from Spark:
scala>val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.schools?readPreference=primaryPreferred"))
scala>val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost:27017/nycschoolsDB.outCollection"))
Define a case class for the school record, as follows:
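The exact fields depend on the columns in your downloaded file; a minimal sketch, using only a handful of illustrative (assumed) field names from the schools directory dataset, could look like this:
scala> case class School(dbn: String, school_name: String, boro: String, building_code: String, phone_number: String, total_students: String)
The case class field names should match the keys in the stored documents so that toDF[School] can map them by name.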

Next, create a DataFrame from the collection and display a record from the newly created DataFrame:
scala>val schoolsDF = MongoSpark.load(sc, readConfig).toDF[School]
scala>schoolsDF.take(1).foreach(println)
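As a further sketch, assuming the column names below exist in your documents, you can register the DataFrame as a temporary view, query it with Spark SQL, and write the result to the output collection with MongoSpark.save and the writeConfig defined earlier:
scala> schoolsDF.createOrReplaceTempView("schools")
scala> val resultsDF = spark.sql("SELECT school_name, boro FROM schools ORDER BY school_name") // assumed column names
scala> resultsDF.show(5)
scala> MongoSpark.save(resultsDF, writeConfig) // writes to nycschoolsDB.outCollection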


In the following sections, we describe using Spark with several popular big data file formats.