
Creating RDDs

One simple example of using sc is the sc.parallelize function, which takes a hardcoded set of data and makes an RDD out of it. But that's not very interesting, and it's not really going to be useful in a real production setting, because if you can hardcode the data, then it's not really a big dataset to begin with, now is it? More often, we'll use something like sc.textFile to create an RDD object. So, for example, if I have a giant text file full of, oh I don't know, movie ratings data on my hard drive, we can use sc.textFile to create an RDD object from the SparkContext, and then we can just use that RDD object going forward and process it:

sc.textFile("file:///c:/users/frank/gobs-o-text.txt") 
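
As a quick aside, here's what the sc.parallelize approach mentioned above looks like in practice. This is just a minimal sketch; the list of numbers is made-up sample data:

nums = sc.parallelize([1, 2, 3, 4])   # distribute a hardcoded list as an RDD
squared = nums.map(lambda x: x * x)   # a transformation; nothing runs yet
print(squared.collect())              # the collect() action returns [1, 4, 9, 16]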

Now, again, if I have a set of information that fits on my computer, that's not really big data either. You can also create a text file RDD from an s3n location or from an HDFS URI. These are both examples of distributed file systems that can handle much larger datasets than we could fit on one machine. So you can just as easily use an s3n or an HDFS URI, as well as the file URI, to load up data from a cluster or from a distributed file system, as well as from a simple file sitting on the same machine as your driver script.
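For example, the same sc.textFile call works against distributed storage; the bucket name, host, and paths below are just hypothetical placeholders:

sc.textFile("s3n://my-bucket/gobs-o-text.txt")           # load from Amazon S3
sc.textFile("hdfs://namenode:9000/data/gobs-o-text.txt") # load from HDFS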

You can also create RDDs from Hive:

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)  # connect to an existing Hive metastore
rows = hiveCtx.sql("SELECT name, age FROM users")  # query a Hive table with SQL

If you have a HiveContext object that's already been connected to an existing Hive repository, you can create an RDD from that. If you don't know what Hive is, don't worry about it: Hive is basically another thing that runs on top of Hadoop for data warehousing. You can also create RDDs from things such as JDBC; you can tie in directly to any SQL database that has a JDBC or ODBC interface. You can also use popular NoSQL databases such as Cassandra, and Spark has interfaces for things such as HBase, Elasticsearch, and a lot of other things that are growing all the time. Basically, any data format that you can access from Python or from Java, depending on what language you're using, you can access through Spark as well, so you can load up JSON information and comma-separated value lists. You can also talk to things like sequence files and object files, and load compressed formats directly. So there are a lot of ways to create an RDD; pretty much whatever format your source data might be in, the odds are that you can create an RDD from it in Spark pretty easily.
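To make the JSON and comma-separated cases concrete, here's a minimal sketch using plain sc.textFile plus ordinary Python parsing; the file names and field layout are made up for illustration:

import json

# One JSON object per line becomes an RDD of Python dicts.
people = sc.textFile("file:///c:/users/frank/people.json").map(json.loads)

# A comma-separated file becomes an RDD of lists of fields.
ratings = sc.textFile("file:///c:/users/frank/ratings.csv").map(lambda line: line.split(","))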
