
Sampling with the RDD API

In this section, we use RDDs for creating stratified samples with and without replacement.

First, we create an RDD from our DataFrame:

We can specify the fractions of each record-type in our sample, as illustrated:

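Next, we use the sampleByKey and sampleByKeyExact methods to create our samples. The former makes one pass over the data and returns a sample whose size for each key is only approximately the requested fraction. The latter returns exactly the requested number of records for each key, at the cost of additional work. The first parameter of both methods specifies whether the sample is drawn with or without replacement: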

Next, we print the total number of records in the population and in each of the samples. Notice that sampleByKeyExact returns exactly the number of records implied by the specified fractions:
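To make the difference between the two approaches concrete, here is a minimal sketch in plain Python rather than Spark. It is not Spark's implementation: the function names, the hypothetical "yes"/"no" population, and the fractions are all assumptions made for illustration, and only the without-replacement case is shown. The approximate version keeps each record independently with its key's probability, while the exact version draws a fixed number of records from every key.

```python
import math
import random
from collections import Counter

def approx_sample_by_key(pairs, fractions, seed):
    """Keep each (key, value) record independently with its key's probability."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

def exact_sample_by_key(pairs, fractions, seed):
    """Draw exactly ceil(fraction * stratum_size) records from every key."""
    rng = random.Random(seed)
    strata = {}
    for k, v in pairs:
        strata.setdefault(k, []).append(v)
    sample = []
    for k, values in strata.items():
        size = math.ceil(fractions[k] * len(values))
        # rng.sample draws without replacement
        sample.extend((k, v) for v in rng.sample(values, size))
    return sample

# Hypothetical population: 1,000 "yes" records and 8,000 "no" records.
population = [("yes", i) for i in range(1000)] + [("no", i) for i in range(8000)]
fractions = {"yes": 0.5, "no": 0.25}

approx = approx_sample_by_key(population, fractions, seed=42)
exact = exact_sample_by_key(population, fractions, seed=42)

approx_counts = Counter(k for k, _ in approx)
exact_counts = Counter(k for k, _ in exact)

print("population:", Counter(k for k, _ in population))
print("approximate sample:", approx_counts)  # close to 500 / 2000, rarely equal
print("exact sample:", exact_counts)         # exactly 500 / 2000
```

The exact variant has to know the size of each stratum before it can draw, which is why it is the more expensive of the two.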

The sample method creates a random sample containing a specified fraction of the records. Next, we create a sample with replacement containing 10% of the total records:
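As a rough illustration of what sampling with replacement means, the following plain-Python sketch draws a 10% sample from a hypothetical population of 10,000 record IDs. It is an assumption-laden stand-in, not the Spark call itself: it draws a fixed number of records, whereas a fraction-based sampler on a distributed dataset generally returns only approximately that many.

```python
import random

# Hypothetical population of 10,000 record IDs.
population = list(range(10_000))
fraction = 0.1

rng = random.Random(7)
k = int(len(population) * fraction)

# choices() draws with replacement, so the same record can be picked more than once.
sample = rng.choices(population, k=k)

print("sample size:", len(sample))
print("distinct records:", len(set(sample)))
```

Because records are returned to the pool after each draw, the number of distinct records in the sample is smaller than the sample size.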

Other statistical operations, such as hypothesis testing, random data generation, and visualizing probability distributions, will be covered in later chapters. In the next section, we will explore our data by using Spark SQL to create pivot tables.
