- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:48
Sampling with the RDD API
In this section, we use RDDs to create stratified samples, both with and without replacement.
First, we create an RDD from our DataFrame:
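A minimal sketch of this step, assuming a small hypothetical DataFrame with `id`, `category`, and `amount` columns (the names and values are illustrative only, not the book's dataset). Each row is keyed by the column that defines the strata, because the stratified sampling methods operate on key-value RDDs.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-from-dataframe-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data; "category" plays the role of the record type (stratum).
val df = Seq(
  (1, "A", 10.0), (2, "A", 12.5), (3, "B", 7.0),
  (4, "B", 3.2), (5, "C", 8.8), (6, "A", 1.1)
).toDF("id", "category", "amount")

// df.rdd yields an RDD[Row]; map each Row to a (key, Row) pair.
val rowsRDD = df.rdd.map(r => (r.getAs[String]("category"), r))

rowsRDD.take(3).foreach(println)
```

Keeping the whole `Row` as the value is one choice; projecting only the needed fields into a tuple or list would work equally well.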

Next, we specify a sampling fraction for each record type (key). Each fraction is the rate at which records with that key are drawn into the sample.
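As a sketch, the fractions can be held in a plain Scala `Map` from key to sampling rate. The keys `A`, `B`, and `C` and the rates below are hypothetical.

```scala
// Hypothetical strata and rates: each value is the sampling rate for that key's records.
val fractions: Map[String, Double] = Map("A" -> 0.5, "B" -> 0.25, "C" -> 0.1)

// The rates are applied per key, independently of one another,
// so they do not need to sum to 1.
fractions.toSeq.sortBy(_._1).foreach { case (key, rate) =>
  println(s"$key -> $rate")
}
```

In this sketch every key that occurs in the data has an entry in the map, which keeps the lookup of a record's rate well defined.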

We use the sampleByKey and sampleByKeyExact methods to create our samples. The former produces an approximate sample, while the latter produces an exact one. In both methods, the first parameter specifies whether the sample is generated with or without replacement.
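The two calls might look as follows. This is a sketch on a hypothetical pair RDD of 1,000 integers keyed by stratum `A`, `B`, or `C`; the data, the rates, and the seed value are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stratified-sampling-sketch")
  .getOrCreate()

// Hypothetical pair RDD: 600 records with key "A", 300 with "B", 100 with "C".
val keyed = spark.sparkContext.parallelize(0 until 1000).map { i =>
  val key = if (i % 10 < 6) "A" else if (i % 10 < 9) "B" else "C"
  (key, i)
}
val fractions = Map("A" -> 0.5, "B" -> 0.25, "C" -> 0.1)

// Approximate: per-stratum sizes only come close to rate * stratum size.
val approxSample = keyed.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)

// Exact: costs extra passes over the data, but targets ceil(rate * stratum size) per key.
val exactSample = keyed.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)

approxSample.take(5).foreach(println)
exactSample.take(5).foreach(println)
```

Passing `withReplacement = true` instead allows the same record to appear more than once in a sample. Fixing the seed makes a run repeatable on the same data and partitioning.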

Next, we print the total number of records in the population and in each of the samples. Notice that sampleByKeyExact returns exactly the number of records per key implied by the specified fractions, whereas sampleByKey only approximates it.
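One way to make this comparison is with `countByKey`, sketched here on the same hypothetical 1,000-record pair RDD. The target sizes are computed from the fractions so they can be read next to the observed counts.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-counts-sketch")
  .getOrCreate()

// Hypothetical pair RDD: 600 records with key "A", 300 with "B", 100 with "C".
val keyed = spark.sparkContext.parallelize(0 until 1000).map { i =>
  val key = if (i % 10 < 6) "A" else if (i % 10 < 9) "B" else "C"
  (key, i)
}
val fractions = Map("A" -> 0.5, "B" -> 0.25, "C" -> 0.1)

val approxSample = keyed.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)
val exactSample = keyed.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)

// countByKey collects a small key -> count map to the driver.
val population = keyed.countByKey()
val targets = fractions.map { case (k, rate) => k -> math.ceil(rate * population(k)).toLong }

println(s"population:       ${population.toSeq.sorted}")
println(s"target per key:   ${targets.toSeq.sorted}")
println(s"sampleByKey:      ${approxSample.countByKey().toSeq.sorted}")
println(s"sampleByKeyExact: ${exactSample.countByKey().toSeq.sorted}")
```

The `sampleByKeyExact` counts are the ones expected to line up with the targets. The `sampleByKey` counts typically land near them but can differ from run to run when the seed changes.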

The sample method creates a random sample containing approximately the specified fraction of the records. Next, we create a sample with replacement, containing about 10% of the total records:
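A sketch of such a call on a hypothetical RDD of 1,000 integers; the seed is an assumption added so that repeated runs are comparable.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("simple-sample-sketch")
  .getOrCreate()

// Hypothetical population of 1,000 records.
val rdd = spark.sparkContext.parallelize(1 to 1000)

// With replacement, fraction is the expected number of times each record is drawn,
// so the sample holds roughly 10% of the population and may contain duplicates.
val sample10 = rdd.sample(withReplacement = true, fraction = 0.1, seed = 42L)

println(s"population = ${rdd.count()}, sample = ${sample10.count()}")
```

Unlike `sampleByKeyExact`, `sample` does not guarantee an exact sample size, so the printed sample count is usually close to, but not exactly, 100.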

Other statistical operations, such as hypothesis testing, random data generation, and visualizing probability distributions, will be covered in later chapters. In the next section, we will use Spark SQL to create pivot tables for exploring our data.