- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:48
Sampling with the DataFrame/Dataset API
We can use sampleBy to create a stratified sample without replacement. For each value of the stratifying column, we specify the fraction of its rows to be selected in the sample.
We then print the size of the sample and the number of records of each type.
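To make the behavior concrete, here is a minimal plain-Python sketch of per-stratum sampling without replacement. It is not Spark code; it only models the semantics we rely on, namely that each row is kept independently with the probability assigned to its stratum, and that strata with no fraction are treated as having a fraction of zero. The data, the column name `type`, and the helper `sample_by` are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability set for its stratum."""
    rng = random.Random(seed)
    # Strata missing from `fractions` get probability 0, so they are dropped.
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]


# Hypothetical data: 1,000 rows spread over three record types.
rows = [{"id": i, "type": ("A", "B", "C")[i % 3]} for i in range(1000)]
fractions = {"A": 0.1, "B": 0.5}  # "C" is omitted on purpose

# In PySpark the corresponding call would be:
#   df.sampleBy("type", fractions, seed=42)
sample = sample_by(rows, "type", fractions, seed=42)
counts = Counter(row["type"] for row in sample)

print("sample size:", len(sample))
print("records per type:", dict(counts))
```

Because every row is an independent draw, the per-type counts land near, but not exactly on, the requested fractions, and no row can appear twice.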

Next, we create a sample with replacement that selects a fraction of the rows (10% of the total records) using a fixed random seed. Note that sample does not guarantee that the result contains exactly the requested fraction of the records in the Dataset; the fraction is an expected value. We also print the number of records of each type in the sample:
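The following plain-Python sketch illustrates why the size of such a sample is only approximately 10% of the input. It assumes the sampler draws, for every row, an independent Poisson-distributed number of copies whose mean is the requested fraction, which is how sampling with replacement in Spark is commonly described; treat that as an assumption rather than a statement about a specific Spark version. The helpers `poisson` and `sample_with_replacement` and the data are hypothetical.

```python
import math
import random
from collections import Counter


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def sample_with_replacement(rows, fraction, seed):
    """Emit each row a Poisson(fraction)-distributed number of times."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * poisson(rng, fraction))
    return out


# Hypothetical data: 1,000 rows spread over three record types.
rows = [{"id": i, "type": ("A", "B", "C")[i % 3]} for i in range(1000)]

# In PySpark the corresponding call would be:
#   df.sample(withReplacement=True, fraction=0.1, seed=7)
sample = sample_with_replacement(rows, 0.1, seed=7)
counts = Counter(row["type"] for row in sample)

print("sample size:", len(sample), "of", len(rows))
print("records per type:", dict(counts))
```

The total is a sum of independent random counts, so it fluctuates around 100 from seed to seed, and the same row may appear more than once. Fixing the seed makes a given run reproducible without making the size exact.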

In the next section, we will explore sampling methods using RDDs.