
  • Learning Spark SQL
  • Aurobindo Sarkar
  • 2021-07-02 18:23:48

Sampling with the DataFrame/Dataset API

We can use sampleBy to create a stratified sample without replacement. For each value of the stratification column, we specify the fraction of rows with that value to be selected in the sample.

We then print the size of the sample and the number of records of each type.
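The two steps above can be sketched roughly as follows. This is a minimal illustration, not the book's original listing: the toy DataFrame, the column name label, the fractions, and the seed value are all assumptions chosen for the example. It is written to be pasted into spark-shell or run as a script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists
// and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SampleBySketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: 1,000 rows, with "yes" on every fourth row and "no" elsewhere.
val df = (1 to 1000)
  .map(i => (i, if (i % 4 == 0) "yes" else "no"))
  .toDF("id", "label")

// Assumed per-stratum fractions: keep roughly 50% of "yes" rows and 10% of "no" rows.
// A stratum missing from this map is treated as fraction 0 and dropped.
val fractions = Map("yes" -> 0.5, "no" -> 0.1)

// Stratified sample without replacement, with a fixed seed.
val stratified = df.stat.sampleBy("label", fractions, 36L)

// Size of the sample, then the number of records of each type.
println(stratified.count())
stratified.groupBy("label").count().show()
```

Each row is kept or dropped independently with the probability given for its stratum, so the per-type counts will be close to, but usually not exactly equal to, the requested fractions.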

Next, we create a sample with replacement that selects a fraction of the rows (10% of the total records) using a random seed. Using sample is not guaranteed to provide the exact fraction of the total number of records in the Dataset. We also print out the number of records of each type in the sample.
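A sketch of this step, under the same assumptions as before (a hypothetical toy DataFrame with a label column and an arbitrary seed; none of these values come from the original listing), might look like this:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SampleSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: 1,000 rows, with "yes" on every fourth row and "no" elsewhere.
val df = (1 to 1000)
  .map(i => (i, if (i % 4 == 0) "yes" else "no"))
  .toDF("id", "label")

// Sample with replacement, targeting about 10% of the rows.
// The fraction is an expected rate, not an exact size, and with replacement
// the same row may appear more than once in the result.
val sampled = df.sample(withReplacement = true, fraction = 0.1, seed = 36L)

// Size of the sample, then the number of records of each type.
println(sampled.count())
sampled.groupBy("label").count().show()
```

If an exact sample size is required, one common workaround is to oversample slightly and then apply limit(n), at the cost of the result no longer being a purely random draw of the stated fraction.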

In the next section, we will explore sampling methods using RDDs.
