官术网_书友最值得收藏!

  • Mastering PostgreSQL 12
  • Hans Jürgen Sch?nig
  • 579字
  • 2021-08-20 10:00:34

Making use of ordered sets

Ordered sets are powerful features, but are not widely regarded as such and not widely known in the developer community. The idea is actually quite simple: data is grouped normally, and then the data inside each group is ordered given a certain condition. The calculation is then performed on this sorted data.

A classic example would be the calculation of the median.

The median is the middle value. If you are, for example, earning the median income, the number of people earning less and more than you are identical; 50% of people do better and 50% of people do worse.

One way to get the median is to take sorted data and move 50% into the dataset. This is an example of what the WITHIN GROUP clause will ask PostgreSQL to do:

test=# SELECT region, 
percentile_disc(0.5) WITHIN GROUP (ORDER BY production)
FROM t_oil
GROUP BY 1;
region | percentile_disc ----------------+----------------- Middle East | 1082 North America | 3054 (2 rows)

The percentile_disc function will skip 50% of the group and return the desired value.

Note that the median can significantly deviate from the average.

In economics, the deviation between the median and the average income can even be used as an indicator of social equality or inequality. The higher the median is compared to the average, the greater the income inequality. To provide more flexibility, the ANSI standard does not just propose a median function. Instead, percentile_disc allows you to use any value between 0 and 1.

The beauty is that you can even use ordered sets along with grouping sets, as shown in the following:

test=# SELECT region, 
percentile_disc(0.5) WITHIN GROUP (ORDER BY production)
FROM t_oil
GROUP BY ROLLUP (1);
region | percentile_disc ----------------+----------------- Middle East | 1082 North America | 3054 | 1696 (3 rows)

In this case, PostgreSQL will again inject additional lines into the result set.

As proposed by the ANSI SQL standard, PostgreSQL provides you with two percentile_ functions. The percentile_disc function will return a value that is really contained by the dataset, while the percentile_cont function will interpolate a value if no exact match is found. The following example shows how this works:

test=# SELECT percentile_disc(0.62) WITHIN GROUP (ORDER BY id),    
percentile_cont(0.62) WITHIN GROUP (ORDER BY id) FROM generate_series(1, 5) AS id;
percentile_disc | percentile_cont -----------------+----------------- 4 | 3.48 (1 row)

4 is a value that really exists—3.48 has been interpolated. The percentile_ functions are not the only ones provided by PostgreSQL. To find the most frequent value within a group, the mode function is available. Before showing an example of how to use the mode function, I have compiled a query telling us a bit more about the contents of the table:

test=# SELECT production, count(*) 
FROM t_oil
WHERE country = 'Other Middle East'
GROUP BY production
ORDER BY 2 DESC
LIMIT 4;
production | count ------------+-------
50 | 5
48 | 5
52 | 5
53 | 4

(4 rows)

Three different values occur exactly five times. Of course, the mode function can only give us one of them:

test=# SELECT country, mode() WITHIN GROUP (ORDER BY production) 
FROM t_oil
WHERE country = 'Other Middle East' GROUP BY 1;
country | mode --------------------+------ Other Middle East | 48 (1 row)

The most frequent value is returned, but SQL won't tell us how often the number actually shows up. It might be that the number only shows up once.

主站蜘蛛池模板: 罗甸县| 郴州市| 南陵县| 高密市| 仲巴县| 江油市| 余干县| 华安县| 上栗县| 城步| 东乡县| 喜德县| 南汇区| 什邡市| 芷江| 万宁市| 敦化市| 平凉市| 衡阳市| 神木县| 诸城市| 咸宁市| 磐石市| 吉木乃县| 闽清县| 邯郸县| 吴旗县| 德令哈市| 平南县| 石棉县| 两当县| 明溪县| 苍山县| 西贡区| 辉县市| 凯里市| 惠安县| 边坝县| 玉屏| 康马县| 凤凰县|