官术网_书友最值得收藏!

  • Mastering PostgreSQL 9.6
  • Hans Jurgen Schonig
  • 450字
  • 2021-07-09 19:57:22

Taking advantage of pg_trgm

To do fuzzy searching with PostgreSQL, you can add the pg_trgm extension. To activate the extension, just run the following instruction:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION

The pg_trgm extension is pretty powerful, and to show what it is capable of, I have compiled some sample data consisting of 2,354 names of villages and cities here in Austria, Europe.

Our sample data can be stored in a simple table:

test=# CREATE TABLE t_location (name text);
CREATE TABLE

My company website has all the data and PostgreSQL enables you to load the data directly:

test=# COPY t_location FROM PROGRAM 'curl www.cybertec.at/secret/orte.txt';
COPY 2354
Note that curl (a command-line tool to fetch data) has to be installed. If you don't have this tool, download the file normally and import it from your local filesystem.

Once the data has been loaded, it is possible to check out the content of the table:

test=# SELECT * FROM t_location LIMIT 4; 
name
--------------------------------
Eisenstadt
Rust
Breitenbrunn am Neusiedler See
Donnerskirchen
(4 rows)

If German is not your mother tongue, it will be impossible to spell the names of those locations without severe mistakes. Fortunately, pg_trgm will come to the rescue:

test=# CREATE EXTENSION pg_trgm; 
CREATE EXTENSION

The pg_trgm provides us with a distance operator, which computes the distance between two strings:

test=# SELECT 'abcde' <-> 'abdeacb'; 
?column?
----------
0.833333
(1 row)

The distance is a number between zero and one. The lower the number, the more similar two strings are.

How does that work? Trigrams take a string and dissect it into sequences of three characters each:

test=# SELECT show_trgm('abcdef'); 
show_trgm
-------------------------------------
{" a"," ab",abc,bcd,cde,def,"ef "}
(1 row)

Those sequences will then be used to come up with the distance you have just seen. Of course, the distance operator can be used inside a query to find the closest match:

test=# SELECT * FROM  t_location ORDER BY name <-> 'Kramertneusiedel' LIMIT 3; 
name
-----------------
Gramatneusiedl
Klein-Neusiedl
Potzneusiedl
(3 rows)

Gramatneusiedl is pretty close to Kramertneusiedel. It sounds similar and using a K instead of a G is a pretty common mistake. On Google, you will sometimes see Did you mean.... It is quite likely that Google is using n-grams here to do that.

In PostgreSQL, it is possible to use GiST to index on text using trigrams:

test=# CREATE INDEX idx_trgm ON t_location USING GiST(name GiST_trgm_ops); 
CREATE INDEX

pg_trgm provides us with the GiST_trgm_ops operator class designed to do similarity searches. The following listing shows that the index is used as expected:

test=# explain SELECT * FROM t_location ORDER BY name <-> 'Kramertneusiedel' LIMIT 5; 
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.14..0.58 rows=5 width=17)
-> Index Scan using idx_trgm on t_location
(cost=0.14..207.22 rows=2354 width=17)
Order By: (name <-> 'Kramertneusiedel'::text)
(3 rows)
主站蜘蛛池模板: 神木县| 屏东市| 通河县| 武冈市| 射洪县| 茶陵县| 浠水县| 衡阳市| 西乌珠穆沁旗| 富顺县| 瑞昌市| 海城市| 武胜县| 新宁县| 青河县| 阜南县| 正安县| 隆林| 平江县| 喀什市| 东丰县| 安平县| 宁夏| 威海市| 西和县| 吴忠市| 合山市| 通州区| 南郑县| 禹城市| 阳曲县| 东港市| 文水县| 旬邑县| 天台县| 犍为县| 准格尔旗| 阜新市| 穆棱市| 民权县| 桐城市|