官术网_书友最值得收藏!

Taking advantage of pg_trgm

To do fuzzy searching with PostgreSQL, you can add the pg_trgm extension. To activate this extension, just run the following command:

test=# CREATE EXTENSION pg_trgm;  
CREATE EXTENSION 

The pg_trgm extension is pretty powerful, and to show you what it is capable of, I have compiled some sample data consisting of 2,354 names of villages and cities here in Austria, Europe.

Our sample data can be stored in a simple table:

test=# CREATE TABLE t_location (name text); 
CREATE TABLE 

My company website has all the data available, and PostgreSQL allows you to load the data directly:

test=# COPY t_location FROM PROGRAM 
'curl https://www.cybertec-postgresql.com/secret/orte.txt'; COPY 2354
curl (a command-line tool for fetching data) has to be installed. If you don't have this tool, download the file normally and import it from your local filesystem.

Once the data has been loaded, it is possible to check out the contents of the table:

test=# SELECT * FROM t_location LIMIT 4; 
              name 
--------------------------------  
 Eisenstadt 
 Rust 
 Breitenbrunn am Neusiedler See 
 Donnerskirchen 
(4 rows) 

If German is not your mother tongue, it will be impossible to spell the names of those locations without severe mistakes.

 pg_trgm provides us with a distance operator that computes the distance between two strings:

test=# SELECT 'abcde' <-> 'abdeacb'; 
 ?column? 
---------- 
 0.833333 
(1 row)

The distance is a number between 0 and 1. The lower the number, the more similar the two strings are.

How does this work? Trigrams take a string and dissect it into sequences of three characters each:

test=# SELECT show_trgm('abcdef'); 
              show_trgm 
-------------------------------------
{" a"," ab",abc,bcd,cde,def,"ef "}
(1 row)

These sequences will then be used to come up with the distance you have just seen. Of course, the distance operator can be used inside a query to find the closest match:

test=# SELECT *
FROM t_location
ORDER BY name <-> 'Kramertneusiedel'
LIMIT 3; name
-----------------
Gramatneusiedl
Klein-Neusiedl
Potzneusiedl
(3 rows)

Gramatneusiedl is pretty close to Kramertneusiedel. It sounds similar, and using a K instead of a G is a pretty common mistake. On Google, you will sometimes see Did you mean. It is quite likely that Google is using n-grams here to do that.

In PostgreSQL, it is possible to use GiST to index on text using trigrams:

test=# CREATE INDEX idx_trgm ON t_location 
USING GiST(name GiST_trgm_ops); CREATE INDEX

pg_trgm provides us with the GiST_trgm_ops operator class, which is designed to do similarity searches. The following code shows that the index is used as expected:

test=# explain SELECT * 
FROM t_location
ORDER BY name <-> 'Kramertneusiedel'
LIMIT 5; QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.14..0.58 rows=5 width=17)
-> Index Scan using idx_trgm on t_location
(cost=0.14..207.22 rows=2354 width=17)
Order By: (name <-> 'Kramertneusiedel'::text)
(3 rows)
主站蜘蛛池模板: 元谋县| 湘阴县| 灵山县| 闽清县| 福建省| 高密市| 商水县| 南开区| 司法| 响水县| 赤城县| 紫金县| 汾西县| 桐乡市| 德阳市| 武强县| 都江堰市| 石台县| 临洮县| 缙云县| 同德县| 营口市| 惠东县| 田东县| 屏东市| 大同市| 泰兴市| 通州市| 石棉县| 叶城县| 太湖县| 海伦市| 舞阳县| 平度市| 灵川县| 芦山县| 蓝山县| 开阳县| 旅游| 库车县| 尼勒克县|