Do you ever have to go sifting through text?
Like. . . A lot of text?
Row after row, file after file.
How will you ever find that one little thing you need‽
In the SQL standard you have two types:
character (char)
character varying (varchar)
In Postgres, we have both of those, but we also have:
text
Just use text.
“While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs.”
"char"
is not char(1)
name
is reserved for system catalogs
We also have structured text.
xml
json
But not jsonb
And let's not even talk about character encodings. . .
¯\_(ツ)_/¯
like
& ilike
SIMILAR
regex col ~ E'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'
You need to know the correct spelling in advance
Words with the same root don't match (e.g. democracy, democracies)
Lack of scalability
btree indexes don't help much
. . . and expression indexes must match the search that's going to happen
pg_trgm is an extension for determining word similarity.
It divides each word in the string into three character segments (left padded with two spaces, and right padded with one).
It ignores non-alphanumeric characters.
CREATE EXTENSION pg_trgm;
SELECT show_trgm('spleling'), show_trgm('spelling!')
, similarity('spleling', 'spelling!')
, 'spleling' <-> 'spelling!' as distance
, show_limit();
show_trgm | {" s"," sp",eli,ing,lel,lin,"ng ",ple,spl}
show_trgm | {" s"," sp",ell,ing,lin,lli,"ng ",pel,spe}
similarity | 0.384615
distance | 0.615385
show_limit | 0.3
SELECT CASE when 'spleling' % 'spelling' then true else false END as default;
SELECT set_limit(.4) ;
SELECT CASE when 'spleling' % 'spelling' then true else false END as stricter;
default | t
set_limit | 0.4
stricter | f
Not case sensitive
Word order does not matter
But accents do
SELECT similarity('I AM SAM', 'Sam I am') ;
similarity | 1
SELECT similarity('e', 'è') ;
similarity | 0
If you've installed the pg_trgm extension, you will also be able to see massive speed increases for
like
, ilike
and regular expression comparisons by indexing with the GIN or GiST index operator class.
These indexes will not aid in an equality search, though. You can still use a btree index for this purpose.
GiST - Generalized Search Tree is lossy because the underlying data is hashed, and this requires a heap hit to guarantee results.
GIN - The Generalized Inverted Index is not lossy. And it's faster than GiST.
Although, GIN takes much longer to build, much longer to update, and is much larger on disk.
In fact, it's not even the text
type. Instead we get the tsvector type.
Eliminates case
Removes stop words (and, or, not, she, him, and hundreds of others)
Replaces synonyms, and takes word stems (elephant -> eleph)
Can (and should) be indexed with GIN or GiST
Custom ranking with weights & ts_rank
Many natural languages (just need the language's dictionary)
Array (or json) style contain operators @> <@
Combining search terms
CREATE TABLE namez(name text);
with letterz (letter) AS (
VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H')
,('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q')
,('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
)
INSERT INTO namez
SELECT surname || z1.letter || z2.letter
from names as s
CROSS JOIN letterz as z1
CROSS JOIN letterz as z2
;
INSERT 0 676000
Time: 834.743 ms
CREATE UNIQUE INDEX uq_namez_name on namez(upper(surname));
CREATE INDEX
Time: 10021.628 ms
SELECT count(*) from namez where surname % 'miller' ;
-- no index, pg_trgm CI operator
count | 1361
Time: 658.001 ms
SELECT count(*) from namez where surname ilike 'mill%' ;
-- no index, ilike
count | 1352
Time: 271.624 ms
CREATE INDEX ix_trgm_namez on namez USING gist (surname gist_trgm_ops);
CREATE INDEX
Time: 11230.830 ms
SELECT count(*) from namez where surname % 'miller' ;
count | 1361
Time: 64.845 ms
SELECT count(*) from namez where surname ilike 'mill%' ;
count | 1352
Time: 17.044 ms
with letterz (letter) AS (
VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H')
,('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q')
,('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
)
INSERT INTO namez
SELECT surname || z1.letter || z2.letter ||z3.letter || z4.letter
from names as s
CROSS JOIN letterz as z1
CROSS JOIN letterz as z2
CROSS JOIN letterz as z3
CROSS JOIN (SELECT letter from letterz where letter < 'I') as z4
;
INSERT 0 140608000
Time: 123886.457 ms
CREATE UNIQUE INDEX uq_namez_name on namez(upper(surname));
Time: 4391822.101 msANALYZE FULL namez;
SELECT count(*)
from pages
where content_vector @@ to_tsquery('elephant') ;
count | 8792
Time: 22767.201 ms
time cat x3 |grep 'elephant' |wc -l
3039
real 1m25.309s
This presentation was built in reveal.js