Searching Text in PostgreSQL

A Primer

@PhilVacca

Do you ever have to go sifting through text?

Like. . . A lot of text?

Row after row, file after file.

How will you ever find that one little thing you need‽

Postgres can do this!

The Basics

How text is stored

The Character Data Types

In the SQL standard you have two types:

character (char)
character varying (varchar)

In Postgres, we have both of those, but we also have:

text

Just use text.

From the 9.4 entry on character datatypes. . .

“While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs.”

And some weird ones:

"char" is not char(1)
name is reserved for system catalogs

And. . .

We also have structured text.

xml
json
But not jsonb

And let's not even talk about character encodings. . .

¯\_(ツ)_/¯

Querying

like & ilike
SIMILAR
regex
col ~ E'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

But these methods aren't always that good

You need to know the correct spelling in advance
Words with the same root don't match (e.g. democracy, democracies)
Lack of scalability

btree indexes don't help much

. . . and expression indexes must match the search that's going to happen

Getting around bad spleling with pg_trgm

pg_trgm is an extension for determining word similarity.

It divides each word in the string into three character segments (left padded with two spaces, and right padded with one).

It ignores non-alphanumeric characters.

CREATE EXTENSION pg_trgm;
SELECT show_trgm('spleling'), show_trgm('spelling!')
	, similarity('spleling', 'spelling!')
	, 'spleling' <-> 'spelling!' as distance
	, show_limit();
show_trgm  | {"  s"," sp",eli,ing,lel,lin,"ng ",ple,spl}
show_trgm  | {"  s"," sp",ell,ing,lin,lli,"ng ",pel,spe}
similarity | 0.384615
distance   | 0.615385
show_limit | 0.3

SELECT CASE when 'spleling' % 'spelling' then true else false END as default;
SELECT set_limit(.4) ;
SELECT CASE when 'spleling' % 'spelling' then true else false END as stricter;
default | t
set_limit | 0.4
stricter | f

Some facts about pg_trgm

Not case sensitive
Word order does not matter
But accents do

SELECT similarity('I AM SAM', 'Sam I am') ;
similarity | 1

SELECT similarity('e', 'è') ;
similarity | 0

Scalability

I will not be pushed, filed, stamped, indexed, briefed, debriefed or numbered.

Wait, indexed?

Ok, that one.

If you've installed the pg_trgm extension, you will also be able to see massive speed increases for like, ilike and regular expression comparisons by indexing with the GIN or GiST index operator class.

These indexes will not aid in an equality search, though. You can still use a btree index for this purpose.

Fancy new index types

GiST - Generalized Search Tree is lossy because the underlying data is hashed, and this requires a heap hit to guarantee results.

GIN - The Generalized Inverted Index is not lossy. And it's faster than GiST.

Although, GIN takes much longer to build, much longer to update, and is much larger on disk.

Full Text Search

Full text search is not the literal text

In fact, it's not even the text type. Instead we get the tsvector type.

Eliminates case
Removes stop words (and, or, not, she, him, and hundreds of others)
Replaces synonyms, and takes word stems (elephant -> eleph)
Can (and should) be indexed with GIN or GiST

Much, much more

Custom ranking with weights & ts_rank
Many natural languages (just need the language's dictionary)
Array (or json) style contain operators @> <@
Combining search terms

Performance

How does pg_trgm scale?

							CREATE TABLE namez(name text);

with letterz (letter) AS (
VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H')
    ,('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q')
    ,('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
)
INSERT INTO namez
SELECT surname || z1.letter || z2.letter
from names as s
CROSS JOIN letterz as z1
CROSS JOIN letterz as z2
;
INSERT 0 676000
Time: 834.743 ms

CREATE UNIQUE INDEX uq_namez_name on namez(upper(surname));
CREATE INDEX
Time: 10021.628 ms

SELECT count(*) from namez where surname % 'miller' ;
-- no index, pg_trgm CI operator
count | 1361
Time: 658.001 ms

SELECT count(*) from namez where surname ilike 'mill%' ;
-- no index, ilike
count | 1352
Time: 271.624 ms

CREATE INDEX ix_trgm_namez on namez USING gist (surname gist_trgm_ops);
CREATE INDEX
Time: 11230.830 ms

SELECT count(*) from namez where surname % 'miller' ;
count | 1361
Time: 64.845 ms

SELECT count(*) from namez where surname ilike 'mill%' ;
count | 1352
Time: 17.044 ms

Let's do it again, bigger

with letterz (letter) AS (
VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H')
    ,('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q')
    ,('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
)
INSERT INTO namez
SELECT surname || z1.letter || z2.letter ||z3.letter || z4.letter
from names as s
CROSS JOIN letterz as z1
CROSS JOIN letterz as z2
CROSS JOIN letterz as z3
CROSS JOIN (SELECT letter from letterz where letter < 'I') as z4
;
INSERT 0 140608000
Time: 123886.457 ms

CREATE UNIQUE INDEX uq_namez_name on namez(upper(surname));
Time: 4391822.101 ms
ANALYZE FULL namez;

Full text search

						SELECT count(*)
  from pages
 where content_vector @@ to_tsquery('elephant') ;
count | 8792
Time: 22767.201 ms

						time cat x3 |grep 'elephant' |wc -l
3039
real	1m25.309s

Thanks, #pgOpen!

This presentation was built in reveal.js