The Principle of Inverted Index

The inverted index originates from search engine technology and can be regarded as the cornerstone of search engines. Thanks to the inverted index technology, search engines can efficiently perform operations such as database searching and deletion. Before elaborating on the inverted index, we will first introduce the relevant forward index and compare the two.

Forward Index

In a search engine, the forward table uses the document ID as the keyword, and the table records the position information of each word in the document. When searching, the system will scan the word information in each document in the table until all documents containing the query keyword are found.

The structure of the forward table can be represented by the following box diagram:

+---------------------+
|     Forward Index Table      |
+---------------------+
|  Document ID  |  Position Information |
+---------------------+
|   Doc1   |  word1@3 |
|          |  word2@7 |
+---------------------+
|   Doc2   |  word1@2 |
|          |  word3@5 |
+---------------------+
|   Doc3   |  word2@4 |
|          |  word4@6 |
+---------------------+

This organization method has a relatively simple structure when building the index, and it is convenient to build and easy to maintain. Since the index is built based on documents, if a new document is added, only a new index block needs to be created for this document and attached to the end of the original index file; if a document is deleted, the index information corresponding to the document number can be directly found and deleted. However, when querying, all documents need to be scanned to ensure no omissions, which will greatly extend the retrieval time and lead to low retrieval efficiency.

Although the working principle of the forward table is very simple, its retrieval efficiency is too low, and it has little practical value unless in specific situations.

Inverted Index

The inverted table uses words or terms as keywords for indexing, and the record entries corresponding to the keywords in the table record all the documents in which this word or term appears.

The structure of the inverted table can be represented by the following box diagram:

+---------------------+
|     Inverted Index Table      |
+---------------------+
|  Keyword   |  Document List |
+---------------------+
|  word1    |  Doc1,Doc2 |
+---------------------+
|  word2    |  Doc1,Doc3 |
+---------------------+
|  word3    |  Doc2      |
+---------------------+
|  word4    |  Doc3      |
+---------------------+

Since the number of documents corresponding to each word or term changes dynamically, the establishment and maintenance of the inverted table are relatively complex. However, when querying, since all the documents corresponding to the query keyword can be obtained at once, the efficiency is higher than that of the forward table. In full-text retrieval, the fast response of retrieval is a crucial performance. Although the index establishment is relatively slow as it is carried out in the background, it will not affect the efficiency of the entire search engine.

The Gin Index in PostgreSQL

Overview

GIN is the abbreviation of Generalized Inverted Index, that is, the so-called inverted index. The values of the data types it processes are not atomic but composed of elements, which we call composite types. For example, in (hank, 15:3 21:4), it means that hank appears at the positions 15:3 and 21:4. The following will help us have a clearer understanding of the GIN index through specific examples.

GIN Index Structure

Physical Structure

The physical storage of the GIN index contains the following content:

Entry: An element in the GIN index, which can be regarded as a word position and can also be understood as a key.
Entry tree: A B-tree constructed on the Entry.
Posting list: A linked list of the physical positions (heap ctid, heap table row number) where an Entry appears.
Posting tree: A B-tree constructed on the linked list of the physical positions (heap ctid, heap table row number) where an Entry appears. So the KEY of the posting tree is ctid, and the KEY of the entry tree is the value of the indexed column.
Pending list: A temporary storage linked list of index tuples, used for insertion operations in the fastupdate mode.

From the above, it can be seen that the GIN index is mainly composed of the Entry tree and the posting tree (or posting list), where the Entry tree is the main structure tree of the GIN index, and the posting tree is the auxiliary tree.

The entry tree is similar to the b+tree, and the posting tree is similar to the b-tree.

In addition, both the entry tree and the posting tree are organized in an orderly manner according to the KEY.

Logical Structure

Logically, the GIN index can be regarded as a relation, and this relation has two structures:

Indexing only one column of the base table

key	value
key1	Posting list (or posting tree)
key2	Posting list (or posting tree)
…	…

Indexing multiple columns of the base table (composite, multi-column index)

column_id	key	value
Column1 num	key1	Posting list (or posting tree)
Column2 num	key1	Posting list (or posting tree)
Column3 num	key1	Posting list (or posting tree)
…	…	…

It can be seen from this that under this structure, for the same key in different columns of the base table, it will also be treated as a different key in the GIN index.

Full-text Retrieval

The main application field of GIN is to accelerate full-text search. Therefore, here we will introduce the GIN index using an example of full-text search.

Create a table, and doc_tsv is of the text search type, which can automatically sort and eliminate duplicate elements:

pg_study=# create table ts(doc text, doc_tsv tsvector);
CREATE TABLE

pg_study=# insert into ts(doc) values
  ('Can a sheet slitter slit sheets?'), 
  ('How many sheets could a sheet slitter slit?'),
  ('I slit a sheet, a sheet I slit.'),
  ('Upon a slitted sheet I sit.'), 
  ('Whoever slit the sheets is a good sheet slitter.'), 
  ('I am a sheet slitter.'),
  ('I slit sheets.'),
  ('I am the sleekest sheet slitter that ever slit sheets.'),
  ('She slits the sheet she sits on.');
INSERT 0 9

pg_study=# update ts set doc_tsv = to_tsvector(doc);
UPDATE 9

pg_study=# create index on ts using gin(doc_tsv);
CREATE INDEX

The structure of this GIN index is as follows. The black squares are the TID numbers, and the white ones are the words. Note that this is a single-linked list, which is different from the double-linked list of the B-tree:

+--------+     +--------+     +--------+
|  sheet |---->|  slit  |---->| slitter|
+--------+     +--------+     +--------+
   |             |             |
   v             v             v
+--------+   +--------+   +--------+
| (0,10) |   | (0,10) |   | (0,10) |
+--------+   +--------+   +--------+
   |             |             |
   v             v             v
+--------+   +--------+   +--------+
| (0,11) |   | (0,11) |   | (0,11) |
+--------+   +--------+   +--------+
   |             |             |
   v             v             v
   ...           ...           ...

Let's look at another example:

pg_study=# select ctid,doc, doc_tsv from ts;
  ctid  |                          doc                           |                         doc_tsv
--------+--------------------------------------------------------+---------------------------------------------------------
 (0,10) | Can a sheet slitter slit sheets?                       | 'sheet':3,6 'slit':5 'slitter':4
 (0,11) | How many sheets could a sheet slitter slit?            | 'could':4 'mani':2 'sheet':3,6 'slit':8 'slitter':7
 (0,12) | I slit a sheet, a sheet I slit.                        | 'sheet':4,6 'slit':2,8
 (0,13) | Upon a slitted sheet I sit.                            | 'sheet':4 'sit':6 'slit':3 'upon':1
 (0,14) | Whoever slit the sheets is a good sheet slitter.       | 'good':7 'sheet':4,8 'slit':2 'slitter':9 'whoever':1
 (0,15) | I am a sheet slitter.                                  | 'sheet':4 'slitter':5
 (0,16) | I slit sheets.                                         | 'sheet':3 'slit':2
 (0,17) | I am the sleekest sheet slitter that ever slit sheets. | 'ever':8 'sheet':5,10 'sleekest':4 'slit':9 'slitter':6
 (0,18) | She slits the sheet she sits on.                       | 'sheet':4 'sit':6 'slit':2
(9 rows)

It can be seen from the above that sheet, slit, and slitter appear in multiple rows, so there will be multiple TIDs. In this case, a TID list will be generated, and a separate B-tree will be generated for it.

The following statement can find out how many rows the word appears in.

pg_study=# select (unnest(doc_tsv)).lexeme, count(*) from ts
group by 1 order by 2 desc;
  lexeme  | count
----------+-------
 sheet    |     9
 slit     |     8
 slitter  |     5
 sit      |     2
 upon     |     1
 mani     |     1
 whoever  |     1
 sleekest |     1
 good     |     1
 could    |     1
 ever     |     1
(11 rows)

Query Example

How does the following query execute?

// Since the amount of data is small here, full table scan is disabled
pg_study=# set enable_seqscan TO off;
SET

pg_study=# explain(costs off)                                 
select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                             QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on ts
   Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx
         Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
(4 rows)

First, extract each word (query key) from the query: mani and slitter. This is completed by a special API, which will consider the strategies of data types and operators:

pg_study=# select amop.amopopr::regoperator, amop.amopstrategy
from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
where opc.opcname = 'tsvector_ops'
and opf.oid = opc.opcfamily
and am.oid = opf.opfmethod
and amop.amopfamily = opc.opcfamily
and am.amname = 'gin'
and amop.amoplefttype = opc.opcintype;
        amopopr        | amopstrategy
-----------------------+--------------
 @@(tsvector,tsquery)  |            1
 @@@(tsvector,tsquery) |            2
(2 rows)

In the B-tree containing the words, find the TID lists corresponding to the two keys:

mani: (0,2) slitter: (0,1), (0,2), (1,2), (1,3), (2,2)

Finally, for each found TID, call the consistency function in turn. This consistency function can determine whether the row pointed to by the TID meets the query conditions. Since the words in the query are connected by and, the returned row is only (0,2).

       |      |         |  consistency
       |      |         |    function
  TID  | mani | slitter | slit & slitter
-------+------+---------+----------------
 (0,1) |    f |       T |              f 
 (0,2) |    T |       T |              T
 (1,2) |    f |       T |              f
 (1,3) |    f |       T |              f
 (2,2) |    f |       T |              f

The result is:

pg_study=# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                     doc
---------------------------------------------
 How many sheets could a sheet slitter slit?
(1 row)

The Problem of Slow Update Speed

The insertion or update of data in the GIN index is very slow. Because each row usually contains many word elements to be indexed. Therefore, when adding or updating a row, we have to update the index tree a lot.

On the other hand, if multiple rows are updated at the same time, some of their word elements may be the same, so the total cost is less than the cost of updating documents row by row.

The GIN index has a fastupdate storage parameter, which we can specify when creating the index and update it later:

pg_study=# create index on ts using gin(doc_tsv) with (fastupdate = true);
CREATE INDEX

After enabling this parameter, the updates will accumulate in a separate unordered list. When this list is large enough or during vacuum (garbage collection), all accumulated updates will be operated on the index immediately. The "large enough" list is determined by the "gin_pending_list_limit" configuration parameter or the storage parameter with the same name when creating the index.

Partial Match Search

Query the doc starting with slit

pg_study=# select doc from ts where doc_tsv @@ to_tsquery('slit:*');
                          doc
--------------------------------------------------------
 Can a sheet slitter slit sheets?
 How many sheets could a sheet slitter slit?
 I slit a sheet, a sheet I slit.
 Upon a slitted sheet I sit.
 Whoever slit the sheets is a good sheet slitter.
 I am a sheet slitter.
 I slit sheets.
 I am the sleekest sheet slitter that ever slit sheets.
 She slits the sheet she sits on.
(9 rows)

The index can also be used to accelerate:

pg_study=# explain (costs off)
select doc from ts where doc_tsv @@ to_tsquery('slit:*');
                    QUERY PLAN
---------------------------------------------------
 Seq Scan on ts
   Filter: (doc_tsv @@ to_tsquery('slit:*'::text))
(2 rows)

The Frequency of Keywords

Generate some data:

fts=# alter table mail_messages add column tsv tsvector;
fts=# update mail_messages set tsv = to_tsvector(body_plain);
fts=# create index on mail_messages using gin(tsv);
fts=# \timing on

-- A total of 466125 pieces of data
fts=# select count(*) from mail_messages;
 count
--------
 356125
(1 row)

-- Here, instead of using unnest to count the number of times a word appears in a row, because the amount of data is relatively large, we use the ts_stat function to calculate
fts=# select word, ndoc
fts-# from ts_stat('select tsv from mail_messages')
fts-# order by ndoc desc limit 3;
 word  |  ndoc  
-------+--------
 wrote | 231174
 use   | 173833
 would | 157169
(3 rows)

Time: 11556.976 ms

For example, we query the word that rarely appears in the email information, such as "tattoo":

fts=# select word, ndoc from ts_stat('select tsv from mail_messages') where word = 'tattoo';
  word  | ndoc 
--------+------
 tattoo |    2
(1 row)

Time: 11236.340 ms

The number of times two words appear in the same row. There is only one row where wrote and tattoo appear at the same time

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo');
 count 
-------
     1
(1 row)

Time: 0.550 ms

Let's see how it is executed. As mentioned above, if we want to obtain the TID lists of two words, the search efficiency is obviously very low: because we have to traverse more than 200,000 values, and only one value is taken. However, through the statistical information, the algorithm can know that "wrote" appears frequently, while "tattoo" appears rarely. Therefore, the search for the word that is not frequently used will be executed, and then it will check whether "wrote" exists in the two retrieved rows. In this way, the query result can be obtained quickly:

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo');
 count 
-------
     1
(1 row)

Time: 0.419 ms

Limiting Query Results

One characteristic of the GIN index is that it can only return a bitmap and cannot return each TID one by one. Therefore, all the plans in this article use bitmap scan.

Compression Method

One of the advantages of GIN is its compression feature. Firstly, if the same word appears in multiple rows, it will only be stored once in the index. Secondly, TIDs are stored in an ordered manner in the index, which enables us to use a simple compression method: the next TID in the list is actually different from the previous TID; this number is usually very small, and compared with the complete six-byte TID, the number of bits required is much smaller.

Compare the sizes of different indexes:

Create a B-tree index: GIN is built on a different data type (tsvector instead of text), and this data type is smaller. At the same time, the B-tree will truncate the message to within 2K.

fts=# create index mail_messages_btree on mail_messages(substring(body_plain for 2048));
CREATE INDEX
Time: 32709.807 ms

Create a gist index:

fts=# create index mail_messages_gist on mail_messages using gist(tsv);
CREATE INDEX
Time: 14651.884 ms

Look at the sizes of gin, gist, and btree respectively:

fts=# select pg_size_pretty(pg_relation_size('mail_messages_tsv_idx')) as gin,
fts-#              pg_size_pretty(pg_relation_size('mail_messages_gist')) as gist,
fts-#              pg_size_pretty(pg_relation_size('mail_messages_btree')) as btree;
  gin   |  gist  | btree  
--------+--------+--------
 189 MB | 111 MB | 526 MB
(1 row)

Time: 2.961 ms

Since the GIN index saves more space, we can use the GIN index instead of the bitmap index during the process of migrating from Oracle to PostgreSQL. Generally, the bitmap index is used for fields with very few unique values, which is also very effective for GIN. Moreover, PostgreSQL can dynamically build a bitmap based on any index (including GIN).

Leapcell: The Best of Serverless Web Hosting

Finally, we recommend a platform that is most suitable for deploying web services: Leapcell

🚀 Build with Your Favorite Language

Develop effortlessly in JavaScript, Python, Go, or Rust.

🌍 Deploy Unlimited Projects for Free

Only pay for what you use—no requests, no charges.

⚡ Pay-as-You-Go, No Hidden Costs

No idle fees, just seamless scalability.

📖 Explore Our Documentation

🔹 Follow us on Twitter: @LeapcellHQ

Using PostgreSQL as a Search Engine? Yes, You Might Not Need Elasticsearch