What is an inverted index, and why should you care?

Indexes can have a significant impact on database performance. Let’s take a look at one type that’s especially important for searching text: the inverted index.

What is an inverted index?

In the context of databases, an inverted index is a type of index that stores a record of where search terms – such as words or numbers – are located in a table.

This concept may be easier to understand visually, so let’s take a look at a simple example. Imagine we have the following database table, which stores different written phrases (in this case, it’s a list of CockroachDB features):

id     content
101    ‘Multi cloud’
102    ‘Elastic scale’
103    ‘Multi region’
104    ‘Cloud native’

Below is an inverted index for that table. As you can see, this index lists the location of each word (called a token) in the table.

token      id
multi      101, 103
cloud      101, 104
elastic    102
scale      102
region     103
native     104
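
To make the structure concrete, here is a minimal Python sketch of how such an index could be built from the example table. It is only an illustration, not how any particular database implements it; the names rows and build_inverted_index are made up for this example, and tokens are produced by simply lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# The example table from above: id -> content.
rows = {
    101: "Multi cloud",
    102: "Elastic scale",
    103: "Multi region",
    104: "Cloud native",
}

def build_inverted_index(table):
    """Map each lowercased token to the sorted list of row ids that contain it."""
    index = defaultdict(set)
    for row_id, content in table.items():
        # Assumption: tokens are whitespace-separated words, lowercased.
        for token in content.lower().split():
            index[token].add(row_id)
    return {token: sorted(ids) for token, ids in index.items()}

inverted_index = build_inverted_index(rows)
for token, ids in inverted_index.items():
    print(token, ids)
```

Each key in inverted_index corresponds to one row of the token table above, and each value is the list of ids where that token appears.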

Why use inverted indexes?

Inverted indexes are used to facilitate more efficient full-text searches in a database.

Let’s look at that example table and index from the previous section again to illustrate how it works. Imagine we want to search our database for entries that include the word “multi.” We might use a SQL query like this:

SELECT * FROM table_name WHERE content ILIKE '%multi%';

If our table does not have an inverted index, this query will execute a full table scan. In other words, the database will read every single row to check whether the word “multi” appears in it.

In a table with only four rows, having a query that runs a full table scan isn’t a big problem. But imagine if that database had 10,000 rows, or a million rows. Checking every single row one by one is going to take a while! And in real-world databases, text content (whether it’s stored as a string, JSONB, or something else) is rarely limited to two words per row. In a large database containing large amounts of text, full table scans can quickly bottleneck database performance.

Inverted indexes allow text search to run much more efficiently. With an inverted index created, the database does not need to perform a full table scan. Instead, it can simply refer to the index entry for multi and immediately find that it appears in rows 101 and 103.

In the case of our example database above, this means it would end up reading three rows (the index entry and rows 101 and 103) to return our results, versus having to read four rows without the inverted index.

That’s already a slight efficiency improvement, and this is only a very simple example! In a large, real-world database, creating an inverted index may result in much larger efficiency gains when you’re running full-text searches.
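
The read counts described above can be reproduced with a small Python sketch. This is a simplified model under stated assumptions: each row and each index entry counts as one read, and the function names full_scan and index_lookup are hypothetical. Real databases account for I/O very differently.

```python
rows = {
    101: "Multi cloud",
    102: "Elastic scale",
    103: "Multi region",
    104: "Cloud native",
}

# The inverted index from the example above, written out by hand.
inverted_index = {
    "multi": [101, 103],
    "cloud": [101, 104],
    "elastic": [102],
    "scale": [102],
    "region": [103],
    "native": [104],
}

def full_scan(table, term):
    """Read every row and keep the ones that contain the term."""
    matches, reads = [], 0
    for row_id, content in table.items():
        reads += 1  # every row is read, matching or not
        if term in content.lower().split():
            matches.append(row_id)
    return matches, reads

def index_lookup(table, index, term):
    """Read one index entry, then only the rows it points to."""
    ids = index.get(term, [])
    reads = 1 + len(ids)  # one index entry plus each matching row
    matches = [row_id for row_id in ids if row_id in table]
    return matches, reads

print(full_scan(rows, "multi"))
print(index_lookup(rows, inverted_index, "multi"))
```

In this model the scan cost grows with the size of the table, while the lookup cost grows only with the number of matching rows.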

What are the downsides of inverted indexes?

The main downside to creating an inverted index is that, like any type of SQL index, it takes up additional storage and will slightly slow down writes. Writes slow down because when (for example) a row is committed to the database table, the new values also have to be added to the index and sorted accordingly.

This is generally a mild performance hit, and if your application queries text-based data like the above with any regularity, the minor performance drop you see with writes will be greatly overshadowed by the significant performance improvements you see with reads.

However, it’s still worth keeping in mind that adding another index isn’t always the right answer, and there will be some use cases – very write-heavy workloads, for example – where the write penalty incurred by adding an inverted index may not be worth the trade-off of improved performance on reads.
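
The write penalty can be pictured with a short Python sketch. It assumes a toy in-memory index where every token of a new row means one extra index update; insert_row is a hypothetical helper, and the new row's content is invented for the example.

```python
import bisect

rows = {101: "Multi cloud", 103: "Multi region"}
inverted_index = {"multi": [101, 103], "cloud": [101], "region": [103]}

def insert_row(table, index, row_id, content):
    """Store the row, then update the index. Returns how many index entries were touched."""
    table[row_id] = content  # the write itself
    touched = 0
    for token in set(content.lower().split()):
        postings = index.setdefault(token, [])
        bisect.insort(postings, row_id)  # keep each list of ids sorted
        touched += 1
    return touched

extra_writes = insert_row(rows, inverted_index, 105, "Multi active availability")
print(extra_writes)
print(inverted_index["multi"])
```

One row write here triggers one index update per distinct token, which is why rows with a lot of text cost more to insert once they are indexed.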

How to use inverted indexes

First, double-check that inverted indexes are supported with the database software and datatype you’re using. In CockroachDB, for example, the following datatypes can be stored in generalized inverted (GIN) indexes: JSONB, ARRAY, GEOMETRY, GEOGRAPHY, TSVECTOR (for full-text search), and STRING (using trigram indexes, which are a subtype of inverted index).

While our simple example uses whole words, this isn’t always the most efficient way to search text. Someone searching for the word “run,” for example, might also be interested in entries that include other forms of the verb such as “running” or “ran.” For this reason, depending on the specifics of your use case, it may be worth considering transforming the tokens you’ll use in your inverted index. Common methods for this include:

  • Stemming, which transforms words into their roots by cutting off the end. For example, converting instances of “running” to “run.”
  • Lemmatization, which is similar to stemming but reduces words to their dictionary entry, so it can also handle irregular forms (for example, both “running” and “ran” would be converted to “run”).
  • Removing stop words, which means getting rid of grammatically common but meaningless-without-context words such as “the”, “of”, “and”, etc.

There are a variety of ways to approach each of these methods. How you apply them to your inverted index will depend on your use case. Often, these tasks can be accomplished automatically – some or all of these methods may already be built into your database software.
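
As a rough illustration of the three methods, here is a toy Python sketch. Everything in it is an assumption made for the example: the stop-word list, the one-entry lemma table, and the suffix rules in toy_stem are deliberately tiny and are not what a real stemmer or any database uses.

```python
import re

# Hypothetical, deliberately small word lists for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "an", "is"}
LEMMAS = {"ran": "run"}  # irregular forms need a dictionary lookup

def toy_stem(word):
    """Strip a known suffix; real stemmers use far more rules than this."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final letter: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def normalize(text):
    """Lowercase, drop stop words, then lemmatize or stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        result.append(LEMMAS.get(token) or toy_stem(token))
    return result

print(normalize("Ran the tests and running the benchmarks"))
```

The tokens that come out of a step like this are what would be stored in the inverted index, and the same normalization would need to be applied to search terms so that the two sides match.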

In CockroachDB, for example, string (text) data can be converted to the TSVECTOR datatype to facilitate full-text search. This is accomplished with a built-in function, to_tsvector(), that automatically removes stop words and performs stemming as part of the conversion process.

Creating inverted indexes with SQL

Let’s look at how to create inverted indexes in relational databases. The exact syntax varies depending on the flavor of SQL your database uses. Here, we’re using CockroachDB syntax, which will look familiar to anyone who knows PostgreSQL (CockroachDB is Postgres-compatible, although it includes some advanced features and syntax Postgres does not).

To create an inverted index for an existing table:

CREATE INDEX index_name ON table_name USING GIN (column_to_index);

To create a trigram index for searching STRING data, you also need to specify an opclass for the index, like so:

CREATE INDEX index_name ON table_name USING GIN (column_to_index gin_trgm_ops);
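
To show what a trigram index holds, here is a minimal Python sketch. It assumes each word is lowercased and padded with two leading spaces and one trailing space before being cut into three-character pieces, which is the convention used by PostgreSQL's pg_trgm; how a given database tokenizes and searches internally may differ. The function names are hypothetical.

```python
import re
from collections import defaultdict

rows = {
    101: "Multi cloud",
    102: "Elastic scale",
    103: "Multi region",
    104: "Cloud native",
}

def word_trigrams(text):
    """All 3-character pieces of each padded, lowercased word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # assumed padding convention
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def build_trigram_index(table):
    """Map each trigram to the set of row ids whose content produces it."""
    index = defaultdict(set)
    for row_id, content in table.items():
        for gram in word_trigrams(content):
            index[gram].add(row_id)
    return index

def substring_search(table, index, pattern):
    """Use trigrams to narrow the candidates, then verify each candidate row."""
    pattern = pattern.lower()
    grams = set()
    for run in re.findall(r"[a-z0-9]+", pattern):
        grams.update(run[i:i + 3] for i in range(len(run) - 2))
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(table)  # pattern too short for trigrams: check every row
    return sorted(i for i in candidates if pattern in table[i].lower())

trigram_index = build_trigram_index(rows)
print(sorted(word_trigrams("cat")))
print(substring_search(rows, trigram_index, "multi"))
```

Because the tokens are fragments rather than whole words, this kind of index can narrow down substring matches such as '%ulti%' that a whole-word index could not.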

CockroachDB also allows for the creation of partial inverted indexes, which index only a specified subset of the data, like so:

CREATE TABLE table_name (
  id INT,
  data JSONB,
  INVERTED INDEX index_name (data) WHERE id > 10
);

The above statement would create an inverted index for table_name that only indexes the values in data if the corresponding id is higher than 10.
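
The effect of the partial index described above can be sketched in Python. This is a simplified model: the sample rows are invented, JSON-like data is represented as plain dictionaries, and each key/value pair is treated as one token.

```python
from collections import defaultdict

# Hypothetical sample rows: id -> JSON-like data.
rows = {
    5: {"plan": "free", "region": "us"},
    12: {"plan": "pro", "region": "eu"},
    27: {"plan": "pro", "region": "us"},
}

def build_partial_index(table, predicate):
    """Index (key, value) pairs, but only for rows whose id satisfies the predicate."""
    index = defaultdict(list)
    for row_id in sorted(table):
        if not predicate(row_id):
            continue  # rows outside the predicate never enter the index
        for key, value in table[row_id].items():
            index[(key, value)].append(row_id)
    return dict(index)

# Mirrors the WHERE id > 10 condition from the statement above.
partial_index = build_partial_index(rows, lambda row_id: row_id > 10)
print(partial_index[("plan", "pro")])
print(("plan", "free") in partial_index)
```

Row 5 is skipped entirely, so the index is smaller and cheaper to maintain, at the cost of being useful only for queries limited to the indexed subset.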

You can also create multi-column GIN indexes in CockroachDB, although there are some restrictions on how they can be used. The syntax is as follows:

CREATE TABLE users (
  profile_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_type STRING,
  user_profile JSONB,
  INVERTED INDEX (user_type, user_profile)
);
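
One way to picture a multi-column inverted index is as an index whose keys start with the ordinary column and end with a token from the inverted column. The Python sketch below models that idea only; the sample users, the integer ids (standing in for UUIDs), and the lookup helper are all hypothetical. It also assumes lookups supply an exact value for the leading column, a restriction worth checking against your database's documentation.

```python
from collections import defaultdict

# Hypothetical sample data; integer ids stand in for UUIDs.
users = [
    {"profile_id": 1, "user_type": "admin", "user_profile": {"team": "db", "tier": "pro"}},
    {"profile_id": 2, "user_type": "member", "user_profile": {"team": "db"}},
    {"profile_id": 3, "user_type": "admin", "user_profile": {"team": "web"}},
]

def build_multi_column_index(rows):
    """Key each entry on (user_type, profile key, profile value)."""
    index = defaultdict(list)
    for row in rows:
        for key, value in row["user_profile"].items():
            index[(row["user_type"], key, value)].append(row["profile_id"])
    return dict(index)

def lookup(index, user_type, key, value):
    """Find profiles by an exact user_type plus one key/value pair in the profile."""
    return index.get((user_type, key, value), [])

multi_index = build_multi_column_index(users)
print(lookup(multi_index, "admin", "team", "db"))
```

Putting user_type first means a search for admins on the db team goes straight to one entry instead of looking up everyone on the db team and filtering afterwards.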

Of course, creating the right sort of index is just the beginning. Learn more about how to use indexes to optimize your database’s performance.

About the author

Charlie Custer

Charlie is a former teacher, tech journalist, and filmmaker who’s now combined those three professions into writing and making videos about databases and application development (and occasionally messing with NLP and Python to create weird things in his spare time).
