Full text indexing and search is such a common feature of applications these days. Users expect to be able to find a restaurant, a product, a movie review, or any number of other things quickly by entering a few search terms into a user interface. Many of these apps are built using a relational database as the data store, so the apps aren't generally dedicated to search, but incorporate it as an added feature.
For developers using CockroachDB, full text search options don't yet exist within the database itself. CockroachDB does support Levenshtein and Soundex, and these can be combined in novel ways to support fuzzy matching of shorter data elements, but the effort to build out full text search is just beginning.
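To make that concrete, here's an illustrative sketch of the kind of fuzzy matching those two functions enable. The `users` table and its data are hypothetical; the functions themselves are the builtins mentioned above:

```sql
-- Hypothetical table: find rows whose name is "close to" the search term,
-- either by edit distance (Levenshtein) or by phonetic encoding (Soundex).
SELECT name
  FROM users
 WHERE levenshtein(lower(name), 'jon smith') <= 2
    OR soundex(name) = soundex('Jon Smith');
```

This works well for short strings like names or product codes, but it isn't a substitute for a real inverted index over long documents.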
Having said that, CockroachDB does support changefeeds, or Change Data Capture (CDC). Changefeeds allow you to configure an event stream based on changes to tables in the DB, and these can be emitted into a message queue (e.g. Kafka), an S3 bucket, or even just an HTTP(S) endpoint. Once configured, any INSERT, UPDATE, DELETE generates an event which can be acted on by another process. This process could even be something like Elasticsearch, which would provide that full text index and search capability. Back in August 2020, I mentioned this in a blog post on full text search in CockroachDB and I just realized I hadn’t yet found a demo of this integration. That’s the motivation for this post.
The goal of this example is to illustrate a pattern for integrating CockroachDB with Elasticsearch, so it’s simplified to just a single table which is oriented to storing data taken from URLs. In the real world, we’d probably have multiple tables with foreign key relationships.
I opted to run this in a VM deployed in Google Cloud Platform, in the us-central1 region, because CockroachDB Serverless is available in that region. Of the various types of endpoints currently available for CREATE CHANGEFEED, there isn’t one which lines up directly with what Elasticsearch expects as input, so I needed to create a Python Flask app which acts as the adapter between CockroachDB and Elasticsearch. I also decided to just run the Elasticsearch instance right there on that VM. Since encrypting all this communication seemed like a good idea, I ran an Nginx instance as a proxy for both the Flask app and Elasticsearch. Let’s Encrypt handles the SSL certificates. I used Google Domains for DNS so that my Let’s Encrypt setup would work. Finally, I kind of used this VM as a “base camp” and ran the HTML indexer from there as well.
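The adapter's job is essentially a format translation. The repo's `cdc_http.py` handles the real details; below is a simplified, hypothetical sketch of the core transformation, assuming the changefeed delivers newline-delimited JSON rows carrying `after` and `key` fields (an insert or update has a non-null `after`; a delete has `after: null`). The function name and the triple it returns are my own illustration, not code from the repo:

```python
import json

def changefeed_row_to_es_action(line):
    """Translate one changefeed JSON row into a (method, doc_id, body)
    triple for the Elasticsearch document API. Field names here are
    assumptions based on CockroachDB's wrapped envelope."""
    row = json.loads(line)
    # Build a stable document id from the primary key value(s).
    doc_id = "-".join(str(k) for k in row["key"])
    if row.get("after") is None:
        return ("DELETE", doc_id, None)    # row was deleted upstream
    return ("PUT", doc_id, row["after"])   # row was inserted or updated

# Example insert event and delete event:
ins = '{"after": {"uri": "https://example.com", "content": "hi"}, "key": ["https://example.com"]}'
dele = '{"after": null, "key": ["https://example.com"]}'
print(changefeed_row_to_es_action(ins)[0])   # PUT
print(changefeed_row_to_es_action(dele)[0])  # DELETE
```

The Flask app then just issues the corresponding HTTP request against the Elasticsearch document endpoint for each action.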
Here's a rundown of the setup on that VM:

- Created an e2-standard-2 VM with a 64 GB SSD, on the default network, where port 443 was open for HTTPS traffic
- sudo apt update, then sudo apt install postgresql-client
- git clone https://github.com/mgoddard/crdb-cdc-elastic.git
- sudo apt install python3-pip
- sudo pip3 install -r requirements.txt
- sudo apt install -y nginx
- Replaced /etc/nginx/nginx.conf with an edited version of this file, then restarted Nginx
- Saved the CockroachDB Serverless connection string in CC_cred.txt

Created the defaultdb Elasticsearch index (CockroachDB database maps to Elasticsearch index):

$ export ES_PASSWD="that password recorded earlier"
$ curl -XPUT -s -k -u elastic:$ES_PASSWD https://localhost:9200/defaultdb | jq

Started the Flask app, cdc_http.py, which adapts the changefeed events for Elasticsearch:

$ export ES_PASSWD="that password recorded earlier"
$ nohup ./cdc_http.py > cdc.log 2>&1 </dev/null &

Connected to the database (I use the psql CLI out of habit):

$ psql $( cat ./CC_cred.txt )
psql (14.4 (Ubuntu 14.4-0ubuntu0.22.04.1), server 13.0.0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_128_GCM_SHA256, bits: 128, compression: off)
Type "help" for help.
defaultdb=> CREATE TABLE docs
(
uri STRING PRIMARY KEY
, content STRING NOT NULL
, ts TIMESTAMP DEFAULT NOW()
);
CREATE TABLE
defaultdb=> SET CLUSTER SETTING kv.rangefeed.enabled = true;
SET CLUSTER SETTING
defaultdb=> CREATE CHANGEFEED FOR TABLE docs
INTO 'https://cdc.la-cucaracha.net/cdc/'
WITH full_table_name, updated;
job_id
--------------------
777046877208477698
(1 row)
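A note on the two options: updated adds a commit timestamp to each emitted row, and full_table_name uses the fully qualified table name (defaultdb.public.docs) rather than just docs. An INSERT into the docs table should produce an event roughly of this shape; this is illustrative, and the exact framing and field order may differ:

```json
{"after": {"uri": "https://example.com/page", "content": "page text ...", "ts": "2022-07-07T15:33:32"},
 "key": ["https://example.com/page"],
 "updated": "1657207136789336000.0000000000"}
```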
I’m running the following from the VM.
$ sudo pip3 install beautifulsoup4
$ export DB_CONN_STR=$( cat ../CC_cred.txt )
$ ./html_indexer.py https://www.cockroachlabs.com/blog/admission-control-in-cockroachdb/ https://www.cockroachlabs.com/blog/how-to-choose-db-index-keys/ https://www.cockroachlabs.com/docs/stable/example-apps https://www.cockroachlabs.com/docs/stable/multiregion-overview https://www.cockroachlabs.com/blog/can-i-scale/ https://www.cockroachlabs.com/blog/sigmod-2022-cockroachdb-multi-region-paper/ https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ https://github.com/cockroachdb/pebble https://www.cockroachlabs.com/blog/netflix-media-infrastructure/ https://www.cockroachlabs.com/blog/full-text-indexing-search/
Indexing uri https://www.cockroachlabs.com/blog/admission-control-in-cockroachdb/ now ...
Indexing uri https://www.cockroachlabs.com/blog/how-to-choose-db-index-keys/ now ...
Indexing uri https://www.cockroachlabs.com/docs/stable/example-apps now ...
Indexing uri https://www.cockroachlabs.com/docs/stable/multiregion-overview now ...
Indexing uri https://www.cockroachlabs.com/blog/can-i-scale/ now ...
Indexing uri https://www.cockroachlabs.com/blog/sigmod-2022-cockroachdb-multi-region-paper/ now ...
Indexing uri https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ now ...
Indexing uri https://github.com/cockroachdb/pebble now ...
Indexing uri https://www.cockroachlabs.com/blog/netflix-media-infrastructure/ now ...
Indexing uri https://www.cockroachlabs.com/blog/full-text-indexing-search/ now ...
Total time: 5.30757999420166 s
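The indexer itself is basically "fetch, strip the markup, INSERT". The repo's html_indexer.py uses BeautifulSoup for the middle step; a dependency-free sketch of just that text-extraction step, using only the Python standard library, might look like this (class and function names are my own):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text outside of script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<html><head><script>var x;</script></head>"
                   "<body><h1>Title</h1><p>Hello, world.</p></body></html>"))
# Title Hello, world.
```

The extracted text is what lands in the content column of the docs table, which in turn is what the changefeed ships toward Elasticsearch.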
defaultdb=> select uri, length(content), ts from docs order by 2 desc;
uri | length | ts
--------------------------------------------------------------------------------+--------+----------------------------
https://www.cockroachlabs.com/blog/admission-control-in-cockroachdb/ | 24985 | 2022-07-07 15:33:32.685422
https://www.cockroachlabs.com/docs/stable/multiregion-overview | 21300 | 2022-07-07 15:33:33.672853
https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ | 20184 | 2022-07-07 15:33:35.000319
https://www.cockroachlabs.com/blog/how-to-choose-db-index-keys/ | 16958 | 2022-07-07 15:33:32.970703
https://www.cockroachlabs.com/blog/can-i-scale/ | 14654 | 2022-07-07 15:33:34.196958
https://www.cockroachlabs.com/blog/full-text-indexing-search/ | 13387 | 2022-07-07 15:33:36.789336
https://www.cockroachlabs.com/blog/netflix-media-infrastructure/ | 12859 | 2022-07-07 15:33:36.247931
https://github.com/cockroachdb/pebble | 12504 | 2022-07-07 15:33:35.708953
https://www.cockroachlabs.com/blog/sigmod-2022-cockroachdb-multi-region-paper/ | 9454 | 2022-07-07 15:33:34.70964
https://www.cockroachlabs.com/docs/stable/example-apps | 9373 | 2022-07-07 15:33:33.307314
(10 rows)
The search script runs a “phrase query” based on the terms provided to it via command line arguments. The returned value, if there were any search hits, includes metadata about the search along with the values of the fields in the table as well as up to four highlighted matching snippets of up to 80 characters in length. Feel free to adjust the script to suit your needs.
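Assuming the script uses Elasticsearch's standard match_phrase and highlight DSL, the request body it builds is presumably something like the following sketch (the exact body in es_search.py may differ; the field name content matches the docs table column, and the 80/4 values mirror the snippet limits described above):

```python
import json

def phrase_query(terms, fragment_size=80, max_fragments=4):
    """Build an Elasticsearch phrase query with highlighting over 'content'."""
    return {
        "query": {"match_phrase": {"content": " ".join(terms)}},
        "highlight": {
            "fields": {
                "content": {
                    "fragment_size": fragment_size,        # max snippet length
                    "number_of_fragments": max_fragments,  # max snippet count
                }
            }
        },
    }

print(json.dumps(phrase_query(["database", "region"]), indent=2))
```

POSTing that body to the index's _search endpoint yields responses like the one shown below.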
$ export ES_PASSWD="ad7yMq3nrg+2bGz-QWe*"
$ ./es_search.py database region
{
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 1,
"total": 1
},
"hits": {
"hits": [
{
"_id": "public-docs-https://www.cockroachlabs.com/docs/stable/multiregion-overview",
"_ignored": [
"content.keyword"
],
"_index": "defaultdb",
"_score": 1.8865218,
"highlight": {
"content": [
"<em>Database</em> <em>region</em> is a geographic region in which a database operates.",
"You must choose a <em>database</em> <em>region</em> from the list of available cluster regions.",
"To add another <em>database</em> <em>region</em>, use the ALTER DATABASE ...",
"Each <em>database</em> <em>region</em> can only belong to one super region."
]
}
}
],
"max_score": 1.8865218,
"total": {
"relation": "eq",
"value": 1
}
},
"timed_out": false,
"took": 16
}
defaultdb=> delete from docs where uri = 'https://www.cockroachlabs.com/docs/stable/multiregion-overview';
DELETE 1
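Behind the scenes, that DELETE causes the changefeed to emit an event whose after field is null, which the adapter can translate into a delete of the corresponding Elasticsearch document. The event's shape would be roughly (illustrative):

```json
{"after": null, "key": ["https://www.cockroachlabs.com/docs/stable/multiregion-overview"]}
```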
$ ./es_search.py database region
{
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 1,
"total": 1
},
"hits": {
"hits": [],
"max_score": null,
"total": {
"relation": "eq",
"value": 0
}
},
"timed_out": false,
"took": 4
}
And the search returned no results, as expected.
That completes this little tour of CDC from CockroachDB to Elasticsearch. Looking back at that original blog post, I was a little cavalier in suggesting you could just wire CDC up to Elasticsearch for search, so having a working demo makes me feel better about the claim. It's been quite a while since I've done much work with Lucene, Solr, or Elasticsearch, and that ecosystem changes all the time, so my example here is most likely pretty basic. In any case, I hope it inspires you to explore this type of integration should the need arise.
Thank you for taking the time to read this over!