Cloud integration made easy: CockroachDB and Google Pub/Sub

Have you ever wanted to stream data from your database without having to install additional resources that require more time and energy to maintain?

Just recently, I spoke with a {person who shall remain nameless} about how they wanted to dump new data to a file, move it from one cloud provider to another, and then place it in storage for consumption by downstream systems. There was more to it, and it hurt my head to listen. I had only one question for {person to remain nameless}: “When did you start hating your fellow coworkers?” I mean, really… other people will have to maintain that mess.

The point I am trying to make, admittedly in a slightly snarky way, is that it’s important to simplify your IT assets and to avoid creating more hurdles or complexity.

Why use Google Pub/Sub?

Google’s Pub/Sub integration unlocks an entire suite of Google Cloud tools that can be used in conjunction with it, including BigQuery, Cloud Dataflow, Cloud Storage, Cloud Functions, Cloud Dataproc, Firebase, Eventarc, and more. It just makes sense to integrate with Pub/Sub to get data from CockroachDB for downstream systems.

Google’s Pub/Sub gives enterprises:

A no-ops, secure, scalable messaging or queue system
In-order and any-order at-least-once message delivery with pull and push modes
Data security through fine-grained access controls and always-on encryption
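At-least-once delivery means a subscriber can occasionally receive the same message more than once, so downstream consumers are usually written to be idempotent. The following is a minimal sketch of that idea in plain Python. It does not use the Pub/Sub client library, and the class name, message IDs, and payloads are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies each message once, even if the broker redelivers it."""

    seen_ids: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            # Redelivery: safe to acknowledge again, but do not re-apply.
            return False
        self.seen_ids.add(message_id)
        self.applied.append(payload)
        return True


# Simulated delivery stream (hypothetical data): message "m2" arrives twice.
deliveries = [
    ("m1", "row 1 inserted"),
    ("m2", "row 2 inserted"),
    ("m2", "row 2 inserted"),
]

consumer = IdempotentConsumer()
results = [consumer.handle(mid, body) for mid, body in deliveries]
print(results)           # [True, True, False]
print(consumer.applied)  # ['row 1 inserted', 'row 2 inserted']
```

In a real subscriber, the set of seen IDs would need to live somewhere durable (or the write itself would need to be an upsert), since an in-memory set is lost on restart.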

CockroachDB is a distributed SQL database built on a transactional and strongly-consistent key-value store. It scales horizontally; survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention; supports strongly-consistent ACID transactions; and provides a familiar SQL API for structuring, manipulating, and querying data. CockroachDB is typically deployed via our managed service and can be deployed easily as a single logical database across multiple regions (and even clouds).

To get started with our managed service, which is free for the first 30 days, check out the following article: How to get started with CockroachDB Dedicated. (However, if you prefer to deploy the database on-prem or on your own cloud resources, you can get started using the directions below for CockroachDB Self Hosted.)

In this post, we will cover the Change Data Capture (CDC) capabilities of CockroachDB. CDC provides efficient, distributed, row-level changefeeds into a configurable sink for downstream processing such as reporting, caching, or full-text indexing.

Combining the CDC capability with Google’s Pub/Sub opens up the entire Google Cloud platform, so that any application can subscribe to and receive change data feeds.
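To make "row-level changefeed" a bit more concrete, here is a small sketch of a downstream consumer that keeps a cache in sync with change events. It assumes each message is a JSON object with a `key` array (the primary key) and an `after` object holding the new row, with `after` set to null for a delete. The exact message layout depends on the changefeed's envelope and sink options, so treat the shape, the function name, and the sample rows as assumptions for illustration.

```python
import json


def apply_change(cache: dict, raw_message: str) -> None:
    """Apply one change event to an in-memory cache keyed by primary key."""
    event = json.loads(raw_message)
    key = tuple(event["key"])  # assumed: primary key values as a JSON array
    after = event["after"]     # assumed: new row state, or null on delete
    if after is None:
        cache.pop(key, None)   # row was deleted
    else:
        cache[key] = after     # row was inserted or updated


# Hypothetical change events for an accounts table.
messages = [
    '{"key": [1], "after": {"id": 1, "balance": 100}}',
    '{"key": [2], "after": {"id": 2, "balance": 50}}',
    '{"key": [1], "after": {"id": 1, "balance": 75}}',
    '{"key": [2], "after": null}',
]

cache = {}
for message in messages:
    apply_change(cache, message)

print(cache)  # {(1,): {'id': 1, 'balance': 75}}
```

This sketch applies events in arrival order. A production consumer would also need to think about duplicates and ordering across keys.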

In this post we walk you through how to integrate CockroachDB into your Google Cloud architecture.

CDC & Google Pub/Sub Tutorial Overview

Set up a CockroachDB Dedicated cluster
Create a GCP instance to run our queries from
Create a Google Pub/Sub topic
Create an account database and tables
Enable the CDC feature of CockroachDB and stream data to Google Pub/Sub
Stream data to Google Cloud Storage with Google Dataflow

Getting Started

Step 1: Create CockroachDB Dedicated cluster

Getting started with CockroachDB Dedicated is easy. In a few steps, you can get started on Google Cloud.

Navigate to: https://cockroachlabs.cloud/
Create an account.
Several options are pre-filled when you choose a plan, such as cloud provider (GCP), region, number of nodes (we recommend 3 at a minimum), and a dynamically created cluster name. For the sake of this article, we will not cover topics such as compute, storage, and VPC peering; the defaults for these settings are sufficient for this discussion.

Note: You can change the cluster name to something more meaningful, unless you like the preset names for your clusters.

To approve your cluster settings, click “Create cluster”.

Your cluster will be created in 10-15 minutes and the Connect dialog will display.

Once your cluster is ready, you will see a window with information on how to connect to your cluster.
The ‘Setup’ screen will walk you through setting the ‘IP Allowlist’, SQL user, and default database.
A. Click on ‘allowlist an IP’.
B. Choose your local IP (it should be listed as a default).
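If the allowlist concept is new to you, the idea is simply that the cluster only accepts connections from addresses inside the listed networks. Here is a rough sketch of that check using Python's standard `ipaddress` module. The function name is made up, and the addresses come from the reserved documentation ranges rather than any real cluster.

```python
import ipaddress


def is_allowed(client_ip: str, allowlist: list[str]) -> bool:
    """Return True if client_ip falls inside any CIDR range in the allowlist."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)


# Hypothetical allowlist with a single workstation address (/32 = one host).
allowlist = ["203.0.113.7/32"]

print(is_allowed("203.0.113.7", allowlist))    # True
print(is_allowed("198.51.100.20", allowlist))  # False
```

This is only an illustration of the matching logic; the actual allowlist is managed in the CockroachDB Cloud console as described above.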

For this blog, we are using the ‘For Mac’ option; Linux and Windows options are also available.

Copy each command and the connection string (on a separate tab) into a notepad for use later in this lab.

Note: Make sure to label each command with its purpose: CRDB client, CA cert, DB connect, and connection string.
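Before saving the connection string, it can help to sanity-check what it contains. The sketch below pulls apart a PostgreSQL-style connection URL with the standard library. The user, host, and certificate path are placeholders, not values from a real cluster; your actual string is whatever the Connect dialog shows.

```python
from urllib.parse import parse_qs, urlsplit


def describe_connection(conn_str: str) -> dict:
    """Split a PostgreSQL-style connection URL into its main parts."""
    parts = urlsplit(conn_str)
    query = parse_qs(parts.query)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "sslmode": query.get("sslmode", [None])[0],
    }


# Placeholder connection string (all values hypothetical).
example = (
    "postgresql://maxroach:REDACTED@example-cluster.example.com:26257/"
    "defaultdb?sslmode=verify-full&sslrootcert=/path/to/ca.crt"
)

info = describe_connection(example)
for name, value in info.items():
    print(f"{name}: {value}")
```

The password is deliberately left out of the returned dictionary so it does not end up in your notes or logs.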

Step 2: Configuring a Google Cloud Platform Account

Go to cloud.google.com and log in with your Google account. If you don’t have a Google account, please create a free trial account by following the instructions at this link.
If not already there, go to https://console.cloud.google.com/
Create a new project by selecting the following dropdown in the top left:
A new window will pop up. In it, select “New Project” in the top right.
Give your project a proper name. Let’s go with cockroach-cdc-demo, then click the “Create” button:
After your new project is created, go back to the project dropdown in the top left and select your new project name.
When the right project is selected, the name will change to reflect this in the dropdown in the top left of your console.

Step 3: Configure a GCE Ubuntu VM Instance on GCP

We will be creating a VM instance. The purpose of the VM will be to use the CockroachDB CLI to connect to the database, run queries, and maybe kick off a workload. Let’s go crazy and have fun!

Create an Ubuntu Linux GCE VM instance using the instructions given here. Note: Our instance is labeled ‘cli-instance’ for the purposes of this lab.

Choose N2 Machine Family Series: