Integrating OLTP and OLAP systems: Enhanced decision making with CockroachDB, BigQueryML and Vertex AI

In the evolving landscape of database technology, the distinction between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems is fundamental. OLTP systems, designed for managing transactional data, prioritize speed and reliability in data handling. In contrast, OLAP systems are structured to facilitate complex queries and analytics, analyzing massive quantities of data and offering deep insights into data patterns over time.

By integrating OLTP and OLAP systems and facilitating the immediate accessibility of transactional data, organizations can significantly enhance their capabilities around business intelligence, predictive analytics, and operational efficiency.

CockroachDB and Google Cloud BigQuery stand out in their respective domains of OLTP and OLAP for their scalability, reliability, and performance. This blog explores how enterprises can leverage CockroachDB as an OLTP database, utilize BigQuery for OLAP, and further enhance their data capabilities using BigQueryML and Vertex AI for machine learning.

Understanding CockroachDB for OLTP

The Essence of OLTP

Online Transaction Processing (OLTP) databases are at the heart of most mission-critical business operations, handling everything from customer orders to real-time inventory management. These systems are optimized for high transaction volumes, ensuring data integrity and speed while processing multiple concurrent transactions.

Why CockroachDB?

CockroachDB is a distributed SQL database designed for cloud services, offering a unique blend of traditional SQL database ease-of-use with the scalability and resilience of NoSQL systems. It excels in OLTP scenarios thanks to its:

  • Distributed Nature: Automatically replicates and distributes data across multiple nodes, ensuring high availability and fault tolerance.
  • Strong Consistency: Provides serializable isolation, the highest level of data consistency, ensuring accurate and reliable transaction processing.
  • Scalability: Effortlessly scales horizontally, allowing businesses to handle growing transaction volumes without compromising performance or consistency.
  • Operational Simplicity: Streamlines database management tasks like scaling, rebalancing, and recovery, significantly reducing operational overhead.
  • Survivability: CockroachDB is designed to survive node, availability zone and even cloud region failures with minimal latency disruption and no manual intervention. This level of resilience makes it exceptionally reliable for mission-critical applications.

By leveraging CockroachDB for OLTP, businesses can ensure that their transactional systems are not only robust and efficient but also ready to meet future demands. You can read more about CockroachDB’s architecture here.
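
To make the OLTP role concrete, here's a minimal sketch of the kind of transaction CockroachDB handles; the orders and inventory tables below are hypothetical and aren't part of the tutorial that follows.

-- Hypothetical order placement: CockroachDB runs transactions at
-- SERIALIZABLE isolation by default, so the stock decrement and the
-- order insert remain consistent even under heavy concurrency.
BEGIN;
UPDATE inventory SET quantity = quantity - 1
  WHERE product_id = 1234 AND quantity > 0;
INSERT INTO orders (user_id, product_id, status, created_at)
  VALUES (42, 1234, 'Processing', now());
COMMIT;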

Elevating Analytical Processing with BigQuery

The Significance of OLAP

Online Analytical Processing (OLAP) systems are designed for query-intensive data analysis, offering insights into historical data across various dimensions. Unlike OLTP systems that focus on fast transaction processing, OLAP is optimized for complex queries and analytics, enabling businesses to derive meaningful insights from their data.

BigQuery as an OLAP Solution

BigQuery stands as a leading OLAP solution, delivering unparalleled query performance, scalability, and operational efficiency for enterprise-grade data warehousing and analytics initiatives. BigQuery’s serverless architecture allows analysts and data scientists to run queries on massive datasets without managing underlying infrastructure, making it an ideal platform for exploring and analyzing transactional data processed by OLTP systems like CockroachDB.

Benefits of Using BigQuery

  • Speed and Scalability: BigQuery’s distributed architecture enables rapid query execution across petabytes of data, facilitating real-time analytics and decision-making.
  • Serverless: No infrastructure to manage means you can focus on analyzing data without worrying about database administration.
  • Integration Ecosystem: Seamless integration with other Google Cloud services, including AI and machine learning tools, enhances analytical capabilities.

By leveraging BigQuery for OLAP, organizations can unlock deep insights into their data, driving better business decisions and strategies.
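
For a flavor of the analytical queries BigQuery is built for, the sketch below aggregates monthly revenue from the public thelook_ecommerce dataset (the same dataset this tutorial uses later); the exact columns chosen are only illustrative.

-- Illustrative OLAP query: monthly item counts and revenue by order status.
SELECT
  DATE_TRUNC(DATE(created_at), MONTH) AS order_month,
  status,
  COUNT(*) AS items_sold,
  ROUND(SUM(sale_price), 2) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
GROUP BY order_month, status
ORDER BY order_month, status;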

Integrating CockroachDB with BigQuery via Pub/Sub

To harness the full power of transactional data for analytical processing, integrating CockroachDB with BigQuery is essential. This integration enables the seamless flow of data from the OLTP system (CockroachDB) to the OLAP system (BigQuery) for in-depth analysis.

A Practical Use Case

[Diagram: OLTP-to-OLAP workflow using CockroachDB, BigQuery, Vertex AI, and Gemini]

In the example below, we will create a pipeline to replicate data from CockroachDB to BigQuery using Google Cloud Pub/Sub. The data we'll be using is sales data from a fictional eCommerce clothing store called Atom Fashion. Atom Fashion uses CockroachDB as the OLTP database for its online sales. As new orders are placed, the sales data is replicated to BigQuery.

Once the data is in BigQuery, the Atom Fashion Marketing team will take advantage of the BigQuery Machine Learning (BQML) integration with Vertex AI, Google Cloud's comprehensive platform for AI, as well as Gemini Pro, Google's multimodal AI model. We'll ask the model to generate customized emails to a specific subset of customers, offering them a discount on future purchases based on their past purchase history.

Performing this email marketing campaign manually could take days or even weeks. However, using the combined power of CockroachDB and Google Cloud, you should be able to walk through this tutorial and create a successful custom email marketing campaign based on your most recent sales data in under an hour!

Leveraging BigQueryML & Vertex AI for Machine Learning

With data seamlessly flowing from CockroachDB to BigQuery, leveraging BigQueryML can unlock powerful predictive insights and automate decision-making processes. BigQueryML enables users to create and execute machine learning models directly within Google BigQuery using SQL queries, simplifying the path from data analytics to machine learning.

When it comes to elevating these capabilities, the integration between BigQueryML and Vertex AI, Google Cloud’s unified machine learning platform, takes it a step further. This integration provides an intuitive way for BigQueryML to handle communication with Vertex AI, thus providing a cohesive workflow for users. Vertex AI provides an end-to-end environment for building, deploying and scaling ML models while also offering tools that integrate smoothly with BigQueryML models.

BigQueryML supports various machine learning models, including linear regression for forecasting, logistic regression for classification, and k-means clustering for segmentation. Once you have your foundational model trained, Vertex AI allows you to refine and optimize these models with its advanced MLOps tools. With Vertex AI, you can also experiment with more complex models, automate the ML lifecycle, and deploy models to production with ease.
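
To make that concrete, here is a hedged sketch of how a model is trained in BigQueryML using nothing but SQL; the model name, feature choices, and label are illustrative and are not objects created later in this tutorial.

-- Illustrative BQML model: a logistic regression that predicts whether an
-- order item will be returned, trained directly in BigQuery with SQL.
CREATE OR REPLACE MODEL `cockroach-project.cockroach_pubsub_data.return_predictor`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['is_returned']) AS
SELECT
  sale_price,
  EXTRACT(DAYOFWEEK FROM created_at) AS order_day,
  IF(returned_at IS NOT NULL, 1, 0) AS is_returned
FROM `bigquery-public-data.thelook_ecommerce.order_items`;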

Step-by-Step Integration Guide

The instructions in this guide demonstrate how to build this integration using the Google Cloud command line interface. All of these steps can be executed using the Google Cloud UI as well.

1. Create a new Google Cloud Project (This step is optional)

If you’d like to keep your work from this tutorial separate from other work that you have in your Google Cloud Console, you may want to consider creating a separate Google Cloud Project for storing all of the objects that you’ll be creating in the following steps. The instructions for creating a new project can be found here. The examples in this document will use a project named cockroach-project. If you use a different project, please replace all references to cockroach-project in this document with your own project ID.

2. Create a CockroachDB Cluster on Google Cloud

In this tutorial, you can set up a fully managed CockroachDB cluster on Google Cloud. You can choose between two deployment options, CockroachDB Dedicated or CockroachDB Serverless. To set up your CockroachDB Dedicated cluster on Google Cloud, you can follow these steps, or you can follow the steps here to set up a CockroachDB Serverless cluster on Google Cloud. Once you have your database cluster configured, you can move to the next step to load data onto your CockroachDB cluster.

3. Load the sample data into CockroachDB

This tutorial uses data from a fictional eCommerce store. For detailed instructions on loading this data into CockroachDB, see Appendix A at the end of this document.

4. Enable Pub/Sub in CockroachDB

The next step is to configure CockroachDB to publish changes to a Google Cloud Pub/Sub topic. This involves setting up change data capture (CDC) on the tables you want to analyze, ensuring that every insert, update, or delete is captured in real time. Read the note below and then follow the instructions in the official Cockroach Labs documentation to set up Pub/Sub as the changefeed sink for CockroachDB.

Note: For this tutorial, we will be making the following changes to the instructions specified in the previous paragraph:

  • Since we will be providing our own data source for this tutorial, Step #3 in the instructions above (running the CockroachDB Movr application to load data) is not necessary.
  • We’ll be replacing the Topic and Subscription names with names that more accurately reflect our data:

| Old Name | New Name |
| --- | --- |
| Topic = movr-users | Topic = order_items_crdb |
| Subscription = movr-users-sub | Subscription = order_items_crdb_sub |

You’ll make these changes in Step #8 and Step #10 of the linked instructions.
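
For reference, the changefeed that the linked instructions walk you through looks roughly like the sketch below. Check the documentation for the exact sink URI parameters for your CockroachDB version; the region value is a placeholder and the base64-encoded service account credentials are elided here.

-- Changefeeds require rangefeeds; enable them if they aren't already:
SET CLUSTER SETTING kv.rangefeed.enabled = true;

-- Stream changes on order_items to the order_items_crdb Pub/Sub topic
-- (region is a placeholder; CREDENTIALS is your base64-encoded key).
CREATE CHANGEFEED FOR TABLE order_items
  INTO 'gcpubsub://cockroach-project?region=us-east1&topic_name=order_items_crdb&AUTH=specified&CREDENTIALS=...'
  WITH resolved;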

5. Create a BigQuery Dataset

In the following step, we will be creating a table in BigQuery to hold the data that will be replicated from CockroachDB. We’ll create a new dataset named cockroach_pubsub_data to keep the BigQuery objects created in this tutorial organized.

bq --location=US mk --dataset \
--default_table_expiration=0 \
cockroach-project:cockroach_pubsub_data

6. Create a Table for the Pub/Sub Data

Now that we have a dataset to keep us organized, we’ll create a table named order_items_data to hold the data that Pub/Sub will be writing to BigQuery. Each row from the CockroachDB table order_items will arrive as a single JSON object, and we will add a few additional columns for storing Pub/Sub metadata.

bq mk --table \
cockroach_pubsub_data.order_items_data \
subscription_name:STRING,\
message_id:STRING,\
publish_time:TIMESTAMP,\
data:STRING,\
attributes:STRING

7. Grant permissions for the Pub/Sub service account

The Pub/Sub service account needs to be granted permissions to interact with the new BigQuery dataset.

gcloud projects add-iam-policy-binding cockroach-project \
--member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/bigquery.dataEditor

gcloud projects add-iam-policy-binding cockroach-project \
--member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/bigquery.user

8. Configure a BigQuery Subscription

Create a subscription in Google Cloud Pub/Sub that consumes messages from the topic set up in Step #4 above (order_items_crdb). This subscription will write all of the changes made to the order_items table in CockroachDB directly into BigQuery.

gcloud pubsub subscriptions create order_items_crdb_bq_sub \
--project=cockroach-project \
--topic=order_items_crdb \
--bigquery-table=cockroach-project:cockroach_pubsub_data.order_items_data \
--write-metadata

9. Start the flow of data

In order to simulate the flow of data into the CockroachDB table order_items, we’ll insert a subset of the data that is in the order_items_stage table. Inserting the data into the order_items table will result in the changefeed that you created on the order_items table flowing all of the newly inserted rows to the order_items_crdb Pub/Sub topic. The subscription that you created in the previous step, order_items_crdb_bq_sub, will take all of the data from the Pub/Sub topic and insert it into the order_items_data table in BigQuery.

INSERT INTO public.order_items
SELECT *
FROM public.order_items_stage
WHERE id >= 1
  AND id < 1001;

As you continue your testing, you can run this statement multiple times using different ranges on the id column to simulate more order data flowing through the system.
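
To confirm that the replicated rows are actually landing in BigQuery, a quick sanity check along these lines can be run from the BigQuery console against the table created in Step 6:

-- Count the messages delivered by the Pub/Sub subscription and show
-- when the first and most recent messages arrived.
SELECT
  COUNT(*) AS rows_received,
  MIN(publish_time) AS first_message,
  MAX(publish_time) AS last_message
FROM `cockroach-project.cockroach_pubsub_data.order_items_data`;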

10. Running BQML with Gemini Pro on the Replicated Data

Now that we have the changefeed operational and order data flowing into BigQuery, we can take advantage of BigQuery’s integrated machine learning functionality (BQML) and the BQML integration with Vertex AI to create a marketing campaign that utilizes the power of Gemini Pro and Generative AI.

Prerequisite work:

  1. Open a notebook from the BigQuery console

[Screencap: Open a notebook in the Google BigQuery console]

  2. Create an external connection to Cloud Resources and name it crdb_bqml. An external connection allows BigQuery to call other GCP services, such as Vertex AI. Also install the jq package, which will be used later for extracting JSON data.

    Use the +Code button to add a new code block to the notebook:

    [Screencap: Create a code block in Google’s BigQuery notebook]

# Double check the data is in the dataset 
# Create a BigQuery Connection to use Gemini Pro, UPDATE THE PROJECT_ID

!bq mk \
  --connection \
  --location=US \
  --project_id=cockroach-project \
  --connection_type=CLOUD_RESOURCE crdb_bqml

# install jq for json extraction
!apt install jq

  3. Get the email address for the service account associated with the new connection:
# from the output, copy the service account email
!bq show --format=json --connection cockroach-project.us.crdb_bqml  | jq '.cloudResource.serviceAccountId'

The output will look similar to this:
bqcx-7504964XXXX-5odo@gcp-sa-bigquery-condel.iam.gserviceaccount.com

  4. Using the service account email from the previous step, assign the aiplatform.user role to the service account. This allows the service account to use Vertex AI services.
# Update the SERVICE_ACCOUNT field

!gcloud projects add-iam-policy-binding cockroach-project \
  --member='serviceAccount:<SERVICE ACCOUNT EMAIL FROM THE PREVIOUS STEP>' \
  --role='roles/aiplatform.user' \
  --condition=None

  5. Create a model that allows BigQuery to use the connection we created earlier for making calls to Gemini Pro.
# In the notebook, use the %%bigquery magic function at the top of the cell

%%bigquery

CREATE OR REPLACE MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`
REMOTE WITH CONNECTION `us.crdb_bqml`
OPTIONS (ENDPOINT = 'gemini-pro');

  6. Run a job to identify users who have purchased sunglasses and generate an email offering them a 5% discount on a second pair. This job may take a couple of minutes to run.
%%bigquery

WITH item_order AS (
SELECT
  json_extract(data, "$.Value.after.product_id") as product_id,
  json_extract(data, "$.Value.after.user_id") as user_id
FROM `cockroach-project.cockroach_pubsub_data.order_items_data`
),

stg_pre_ai AS (
  SELECT
    user_id,
    name as product_name,
    brand,
    first_name,
    last_name
  FROM item_order
  JOIN `bigquery-public-data.thelook_ecommerce.products` AS product
  ON item_order.product_id = CAST(product.id AS STRING)
  JOIN `bigquery-public-data.thelook_ecommerce.users` AS users
  ON item_order.user_id = CAST(users.id AS STRING)
  GROUP BY 1,2,3,4,5
)

SELECT
  ml_generate_text_llm_result,
  product_name,
  prompt
FROM
  ML.GENERATE_TEXT(
    MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`,
    (
      SELECT
        CONCAT('''You are an expert marketer from Atom Fashion working on an engaging marketing campaign. Your company uses an informal tone and makes references to the beach.

Your task is to generate an html email for a specific customer.

Follow these steps:

1. Identify from the context whether the customer has bought sunglasses. If they purchased sunglasses, generate an email following the steps below. If they bought a different product, do not generate an email and return NO_MATCH.
2. If the customer purchased sunglasses, congratulate them for their previous purchase and offer a 5% discount (code SUPERFUN5)
3. If the customer has not purchased sunglasses, return an empty html.
4. The email should not have more than 500 tokens in length.
5. Return the code as plain html code
6. The email should begin with <html> and end with </html>

Always think step by step.

context:\nProduct Name:''', product_name,'''\nCustomer Name:''', first_name, '''\n HTML CODE: ''') AS prompt,
        product_name, first_name, last_name
      FROM
        stg_pre_ai
      LIMIT 200
    ),
    STRUCT(
      0.2 AS temperature,
      500 AS max_output_tokens,
      true AS flatten_json_output
    )
  )
WHERE ml_generate_text_llm_result LIKE '%html%';

  7. Once your results have been returned (which may take a few minutes), press the “Prompt” button to see the results in an easier-to-read format.

[Screencap: Click the Prompt button in Google BigQuery’s notebook]

  8. Your results should look similar to this:

[Screencap: HTML output from Google BigQuery’s notebook]

  9. After copying the HTML document to a file and opening it with a web browser, the resulting email should look similar to this. Notice how the email is personalized in the style that we specified in the prompt. You’ll also see that the message content may be unique for each customer, while still following the guidelines that we specified.

[Example of HTML email output using Google BigQuery and Vertex AI]

Advantages of using CockroachDB, BigQueryML & Vertex AI

  • Robust Data Management: CockroachDB offers distributed SQL capabilities, ensuring high availability and consistency for transactional data.
  • BigQuery DataFrames: Leveraging BigQuery DataFrames enables efficient handling of large datasets, offering access to a rich set of APIs for analyzing data without having to move it out of BigQuery.
  • Seamless Analytics Integration: BigQueryML allows for direct machine learning model training within the data warehouse, streamlining the transition from analytics to insights. 
  • Advanced Machine Learning Operations: Vertex AI provides comprehensive tools for refining, deploying, and managing machine learning models, enhancing AI-driven solutions.
  • Operational Efficiency: The integration optimizes data flow from transactional processing to analytical insights and machine learning application, improving overall business operations.
  • Strategic Insight and Agility: Together, these technologies equip organizations with the means to swiftly adapt and innovate, leveraging data for strategic advantage.

Conclusion

The strategic integration of CockroachDB, Pub/Sub, BigQuery, and Vertex AI offers a powerful pathway for enterprises to extract actionable insights from their transactional data and act on them quickly. CockroachDB ensures robust, scalable, and consistent transactional data management, and its integration with the Google Cloud ecosystem sets the stage for deep analytical processing. This lays the foundation for building machine learning models with BigQueryML inside the data warehouse itself, requiring only SQL proficiency.

Vertex AI extends this capability by offering tools to refine, deploy, and scale machine learning models, thereby enhancing applications with predictive insights and AI-driven functionality. This approach not only streamlines the journey from the data collected at the initial transaction to analytical insight, but also allows enterprises to realize the full potential of their data. By adopting the capabilities of CockroachDB, BigQueryML, and Vertex AI, enterprises can make a significant leap in operational agility and strategic foresight.

Appendix A: Loading the sample data

If you’d like to try this example yourself, the data being used comes from the BigQuery public dataset bigquery-public-data.thelook_ecommerce. The data from the table bigquery-public-data.thelook_ecommerce.order_items will be exported from BigQuery to a .csv file and then imported into CockroachDB.

NOTE: The data in this table is regenerated daily, so expect some variation between the output in this document and the actual output that you receive.

Step 1: Create a Google Cloud Storage Bucket

Create a bucket in Google Cloud storage for the data that you will be exporting from the BigQuery Public Dataset.

gcloud storage buckets create gs://cockroach-sample-data --project=cockroach-project --location=US

Step 2: Export the data

Export the data from the BigQuery Public Dataset into the bucket that you created in the previous step.

bq extract --destination_format=CSV \
  bigquery-public-data:thelook_ecommerce.order_items \
  gs://cockroach-sample-data/order_items.csv

Step 3: Create a Service Account

In order for CockroachDB to be able to access the CSV file in your Google Cloud bucket, you need to create a Service Account with sufficient permissions for CockroachDB to access the data.

gcloud iam service-accounts create cockroach-bigquery-import \
  --project=cockroach-project

Step 4: Add Permissions to the new Service Account

Add the permissions required for CockroachDB to read the exported file from the bucket.

gsutil iam ch \
  serviceAccount:cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com:objectViewer,legacyBucketReader \
  gs://cockroach-sample-data

Step 5: Create and Download a Service Account Key

Create a JSON key file for your service account. This key file is used by CockroachDB for accessing the data in your Google Cloud Storage bucket. Running this command creates the key file cockroach_bigquery_export_key.json in the directory where you run it.

gcloud iam service-accounts keys create cockroach_bigquery_export_key.json --project=cockroach-project --iam-account=cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com

Step 6: Create a staging table in CockroachDB

Create a table to store the data that you’ll be importing by running the command in the SQL Shell or any other SQL editor you prefer.

CREATE TABLE public.order_items_stage (
  id INT8 NOT NULL,
  order_id INT8 NULL,
  user_id INT8 NULL,
  product_id INT8 NULL,
  inventory_item_id INT8 NULL,
  status STRING NULL,
  created_at TIMESTAMP NULL,
  shipped_at TIMESTAMP NULL,
  delivered_at TIMESTAMP NULL,
  returned_at TIMESTAMP NULL,
  sale_price FLOAT8 NULL,
  CONSTRAINT order_items_pkey PRIMARY KEY (id ASC)
);

Step 7: Base64 encode the key file

From a terminal window on your local machine, base64 encode the key file that you downloaded in Step 5.

On a Mac:

cat cockroach_bigquery_export_key.json | base64

On Linux: (to prevent the output lines from wrapping):

cat cockroach_bigquery_export_key.json | base64 -w 0

Step 8: Import the data

Import the data from your Google Cloud Storage bucket into the table that you created in Step 6, providing your base64-encoded key file for authentication. The base64-encoded key follows CREDENTIALS= in the IMPORT statement. The import may take a few minutes, and the table will be unavailable while the data is being imported.

IMPORT INTO public.order_items_stage (
  id, order_id, user_id, product_id, inventory_item_id, status, created_at, shipped_at, delivered_at, returned_at, sale_price)
CSV DATA ('gs://cockroach-sample-data/order_items.csv?AUTH=specified&CREDENTIALS=ewog......p9Cg==')
WITH skip = '1',
DETACHED;
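
Because the import runs with the DETACHED option, the statement returns a job ID right away rather than waiting for the load to finish. You can monitor the job from the SQL shell; the job ID shown below is only a placeholder.

-- List recent jobs and their status; look for the IMPORT job created above.
SHOW JOBS;

-- Or check a single job using the ID returned by the IMPORT statement (placeholder ID).
SHOW JOB 953986237967073281;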

Step 9: Create an empty table for replication

Create an empty table with the same structure as the order_items_stage table. This table will be configured for Change Data Capture and Pub/Sub in a later step.

CREATE TABLE public.order_items (LIKE public.order_items_stage INCLUDING ALL);
