Logical Data Replication Monitoring

On this page

Note:

This feature is in preview and subject to change. To share feedback and/or issues, contact Support.

Logical data replication is only supported in CockroachDB self-hosted clusters.

You can monitor logical data replication (LDR) using:

SHOW LOGICAL REPLICATION JOBS in the SQL shell to view a list of LDR jobs on the cluster.
The Logical Data Replication dashboard on the DB Console to view metrics at the cluster level.
Prometheus and Alertmanager to track and alert on LDR metrics.
Metrics export with Datadog.
Metrics labels to view metrics at the job level.

Note:

There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.

When you start an LDR stream, one job is created on each cluster:

The history retention job on the source cluster, which runs while the LDR job is active to protect changes in the table from garbage collection until they have been applied to the destination cluster. The history retention job is viewable in the DB Console or with SHOW JOBS. Any manual changes to the history retention job could disrupt the LDR job.
The logical replication job on the destination cluster. You can view the status of this job in the SQL shell with SHOW LOGICAL REPLICATION JOBS and the DB Console Jobs page.

SQL Shell

In the destination cluster's SQL shell, you can query SHOW LOGICAL REPLICATION JOBS to view the LDR jobs running on the cluster:

SHOW LOGICAL REPLICATION JOBS;

        job_id        | status  |          targets          | replicated_time
----------------------+---------+---------------------------+------------------
1012877040439033857   | running | {database.public.table}   | NULL
(1 row)

For additional detail on each LDR job, use the WITH details option:

SHOW LOGICAL REPLICATION JOBS WITH details;

        job_id        |  status  |            targets             |        replicated_time        |    replication_start_time     | conflict_resolution_type |                                      description
----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
  1010959260799270913 | running  | {movr.public.promo_codes}      | 2024-10-24 17:50:05+00        | 2024-10-10 20:04:42.196982+00 | LWW                      | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
  1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00        | 2024-10-21 17:54:20.797643+00 | LWW                      | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a

You can also use SHOW JOBS for general job details.

Responses

Field	Response
`job_id`	The job's ID. Use with `CANCEL JOB`, `PAUSE JOB`, `RESUME JOB`, `SHOW JOB`.
`status`	The job's current state. Possible values: `pending`, `paused`, `pause-requested`, `failed`, `succeeded`, `canceled`, `cancel-requested`, `running`, `retry-running`, `retry-reverting`, `reverting`, `revert-failed`. Refer to Jobs status for a description of each status.
`targets`	The fully qualified name of the table(s) that are part of the LDR job.
`replicated_time`	The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s).
`replication_start_time`	The start time of the LDR job.
`conflict_resolution_type`	The type of conflict resolution: `LWW` last write wins.
`description`	Description of the job including the replicating table(s) and the source cluster connection.

Recommended LDR metrics to track

Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is applied on the destination cluster. An LDR commit is when the job either applies a row successfully to the destination cluster or adds a row to the dead letter queue (DLQ).
- logical_replication.commit_latency-p50
- logical_replication.commit_latency-p99
Replication lag: How far behind the source cluster is from the destination cluster at a specific point in time. The replication lag is equivalent to RPO during a disaster. Calculate the replication lag with this metric. For example, time.now() - replicated_time_seconds.
- logical_replication.replicated_time_seconds
Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
- logical_replication.events_ingested
- logical_replication.events_dlqed

DB Console

In the DB Console, you can use:

The Metrics dashboard for LDR to view metrics for the job on the destination cluster.
The Jobs page to view the history retention job on the source cluster and the LDR job on the destination cluster

The metrics for LDR in the DB Console metrics are at the cluster level. This means that if there are multiple LDR jobs running on a cluster the DB Console will show the average metrics across jobs.

Metrics dashboard

You can use the Logical Data Replication dashboard of the destination cluster to monitor the following metric graphs at the cluster level:

Replication latency
Replication lag
Row updates applied
Logical bytes received
Batch application processing time: 50th percentile
Batch application processing time: 99th percentile
DLQ causes
Retry queue size

To track replicated time, ingested events, and events added to the DLQ at the job level, refer to Metrics labels.

Jobs page

On the Jobs page, select:

The Replication Producer in the source cluster's DB Console to view the history retention job.
The Logical Replication Ingestion job in the destination cluster's DB Console. When you start LDR, the Logical Replication Ingestion job will show a bar that tracks the initial scan progress of the source table's existing data.

Monitoring and alerting

Prometheus

You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the Monitor CockroachDB with Prometheus tutorial for steps to set up Prometheus.

Metrics labels

To view metrics at the job level, you can use the label option when you start LDR to add a metrics label to the LDR job. This enables child metric export, which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:

logical_replication.catchup_ranges_by_label
logical_replication.events_dlqed_by_label
logical_replication.events_ingested_by_label
logical_replication.replicated_time_by_label
logical_replication.scanning_ranges_by_label

To use metrics labels, ensure you have enabled the child metrics cluster setting:

SET CLUSTER SETTING server.child_metrics.enabled = true;

When you start LDR, include the label option:

CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name} 
ON 'external://{source_external_connection}' 
INTO TABLE {database.public.table_name} WITH label=ldr_job;

For a full reference on tracking metrics with labels, refer to the Child Metrics page.

Datadog

You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the Monitor CockroachDB Self-Hosted with Datadog.

Pricing

Contact us