Logical Data Replication Monitoring

On this page Carat arrow pointing down
Note:

This feature is in preview. This feature is subject to change. To share feedback and/or issues, contact Support.

You can monitor logical data replication (LDR) using:

Note:

There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.

When you start an LDR stream, one job is created on each cluster:

  • The history retention job on the source cluster, which runs while the LDR job is active to protect changes in the table from garbage collection until they have been applied to the destination cluster. The history retention job is viewable in the DB Console or with SHOW JOBS. Any manual changes to the history retention job could disrupt the LDR job.
  • The logical replication job on the destination cluster. You can view the status of this job in the SQL shell with SHOW LOGICAL REPLICATION JOBS and the DB Console Jobs page.

SQL Shell

In the destination cluster's SQL shell, you can query SHOW LOGICAL REPLICATION JOBS to view the LDR jobs running on the cluster:

icon/buttons/copy
SHOW LOGICAL REPLICATION JOBS;
        job_id        | status  |          targets          | replicated_time
----------------------+---------+---------------------------+------------------
1012877040439033857   | running | {database.public.table}   | NULL
(1 row)

For additional detail on each LDR job, use the WITH details option:

icon/buttons/copy
SHOW LOGICAL REPLICATION JOBS WITH details;
        job_id        |  status  |            targets             |        replicated_time        |    replication_start_time     | conflict_resolution_type |                                      description
----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
  1010959260799270913 | running  | {movr.public.promo_codes}      | 2024-10-24 17:50:05+00        | 2024-10-10 20:04:42.196982+00 | LWW                      | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
  1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00        | 2024-10-21 17:54:20.797643+00 | LWW                      | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a

You can also use SHOW JOBS for general job details.

Responses

Field Response
job_id The job's ID. Use with CANCEL JOB, PAUSE JOB, RESUME JOB, SHOW JOB.
status The job's current state. Possible values: pending, paused, pause-requested, failed, succeeded, canceled, cancel-requested, running, retry-running, retry-reverting, reverting, revert-failed.

Refer to Jobs status for a description of each status.
targets The fully qualified name of the table(s) that are part of the LDR job.
replicated_time The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. replicated_time is updated periodically (every 30s).
replication_start_time The start time of the LDR job.
conflict_resolution_type The type of conflict resolution: LWW last write wins.
description Description of the job including the replicating table(s) and the source cluster connection.
  • Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is applied on the destination cluster. An LDR commit is when the job either applies a row successfully to the destination cluster or adds a row to the dead letter queue (DLQ).
    • logical_replication.commit_latency-p50
    • logical_replication.commit_latency-p99
  • Replication lag: How far behind the source cluster is from the destination cluster at a specific point in time. The replication lag is equivalent to RPO during a disaster. Calculate the replication lag with this metric. For example, time.now() - replicated_time_seconds.
    • logical_replication.replicated_time_seconds
  • Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
    • logical_replication.events_ingested
    • logical_replication.events_dlqed

DB Console

In the DB Console, you can use:

  • The Metrics dashboard for LDR to view metrics for the job on the destination cluster.
  • The Jobs page to view the history retention job on the source cluster and the LDR job on the destination cluster

The metrics for LDR in the DB Console metrics are at the cluster level. This means that if there are multiple LDR jobs running on a cluster the DB Console will show the average metrics across jobs.

Metrics dashboard

You can use the Logical Data Replication dashboard of the destination cluster to monitor the following metric graphs at the cluster level:

  • Replication latency
  • Replication lag
  • Row updates applied
  • Logical bytes reviewed
  • Batch application processing time: 50th percentile
  • Batch application processing time: 99th percentile
  • DLQ causes
  • Retry queue size

To track replicated time, ingested events, and events added to the DLQ at the job level, refer to Metrics labels.

Jobs page

On the Jobs page, select:

  • The Replication Producer in the source cluster's DB Console to view the history retention job.
  • The Logical Replication Ingestion job in the destination cluster's DB Console. When you start LDR, the Logical Replication Ingestion job will show a bar that tracks the initial scan progress of the source table's existing data.

Monitoring and alerting

Prometheus

You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the Monitor CockroachDB with Prometheus tutorial for steps to set up Prometheus.

Metrics labels

To view metrics at the job level, you can use the label option when you start LDR to add a metrics label to the LDR job. This enables child metric export, which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:

  • logical_replication.replicated_time_seconds
  • logical_replication.events_ingested
  • logical_replication.events_dlqed
  • logical_replication.scanning_ranges
  • logical_replication.catchup_ranges

To use metrics labels, ensure you have enabled the child metrics cluster setting:

icon/buttons/copy
SET CLUSTER SETTING server.child_metrics.enabled = true;

When you start LDR, include the label option:

icon/buttons/copy
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name} 
ON 'external://{source_external_connection}' 
INTO TABLE {database.public.table_name} WITH label=ldr_job;

For a full reference on tracking metrics with labels, refer to the Child Metrics page.

Datadog

You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the Monitor CockroachDB Self-Hosted with Datadog.

See also


Yes No
On this page

Yes No