Essential Alerts for CockroachDB Advanced Deployments

Storage

Node storage capacity

A CockroachDB node will not be able to operate if there is no free disk space on a CockroachDB store volume.

Metric
capacity
capacity.available

Rule
Set alerts for each node:
WARNING: capacity.available/capacity is less than 0.30 for 24 hours
CRITICAL: capacity.available/capacity is less than 0.10 for 1 hour

Action
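
As a starting point for investigation, the current free-space fraction of each store can be checked directly in SQL. This is a minimal sketch that relies on the crdb_internal.kv_store_status table, an internal and unversioned interface:

    -- Free-space fraction per store; the alert rules above compare capacity.available/capacity.
    SELECT node_id,
           store_id,
           available,
           capacity,
           available::FLOAT / capacity AS available_fraction
    FROM crdb_internal.kv_store_status
    ORDER BY available_fraction ASC;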

SQL

Node not executing SQL

Send an alert when a node is not executing SQL despite having connections. sql.conns shows the number of connections as well as the distribution, or balancing, of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded.

Metric
sql.conns
sql.query.count

Rule
Set alerts for each node:
WARNING: sql.conns is greater than 0 while sql.query.count equals 0

Action
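
One way to see whether connections are balanced across the cluster is to count open sessions per gateway node. This is a minimal sketch that relies on the crdb_internal.cluster_sessions table, an internal and unversioned interface:

    -- Open SQL sessions per gateway node; a heavy skew suggests imbalanced connections.
    SELECT node_id, count(*) AS open_sessions
    FROM crdb_internal.cluster_sessions
    GROUP BY node_id
    ORDER BY open_sessions DESC;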

SQL query failure

Send an alert when the query failure count exceeds a user-determined threshold based on their application's SLA.

Metric
sql.failure.count

Rule
WARNING: sql.failure.count is greater than a threshold (based on the user’s application SLA)

Action

  • Use the Insights page to find failed executions and their error codes to troubleshoot, or use application-level logs, if instrumented, to determine the cause of the error.

SQL queries experiencing high latency

Send an alert when the query latency exceeds a user-determined threshold based on their application’s SLA.

Metric
sql.service.latency
sql.conn.latency

Rule
WARNING: (p99 or p90 of sql.service.latency plus average of sql.conn.latency) is greater than a threshold (based on the user’s application SLA)

Action
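
To see which statements contribute most to high service latency, the per-gateway statement statistics can be inspected. This is a minimal sketch that relies on the crdb_internal.node_statement_statistics table, an internal interface whose column names can vary by version:

    -- Statement fingerprints with the highest average service latency on the current gateway node.
    SELECT application_name,
           key AS statement_fingerprint,
           count,
           service_lat_avg
    FROM crdb_internal.node_statement_statistics
    ORDER BY service_lat_avg DESC
    LIMIT 10;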

Changefeeds

Note:

During rolling maintenance, changefeed jobs restart following node restarts. Mute changefeed alerts described in the following sections during routine maintenance procedures to avoid unnecessary distractions.

Changefeed failure

Changefeeds can suffer permanent failures (which the jobs system will not try to restart). Any increase in this counter should prompt investigation.

Metric
changefeed.failures

Rule
CRITICAL: changefeed.failures is greater than 0

Action

  1. If the alert is triggered during cluster maintenance, mute it. Otherwise, start the investigation with the following query:

    SELECT job_id,
           status,
           ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency",
           created,
           LEFT(description, 60),
           high_water_timestamp
    FROM crdb_internal.jobs
    WHERE job_type = 'CHANGEFEED'
      AND status IN ('running', 'paused', 'pause-requested')
    ORDER BY created DESC;
    
  2. If the cluster is not undergoing maintenance, check the health of sink endpoints. If the sink is Kafka, check for sink connection errors such as ERROR: connecting to kafka: path.to.cluster:port: kafka: client has run out of available brokers to talk to (Is your cluster reachable?).
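
The stored error message for a failed changefeed can also be surfaced with SHOW CHANGEFEED JOBS. This is a minimal sketch, assuming a version in which SHOW CHANGEFEED JOBS reports an error column:

    -- List failed changefeeds together with their recorded error message.
    SELECT job_id, status, error
    FROM [SHOW CHANGEFEED JOBS]
    WHERE status = 'failed';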

Frequent changefeed restarts

Changefeeds automatically restart in the case of transient errors. However, too many restarts outside of a routine maintenance procedure may be due to a systemic condition and should be investigated.

Metric
changefeed.error_retries

Rule
WARNING: changefeed.error_retries is greater than 50 for more than 15 minutes

Action
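
As a starting point for investigation, the status and last recorded error of each changefeed can be listed with SHOW CHANGEFEED JOBS. A minimal sketch:

    -- Recent changefeeds with their status, running status, and last recorded error.
    SELECT job_id, status, running_status, error
    FROM [SHOW CHANGEFEED JOBS]
    ORDER BY created DESC;

The Changefeed Restarts graph on the DB Console Metrics, Changefeeds dashboard shows the same restarts over time and helps distinguish a one-off burst from a sustained problem.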

Changefeed falling behind

A changefeed has fallen behind. This is determined by the end-to-end lag between a change being committed and that change being applied at the destination. This can be due to cluster capacity or changefeed sink availability.

Metric
changefeed.commit_latency

Rule
WARNING: changefeed.commit_latency is greater than 10 minutes
CRITICAL: changefeed.commit_latency is greater than 15 minutes

Action

  1. In the DB Console, navigate to Metrics, Changefeeds dashboard for the cluster and check the maximum values on the Commit Latency graph. Alternatively, individual changefeed latency can be verified by using the following SQL query:

    SELECT job_id,
           status,
           ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency",
           created,
           LEFT(description, 60),
           high_water_timestamp
    FROM crdb_internal.jobs
    WHERE job_type = 'CHANGEFEED'
      AND status IN ('running', 'paused', 'pause-requested')
    ORDER BY created DESC;
    
  2. Copy the job_id for the changefeed job with the highest changefeed latency and pause the job:

    PAUSE JOB 681491311976841286;
    
  3. Check the status of the pause request by running the query from step 1. If the job status is pause-requested, check again in a few minutes.

  4. After the job is paused, resume the job.

    RESUME JOB 681491311976841286;
    
  5. If the changefeed latency does not improve after these steps due to a lack of cluster resources or the unavailability of the changefeed sink, contact Support.

Changefeed has been paused a long time

Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection. This alert guards against an operational error in which a pause is inadvertently left in place.

Metric
jobs.changefeed.currently_paused

Rule
WARNING: jobs.changefeed.currently_paused is greater than 0 for more than 15 minutes
CRITICAL: jobs.changefeed.currently_paused is greater than 0 for more than 60 minutes

Action

  1. Check the status of each changefeed using the following SQL query:

    SELECT job_id,
           status,
           ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency",
           created,
           LEFT(description, 60),
           high_water_timestamp
    FROM crdb_internal.jobs
    WHERE job_type = 'CHANGEFEED'
      AND status IN ('running', 'paused', 'pause-requested')
    ORDER BY created DESC;
    
  2. If all the changefeeds have a status of running, one or more changefeeds may have run into an error and recovered. In the DB Console, navigate to Metrics, Changefeeds dashboard for the cluster and check the Changefeed Restarts graph.

  3. Resume paused changefeed(s) with the job_id using:

    RESUME JOB 681491311976841286;
    

Changefeed experiencing high latency

Send an alert when the maximum latency of any running changefeed exceeds a specified threshold, which is less than the gc.ttlseconds variable set in the cluster. This alert ensures that the changefeed progresses faster than the garbage collection TTL, preventing a changefeed's protected timestamp from delaying garbage collection.

Metric
changefeed.checkpoint_progress

Rule
WARNING: (current time minus changefeed.checkpoint_progress) is greater than a threshold (that is less than the gc.ttlseconds variable)

Action
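
As a starting point, the changefeed latency obtained with the query in the previous sections can be compared against the garbage collection TTL of the watched tables. This is a minimal sketch that checks gc.ttlseconds for the default replication zone; a database- or table-specific zone configuration, if set, takes precedence and should be checked instead:

    -- Show the garbage collection TTL (gc.ttlseconds) for the default replication zone.
    SHOW ZONE CONFIGURATION FROM RANGE default;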
