Platform
High CPU
A node with high CPU utilization (an overloaded node) has a limited ability to process the user workload and increases the risk of cluster instability.
Metric
sys.cpu.combined.percent-normalized
sys.cpu.host.combined.percent-normalized
Rule
Set alerts for each node for each of the listed metrics:
WARNING: Metric greater than 0.80 for 4 hours
CRITICAL: Metric greater than 0.90 for 1 hour
Action
Refer to CPU Usage and Workload Concurrency.
In the DB Console, navigate to Metrics, Hardware dashboard for the cluster and check for high values on the CPU Percent graph and the Host CPU Percent graph.
In the DB Console, navigate to Metrics, SQL dashboard for the cluster and check for high values on the Active SQL Statements graph. This graph shows the true concurrency of the workload, which may exceed the cluster capacity planning guidance of no more than 4 active statements per vCPU or core.
A persistently high CPU utilization of all nodes in a CockroachDB cluster suggests the current compute resources may be insufficient to support the user workload's concurrency requirements. If confirmed, the number of processors (vCPUs or cores) in the CockroachDB cluster needs to be adjusted to sustain the required level of workload concurrency. For a prompt resolution, either add cluster nodes or throttle the workload concurrency, for example, by reducing the number of concurrent connections to not exceed 4 active statements per vCPU or core.
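For a quick check of the workload's actual concurrency, the following is a minimal sketch that counts currently executing statements per gateway node so the result can be compared against the 4-active-statements-per-vCPU guidance; the node_id and phase columns assume the SHOW CLUSTER STATEMENTS output of recent CockroachDB versions.
-- Count statements currently executing on each gateway node.
SELECT node_id, count(*) AS active_statements
FROM [SHOW CLUSTER STATEMENTS]
WHERE phase = 'executing'
GROUP BY node_id
ORDER BY active_statements DESC;
Compare each node's count against 4 times the number of vCPUs or cores on that node.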
Hot node (hot spot)
Unbalanced utilization of CockroachDB nodes in a cluster may negatively affect the cluster's performance and stability, with some nodes getting overloaded while others remain relatively underutilized.
Metric
sys.cpu.combined.percent-normalized
sys.cpu.host.combined.percent-normalized
Rule
Set alerts for each of the listed metrics:
WARNING: The max CPU utilization across all nodes exceeds the cluster's median CPU utilization by 30% for 2 hours
Action
- Refer to Hot spots.
Node memory utilization
One node with high memory utilization is a cluster stability risk. High memory utilization is a prelude to a node's out-of-memory (OOM) crash — the process is terminated by the OS when the system is critically low on memory. An OOM condition is not expected to occur if a CockroachDB cluster is provisioned and sized per Cockroach Labs guidance.
Metric
sys.rss
Rule
Set alerts for each node:
WARNING: sys.rss greater than 0.80 for 4 hours
CRITICAL: sys.rss greater than 0.90 for 1 hour
Action
- Provision all CockroachDB VMs or machines with sufficient RAM.
Node storage performance
Under-configured or under-provisioned disk storage is a common root cause of inconsistent CockroachDB cluster performance and could also lead to cluster instability. Refer to Disk IOPS.
Metric
sys.host.disk.iopsinprogress
Rule
WARNING: sys.host.disk.iopsinprogress greater than 10 for 10 seconds
CRITICAL: sys.host.disk.iopsinprogress greater than 20 for 10 seconds
Action
- Provision enough storage capacity for CockroachDB data, and configure your volumes to maximize disk I/O. Refer to Storage and disk I/O.
Version mismatch
All CockroachDB cluster nodes should be running exactly the same executable, with an identical build label. This alert guards against an operational error where some nodes were not upgraded.
Metric
build.timestamp
Rule
Set alerts for each node:
WARNING: build.timestamp not the same across cluster nodes for more than 4 hours
Action
- Ensure all cluster nodes are running exactly the same CockroachDB version, including the patch release version number.
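As a spot check from SQL, the build running on each node can be compared directly; this sketch assumes the crdb_internal.gossip_nodes table exposes server_version and build_tag columns, as in recent CockroachDB versions.
-- The version and build tag reported by each node; all rows should match.
SELECT node_id, server_version, build_tag
FROM crdb_internal.gossip_nodes
ORDER BY node_id;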
High open file descriptor count
Send an alert when a cluster is getting close to the open file descriptor limit.
Metric
sys.fd.open
sys.fd.softlimit
Rule
Set alerts for each node:
WARNING: sys_fd_open / sys_fd_softlimit greater than 0.8 for 10 minutes
Action
- Refer to File descriptors limit.
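For an ad-hoc check on the node you are connected to, the file descriptor metrics can also be read from SQL; this sketch assumes the crdb_internal.node_metrics table, which reports metrics for the gateway node only.
-- Open file descriptors and the soft limit on the gateway node.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN ('sys.fd.open', 'sys.fd.softlimit');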
Storage
Node storage capacity
A CockroachDB node will not be able to operate if there is no free disk space on a CockroachDB store volume.
Metric
capacity
capacity.available
Rule
Set alerts for each node:
WARNING: capacity.available / capacity is less than 0.30 for 24 hours
CRITICAL: capacity.available / capacity is less than 0.10 for 1 hour
Action
- Refer to Storage Capacity.
- Increase the size of CockroachDB node storage capacity. CockroachDB storage volumes should not be utilized more than 60% (40% free space).
- In a "disk full" situation, you may be able to get a node "unstuck" by removing the automatically created emergency ballast file.
Write stalls
A high storage.write-stalls value means CockroachDB is unable to write to disk within an acceptable time; the node is experiencing a disk latency issue and is not responding to writes.
Metric
storage.write-stalls
Rule
Set alerts for each node:
WARNING: storage.write-stalls per minute is greater than or equal to 1
CRITICAL: storage.write-stalls per second is greater than or equal to 1
Action
- Refer to Disk stalls.
Health
Node restarting too frequently
Send an alert if a node has restarted more than once in the last 10 minutes. Calculate this using the number of times the sys.uptime metric was reset back to zero.
Metric
sys.uptime
Rule
Set alerts for each node:
WARNING: sys.uptime resets greater than 1 in the last 10 minutes
Action
- Refer to Node process restarts.
Node LSM storage health
CockroachDB uses the Pebble storage engine that uses a Log-structured Merge-tree (LSM tree) to manage data storage. The health of an LSM tree can be measured by the read amplification, which is the average number of SST files being checked per read operation. A value in the single digits is characteristic of a healthy LSM tree. A value in the double, triple, or quadruple digits suggests an inverted LSM. A node reporting a high read amplification is an indication of a problem on that node that is likely to affect the workload.
Metric
rocksdb.read-amplification
Rule
Set alerts for each node:
WARNING: rocksdb.read-amplification greater than 50 for 1 hour
CRITICAL: rocksdb.read-amplification greater than 150 for 15 minutes
Action
- Refer to LSM Health.
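Read amplification can be spot-checked on the node you are connected to; this sketch assumes the crdb_internal.node_metrics table, which reports only the gateway node's stores.
-- Read amplification per store on the gateway node; a healthy LSM reports single-digit values.
SELECT store_id, name, value
FROM crdb_internal.node_metrics
WHERE name = 'rocksdb.read-amplification';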
Expiration of license and certificates
Enterprise license expiration
Monitor the enterprise license expiration to avoid any disruption to feature access.
Metric
seconds.until.enterprise.license.expiry
Rule
WARNING: seconds.until.enterprise.license.expiry is greater than 0 and less than 1814400 seconds (3 weeks)
CRITICAL: seconds.until.enterprise.license.expiry is greater than 0 and less than 259200 seconds (3 days)
Action
- Renew the enterprise license before it expires.
Security certificate expiration
Avoid security certificate expiration.
Metric
security.certificate.expiration.ca
security.certificate.expiration.client-ca
security.certificate.expiration.ui
security.certificate.expiration.ui-ca
security.certificate.expiration.node
security.certificate.expiration.node-client
Rule
Set alerts for each of the listed metrics:
WARNING: Metric is greater than 0 and less than 1814400 seconds (3 weeks) until certificate expiration
CRITICAL: Metric is greater than 0 and less than 259200 seconds (3 days) until certificate expiration
Action
Rotate the expiring certificates.
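Certificate expiration can also be inspected from SQL on the node you are connected to. This is a sketch that assumes the expiration metrics report the expiry time as seconds since the Unix epoch (0 when no certificate is loaded) and that crdb_internal.node_metrics exposes them for the gateway node.
-- Expiration timestamps of the certificates loaded by the gateway node.
SELECT name, (value::INT)::TIMESTAMP AS expires_at
FROM crdb_internal.node_metrics
WHERE name LIKE 'security.certificate.expiration.%' AND value > 0;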
KV distributed
During rolling maintenance or planned cluster resizing, the nodes' state and count will be changing. Mute KV distributed alerts described in the following sections during routine maintenance procedures to avoid unnecessary distractions.
Heartbeat latency
Monitor cluster health for early signs of instability. A heartbeat latency that exceeds 1 second is a sign of instability.
Metric
liveness.heartbeatlatency
Rule
WARNING: liveness.heartbeatlatency greater than 0.5s
CRITICAL: liveness.heartbeatlatency greater than 3s
Action
- Refer to Node liveness issues.
Live node count change
Send an alert when the liveness checks reported by a node are inconsistent with the rest of the cluster. The liveness.livenodes metric reports the number of live nodes in the cluster as seen by each node (0 if the node itself is not live). This is a critical metric for tracking the live nodes in the cluster.
Metric
liveness.livenodes
Rule
Set alerts for each node:
WARNING: max(liveness.livenodes) for the cluster - min(liveness.livenodes) for node > 0 for 2 minutes
CRITICAL: max(liveness.livenodes) for the cluster - min(liveness.livenodes) for node > 0 for 5 minutes
Action
- Refer to Node liveness issues.
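Which nodes the cluster currently considers live can be cross-checked from SQL; this sketch assumes the crdb_internal.gossip_nodes table exposes an is_live column, as in recent CockroachDB versions.
-- Liveness of each node as seen through gossip.
SELECT node_id, is_live
FROM crdb_internal.gossip_nodes
ORDER BY node_id;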
Intent buildup
Send an alert when very large transactions are locking millions of keys (rows). A common example is a transaction with a DELETE that affects a large number of rows. Transactions with an excessively large scope are often inadvertent, perhaps due to a non-selective filter and a specific data distribution that was not anticipated by an application developer.
Transactions that create a large number of write intents could have a negative effect on the workload's performance. These transactions may create locking contention, thus limiting concurrency. This would reduce throughput, and in extreme cases, lead to stalled workloads.
Metric
intentcount
Rule
WARNING: intentcount greater than 10,000,000 for 2 minutes
CRITICAL: intentcount greater than 10,000,000 for 5 minutes
For tighter transaction scope scrutiny, lower the intentcount threshold that triggers an alert.
Action
- Identify the large-scope transactions that acquire a lot of locks. Consider reducing the scope of large transactions by implementing them as several smaller-scope transactions. For example, if the alert is triggered by a large-scope DELETE, consider "paging" DELETEs that target thousands of records instead of millions (see the sketch after this list). This is often the most effective resolution; however, it generally means an application-level refactoring.
- After reviewing the workload, you may conclude that the possible performance impact of allowing transactions to take a large number of intents is not a concern. For example, a large delete of obsolete, not-in-use data may create no concurrency implications, and the elapsed time to execute that transaction may not be impactful. In that case, no response could be a valid way to handle this alert.
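The following is a minimal sketch of the "paging" approach, using a hypothetical events table and retention filter; the batch size is illustrative. Because CockroachDB supports LIMIT on DELETE, each statement locks at most one batch of rows.
-- Delete one bounded batch per statement; repeat until 0 rows are deleted.
DELETE FROM events
WHERE created_at < now() - INTERVAL '90 days'
LIMIT 10000;
Run the statement in a loop (in the application or a script) until it reports zero deleted rows, ideally with a short pause between batches.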
KV replication
Unavailable ranges
Send an alert when the number of ranges with fewer live replicas than needed for quorum is non-zero for too long.
Metric
ranges.unavailable
Rule
WARNING: ranges.unavailable
greater than 0
for 10 minutes
Action
- Refer to Replication issues.
Tripped replica circuit breakers
Send an alert when a replica stops serving traffic due to other replicas being offline for too long.
Metric
kv.replica_circuit_breaker.num_tripped_replicas
Rule
WARNING: kv.replica_circuit_breaker.num_tripped_replicas
greater than 0
for 10 minutes
Action
- Refer to Per-replica circuit breakers and Replication issues.
Under-replicated ranges
Send an alert when the number of ranges with replication below the replication factor is non-zero for too long.
Metric
ranges.underreplicated
Rule
WARNING: ranges.underreplicated
greater than 0
for 1 hour
Action
- Refer to Replication issues.
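The built-in replication reports can help narrow down which zones are affected. This is a sketch assuming the system.replication_stats table populated by the periodic replication reports.
-- Zones with ranges that are under-replicated or unavailable.
SELECT zone_id, total_ranges, under_replicated_ranges, unavailable_ranges
FROM system.replication_stats
WHERE under_replicated_ranges > 0 OR unavailable_ranges > 0;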
Requests stuck in raft
Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated.
Metric
requests.slow.raft
Rule
WARNING: requests.slow.raft
greater than 0
for 10 minutes
Action
- Refer to Raft and Replication issues.
SQL
Node not executing SQL
Send an alert when a node is not executing SQL despite having connections. sql.conns shows the number of connections as well as the distribution, or balancing, of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded.
Metric
sql.conns
sql.query.count
Rule
Set alerts for each node:
WARNING: sql.conns greater than 0 while sql.query.count equals 0
Action
- Refer to Connection Pooling.
SQL query failure
Send an alert when the query failure count exceeds a user-determined threshold based on their application's SLA.
Metric
sql.failure.count
Rule
WARNING: sql.failure.count is greater than a threshold (based on the user’s application SLA)
Action
- Use the Insights page to find failed executions and their error codes to troubleshoot, or use application-level logs, if instrumented, to determine the cause of the error.
SQL queries experiencing high latency
Send an alert when the query latency exceeds a user-determined threshold based on their application’s SLA.
Metric
sql.service.latency
sql.conn.latency
Rule
WARNING: (p99 or p90 of sql.service.latency plus average of sql.conn.latency) is greater than a threshold (based on the user’s application SLA)
Action
- Apply the time range of the alert to the SQL Activity pages to investigate. Use the Statements page P90 Latency and P99 latency columns to correlate statement fingerprints with this alert.
Backup
Backup failure
While CockroachDB is a distributed product, there is always a need to ensure backups complete.
Metric
schedules.BACKUP.failed
Rule
Set alerts for each node:
WARNING: schedules.BACKUP.failed is greater than 0
Action
- Refer to Backup and Restore Monitoring.
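Failed backups can be investigated with the same crdb_internal.jobs table used elsewhere on this page; this sketch lists recent failed backup jobs and their error messages.
-- Most recent failed BACKUP jobs and their errors.
SELECT job_id, status, created, error, LEFT(description, 60)
FROM crdb_internal.jobs
WHERE job_type = 'BACKUP' AND status = 'failed'
ORDER BY created DESC;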
Changefeeds
During rolling maintenance, changefeed jobs restart following node restarts. Mute changefeed alerts described in the following sections during routine maintenance procedures to avoid unnecessary distractions.
Changefeed failure
Changefeeds can suffer permanent failures (that the jobs system will not try to restart). Any increase in this metric counter should prompt investigative action.
Metric
changefeed.failures
Rule
CRITICAL: changefeed.failures is greater than 0
Action
If the alert is triggered during cluster maintenance, mute it. Otherwise, start the investigation with the following query:
SELECT job_id, status, ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency", created, LEFT(description, 60), high_water_timestamp FROM crdb_internal.jobs WHERE job_type = 'CHANGEFEED' AND status IN ('running', 'paused', 'pause-requested') ORDER BY created DESC;
If the cluster is not undergoing maintenance, check the health of sink endpoints. If the sink is Kafka, check for sink connection errors such as ERROR: connecting to kafka: path.to.cluster:port: kafka: client has run out of available brokers to talk to (Is your cluster reachable?).
Frequent changefeed restarts
Changefeeds automatically restart in case of transient errors. However, too many restarts outside of a routine maintenance procedure may be due to a systemic condition and should be investigated.
Metric
changefeed.error_retries
Rule
WARNING: changefeed.error_retries is greater than 50 for more than 15 minutes
Action
- Follow the action for a changefeed failure.
Changefeed falling behind
Send an alert when a changefeed falls behind, as determined by the end-to-end lag between a committed change and that change being applied at the destination. This can be due to cluster capacity or changefeed sink availability.
Metric
changefeed.commit_latency
Rule
WARNING: changefeed.commit_latency is greater than 10 minutes
CRITICAL: changefeed.commit_latency is greater than 15 minutes
Action
In the DB Console, navigate to Metrics, Changefeeds dashboard for the cluster and check the maximum values on the Commit Latency graph. Alternatively, the latency of individual changefeeds can be verified with the following steps:
1. Run the following SQL query:
SELECT job_id, status, ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency", created, LEFT(description, 60), high_water_timestamp FROM crdb_internal.jobs WHERE job_type = 'CHANGEFEED' AND status IN ('running', 'paused', 'pause-requested') ORDER BY created DESC;
2. Copy the job_id for the changefeed job with the highest changefeed latency and pause the job: PAUSE JOB 681491311976841286;
3. Check the status of the pause request by running the query from step 1. If the job status is pause-requested, check again in a few minutes.
4. After the job is paused, resume the job: RESUME JOB 681491311976841286;
If the changefeed latency does not progress after these steps due to lack of cluster resources or availability of the changefeed sink, contact Support.
Changefeed has been paused a long time
Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection. To protect against an operational error, this alert guards against an inadvertently forgotten pause.
Metric
jobs.changefeed.currently_paused
Rule
WARNING: jobs.changefeed.currently_paused is greater than 0 for more than 15 minutes
CRITICAL: jobs.changefeed.currently_paused is greater than 0 for more than 60 minutes
Action
Check the status of each changefeed using the following SQL query:
SELECT job_id, status, ((high_water_timestamp/1000000000)::INT::TIMESTAMP) - NOW() AS "changefeed latency", created, LEFT(description, 60), high_water_timestamp FROM crdb_internal.jobs WHERE job_type = 'CHANGEFEED' AND status IN ('running', 'paused', 'pause-requested') ORDER BY created DESC;
If all the changefeeds have a status of running, one or more changefeeds may have run into an error and recovered. In the DB Console, navigate to Metrics, Changefeeds dashboard for the cluster and check the Changefeed Restarts graph.
Resume any paused changefeed(s) with the job_id using: RESUME JOB 681491311976841286;
Changefeed experiencing high latency
Send an alert when the maximum latency of any running changefeed exceeds a specified threshold, which is less than the gc.ttlseconds variable set in the cluster. This alert ensures that the changefeed progresses faster than the garbage collection TTL, preventing a changefeed's protected timestamp from delaying garbage collection.
Metric
changefeed.checkpoint_progress
Rule
WARNING: (current time minus changefeed.checkpoint_progress) is greater than a threshold (that is less than the gc.ttlseconds variable)
Action
- Refer to Monitor and Debug Changefeeds.
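As a rough ad-hoc check, the job-level query used for the other changefeed alerts can be filtered by a lag threshold; this sketch uses a hypothetical 25-minute threshold, which should be set below the gc.ttlseconds configured for the watched tables (visible via SHOW ZONE CONFIGURATION).
-- Running changefeeds whose checkpoint lags the current time by more than the threshold.
SELECT job_id, status, NOW() - ((high_water_timestamp/1000000000)::INT::TIMESTAMP) AS "changefeed latency", LEFT(description, 60)
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED' AND status = 'running'
AND NOW() - ((high_water_timestamp/1000000000)::INT::TIMESTAMP) > INTERVAL '25 minutes';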