Data Resilience

CockroachDB provides built-in high availability (HA) features and disaster recovery (DR) tooling to achieve operational resiliency in various deployment topologies and use cases.

  • HA features maximize uptime by keeping data continuously accessible, even during failures or disruptions.
  • DR tools allow for recovery from major incidents to minimize downtime and data loss.

Diagram showing how HA features and DR tools create a resilient CockroachDB deployment.

You can balance required SLAs and recovery objectives with the cost and management of each of these features to build a resilient deployment.

  • Recovery Point Objective (RPO): The maximum amount of data loss (measured by time) that an organization can tolerate.
  • Recovery Time Objective (RTO): The maximum length of time it should take to restore normal operations following an outage.

Tip: For a practical guide on how CockroachDB uses Raft to replicate, distribute, and rebalance data, refer to the CockroachDB Resilience demo.

High availability

  • Multi-active availability: CockroachDB's built-in Raft replication stores data safely and consistently on multiple nodes to ensure no downtime even during a temporary node outage. Replication controls allow you to configure the number and location of replicas to suit a deployment.
  • Advanced fault tolerance: Capabilities built into CockroachDB that let you perform routine maintenance operations with minimal impact on foreground performance, for example, online schema changes and write-ahead log failover.
  • Logical data replication (LDR) (Preview): A cross-cluster replication tool between active CockroachDB clusters, which supports a range of topologies. LDR provides eventually consistent, table-level replication between the clusters. Individually, each active cluster uses CockroachDB multi-active availability to achieve low, single-region write latency with transactionally consistent writes using Raft replication.
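
As a rough illustration of why the number of replicas matters for multi-active availability: Raft-style replication needs a majority (quorum) of replicas to remain available, so the replica count determines how many simultaneous replica failures can be tolerated. The sketch below is plain Python arithmetic, not a CockroachDB API; the function names are hypothetical.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority among `replicas` voting replicas."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can be lost while a majority remains."""
    return replicas - quorum_size(replicas)


# Compare a few common replica counts.
for n in (3, 5, 7):
    print(f"{n} replicas: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this model, moving from 3 to 5 replicas raises the tolerated failures from 1 to 2, at the cost of storing and replicating more copies.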

Choose an HA strategy

Single-region replication (synchronous):

  • RPO: 0 seconds
  • RTO: Zero RTO; potential increased latency for 1-9 seconds
  • Write latency: Region-local write latency; p50 latency < 5ms (for multiple availability zones in us-east1)
  • Recovery: Automatic
  • Fault tolerance: Zero RPO for node and availability zone failures within a cluster
  • Minimum regions to achieve fault tolerance: 1

Multi-region replication (synchronous):

  • RPO: 0 seconds
  • RTO: Zero RTO; potential increased latency for 1-9 seconds
  • Write latency: Cross-region write latency; p50 latency > 50ms (for a multi-region cluster in us-east1, us-east-2, us-west-1)
  • Recovery: Automatic
  • Fault tolerance: Zero RPO for node, availability zone, and region failures within a cluster
  • Minimum regions to achieve fault tolerance: 3

Logical data replication (asynchronous):

  • RPO: 0.5 seconds (immediate mode)
  • RTO: Zero RTO; application traffic failover time
  • Write latency: Region-local latency depending on design; p50 latency < 5ms (for multiple availability zones in us-east1)
  • Recovery: Semi-automatic
  • Fault tolerance: Zero RPO for node and availability zone failures within a cluster; region failures with loss up to RPO in a two-region (or two-datacenter) setup
  • Minimum regions to achieve fault tolerance: 2

For details on designing your cluster topology for HA with replication, refer to the Disaster Recovery Planning page.
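
One way to reason about the comparison above is to encode it as data and filter by your constraints. The following is a minimal sketch, not an official tool: the `HAStrategy` class, its field names, and the `candidates` function are hypothetical, and the values are copied from the comparison (the LDR RPO is the immediate-mode figure).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAStrategy:
    name: str
    rpo_seconds: float             # RPO from the comparison above
    min_regions: int               # minimum regions to achieve fault tolerance
    survives_region_failure: bool  # derived from the fault tolerance entry
    recovery: str


HA_STRATEGIES = [
    HAStrategy("Single-region replication (synchronous)", 0.0, 1, False, "Automatic"),
    HAStrategy("Multi-region replication (synchronous)", 0.0, 3, True, "Automatic"),
    # LDR survives a region failure with data loss up to its RPO.
    HAStrategy("Logical data replication (asynchronous)", 0.5, 2, True, "Semi-automatic"),
]


def candidates(regions_available: int, max_rpo_seconds: float,
               need_region_survival: bool) -> list[HAStrategy]:
    """Return the strategies that fit the given constraints."""
    return [
        s for s in HA_STRATEGIES
        if s.min_regions <= regions_available
        and s.rpo_seconds <= max_rpo_seconds
        and (s.survives_region_failure or not need_region_survival)
    ]


# Example: two regions, up to 1 second of data loss, must survive a region failure.
for s in candidates(regions_available=2, max_rpo_seconds=1.0, need_region_survival=True):
    print(f"{s.name} (recovery: {s.recovery})")
```

This ignores write latency and cost, which in practice often decide between synchronous multi-region replication and LDR.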

Disaster recovery

  • Backup and point-in-time restore: Back up your cluster and restore it to a specific point in time. You can store backups with any of several supported cloud storage providers, and use incremental backups to increase backup frequency for a lower RPO.
  • Physical cluster replication (PCR): A cross-cluster replication tool between an active primary CockroachDB cluster and a passive standby CockroachDB cluster. PCR provides transactionally consistent full-cluster replication.

Choose a DR strategy

CockroachDB is designed to recover automatically; however, building backups or PCR into your DR plan protects against unforeseen incidents.

Point-in-time backup & restore:

  • RPO: >=5 minutes
  • RTO: Minutes to hours, depending on data size and number of nodes
  • Write latency: No impact
  • Recovery: Manual restore
  • Fault tolerance: Not applicable
  • Minimum regions to achieve fault tolerance: 1

Physical cluster replication (asynchronous):

  • RPO: 10s of seconds
  • RTO: Seconds to minutes, depending on cluster size and time of failover
  • Write latency: No impact
  • Recovery: Manual failover
  • Fault tolerance: Zero RPO node, availability zone within a cluster, region failures with loss up to RPO in a two-region (or two-datacenter) setup
  • Minimum regions to achieve fault tolerance: 2
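
To see how backup frequency relates to RPO, the sketch below uses a deliberately simplified model: it assumes the worst-case data loss for periodic backups is roughly the interval between backups plus the time a backup takes to complete. This is an assumption for illustration, not CockroachDB's own calculation, and the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Simplified worst case: a failure occurs just before the next backup completes."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the modeled worst-case loss stays within the RPO target."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo_target


target = timedelta(minutes=15)
for interval in (timedelta(minutes=5), timedelta(hours=1)):
    print(f"backups every {interval}: meets 15-minute RPO = {meets_rpo(interval, target)}")
```

If the RPO target is well under the minimum achievable with backups, the comparison above points toward PCR, whose RPO is in the tens of seconds.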
