CockroachDB provides built-in high availability (HA) features and disaster recovery (DR) tooling to achieve operational resiliency in various deployment topologies and use cases.
- HA features maximize uptime by ensuring continuous access to data, even in the presence of failures or disruptions.
- DR tools allow for recovery from major incidents to minimize downtime and data loss.
To build a resilient deployment, balance your required SLAs and recovery objectives against the cost and management overhead of each of these features.
- Recovery Point Objective (RPO): The maximum amount of data loss (measured by time) that an organization can tolerate.
- Recovery Time Objective (RTO): The maximum length of time it should take to restore normal operations following an outage.
For a practical guide on how CockroachDB uses Raft to replicate, distribute, and rebalance data, refer to the CockroachDB Resilience demo.
High availability
- Multi-active availability: CockroachDB's built-in Raft replication stores data safely and consistently on multiple nodes to ensure no downtime even during a temporary node outage. Replication controls allow you to configure the number and location of replicas to suit a deployment; see the example after this list.
- For more detail on planning for single-region or multi-region recovery, refer to Single-region survivability planning or Multi-region survivability planning.
- Advanced fault tolerance: Capabilities built into CockroachDB that perform routine maintenance operations with minimal impact on foreground performance, such as online schema changes and write-ahead log (WAL) failover.
- Logical data replication (LDR) (Preview): A cross-cluster replication tool between active CockroachDB clusters, which supports a range of topologies. LDR provides eventually consistent, table-level replication between the clusters. Individually, each active cluster uses CockroachDB multi-active availability to achieve low, single-region write latency with transactionally consistent writes using Raft replication.
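As a rough illustration of these replication controls, the following statements set a higher replication factor and a region-level survival goal. The database name `orders` and the region names are placeholders, and the multi-region statements assume the cluster was started with nodes in those regions.

```sql
-- Replicate each range of the "orders" database 5 ways (placeholder name),
-- so the data tolerates two simultaneous node failures.
ALTER DATABASE orders CONFIGURE ZONE USING num_replicas = 5;

-- For a multi-region cluster: add database regions (placeholder names) and
-- ask CockroachDB to place replicas so the database survives the loss of an
-- entire region.
ALTER DATABASE orders SET PRIMARY REGION "us-east1";
ALTER DATABASE orders ADD REGION "us-west1";
ALTER DATABASE orders ADD REGION "europe-west1";
ALTER DATABASE orders SURVIVE REGION FAILURE;
```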
Choose an HA strategy
| | Single-region replication (synchronous) | Multi-region replication (synchronous) | Logical data replication (asynchronous) |
|---|---|---|---|
| RPO | 0 seconds | 0 seconds | Immediate mode: 0.5 seconds |
| RTO | Zero RTO; potential increased latency for 1-9 seconds | Zero RTO; potential increased latency for 1-9 seconds | Zero RTO; application traffic failover time |
| Write latency | Region-local write latency; p50 latency < 5 ms (for multiple availability zones in us-east1) | Cross-region write latency; p50 latency > 50 ms (for a multi-region cluster in us-east1, us-east2, us-west1) | Region-local latency depending on design; p50 latency < 5 ms (for multiple availability zones in us-east1) |
| Recovery | Automatic | Automatic | Semi-automatic |
| Fault tolerance | Zero RPO for node and availability zone failures within a cluster | Zero RPO for node, availability zone, and region failures within a cluster | Zero RPO for node and availability zone failures within a cluster; region failures with loss up to RPO in a two-region (or two-datacenter) setup |
| Minimum regions to achieve fault tolerance | 1 | 3 | 2 |
For details on designing your cluster topology for HA with replication, refer to the Disaster Recovery Planning page.
Disaster recovery
- Backup and point-in-time restore: Point-in-time backup and restore allows you to roll back to a specific point in time. Support for multiple cloud storage providers means that you can store backups in your chosen provider. Incremental backups allow you to configure backup frequency for lower RPO; see the sketch after this list.
- Physical cluster replication (PCR): A cross-cluster replication tool between an active primary CockroachDB cluster and a passive standby CockroachDB cluster. PCR provides transactionally consistent full-cluster replication.
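As a minimal sketch of the backup tooling described above, the following statements schedule hourly incremental backups (with daily full backups) and then restore to a specific point in time. The `external://backup_storage` URI, the schedule label, and the timestamp are placeholders, not part of this page.

```sql
-- Hourly incremental backups with a daily full backup (placeholder label and URI).
-- revision_history allows restoring to an arbitrary point in time covered by
-- the backup chain.
CREATE SCHEDULE hourly_backups
  FOR BACKUP INTO 'external://backup_storage'
  WITH revision_history
  RECURRING '@hourly'
  FULL BACKUP '@daily';

-- Restore from the most recent backup chain, rolled back to a specific time
-- (placeholder timestamp).
RESTORE FROM LATEST IN 'external://backup_storage'
  AS OF SYSTEM TIME '2025-06-01 12:00:00';
```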
Choose a DR strategy
CockroachDB is designed to recover automatically; however, building backups or PCR into your DR plan protects against unforeseen incidents.
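As a sketch of how PCR slots into a DR plan, the statements below assume a recent CockroachDB version with virtual clusters enabled; the virtual cluster name and the connection string to the primary are placeholders, and the exact statements may differ by version.

```sql
-- On the standby cluster: begin replicating the primary cluster's data
-- (placeholder name and connection string).
CREATE VIRTUAL CLUSTER main
  FROM REPLICATION OF main
  ON 'postgresql://{replication user}@{primary host}:26257?sslmode=verify-full';

-- During failover: advance the standby to the latest replicated timestamp
-- so it can begin serving traffic.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO LATEST;
```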
| | Point-in-time backup & restore | Physical cluster replication (asynchronous) |
|---|---|---|
| RPO | >= 5 minutes | 10s of seconds |
| RTO | Minutes to hours, depending on data size and number of nodes | Seconds to minutes, depending on cluster size and time of failover |
| Write latency | No impact | No impact |
| Recovery | Manual restore | Manual failover |
| Fault tolerance | Not applicable | Zero RPO for node and availability zone failures within a cluster; region failures with loss up to RPO in a two-region (or two-datacenter) setup |
| Minimum regions to achieve fault tolerance | 1 | 2 |