Production Checklist


Before deploying CockroachDB Cloud clusters in production, it is important to understand the Shared Responsibility Model that delineates the responsibilities of Cockroach Labs and the customer in managing CockroachDB Cloud clusters.

Under the Shared Responsibility Model, Cockroach Labs is responsible for the following tasks:

  • Cluster and cloud service availability and reliability.
  • Maintenance and security of hardware and operating systems.
  • Database and security patches.
  • Automated cluster backups.

The customer is responsible for the following tasks:

  • Estimating workload requirements and scaling clusters as required to ensure sufficient storage, compute, and memory capacity for each cluster.
  • Monitoring cluster health and application performance.
  • Ensuring that the workload is distributed appropriately across the nodes of the cluster.
  • Performance tuning of SQL queries and schema.
  • Initiating major version upgrades and selecting maintenance windows for patch releases.
  • (Optional) Taking self-managed backups.

This page provides important recommendations for CockroachDB Cloud production tasks for which the customer is responsible.

Deployment options

When planning your deployment, it is important to carefully review and choose the deployment options that best meet your scale, cost, security, and resiliency requirements.

Make sure your cluster has sufficient storage, CPU, and memory to handle the workload. The general formula to calculate the storage requirement is as follows:

storage (GB) = raw data (GB) * replication factor (3 by default) * compression factor (0.6, which removes 40% to account for compression) * headroom (1.5 to 2)
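As a rough illustration, the formula above can be evaluated for both ends of the headroom range. This is a minimal sketch: the function name and the 500 GB input are hypothetical, and the defaults simply mirror the factors given in the formula.

```python
def estimate_storage_gb(raw_data_gb, replication_factor=3,
                        compression_factor=0.6, headroom=1.5):
    """Estimate required storage (GB) using the checklist formula.

    compression_factor=0.6 keeps 60% of the raw size (40% removed);
    headroom is the safety multiplier from the 1.5 to 2 range.
    """
    if raw_data_gb < 0:
        raise ValueError("raw_data_gb must be non-negative")
    return raw_data_gb * replication_factor * compression_factor * headroom


# Hypothetical workload: 500 GB of raw data.
low = estimate_storage_gb(500, headroom=1.5)
high = estimate_storage_gb(500, headroom=2.0)
print(f"Plan for roughly {low:.0f} to {high:.0f} GB of storage")
```

For 500 GB of raw data, this gives a range of about 1,350 GB to 1,800 GB across the cluster.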

Note:

CockroachDB Advanced clusters can be created with a minimum of 4 vCPUs per node.

For an example, refer to Plan your Advanced cluster.

Topology patterns

When planning your deployment, it is important to carefully review and choose the topology patterns that best meet your latency and resiliency requirements. This is especially crucial for multi-region deployments.

Cluster management

You can create and manage CockroachDB Cloud clusters using the Cloud Console, Cloud API, ccloud CLI, or the Terraform provider.

Transaction retries

When several transactions try to modify the same underlying data concurrently, they may experience contention that leads to transaction retries. To avoid failures in production, your application should be engineered to handle transaction retries using client-side retry handling.
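A client-side retry loop might look like the following sketch. It assumes the driver raises an exception carrying the SQLSTATE code 40001 (the serialization failure code that CockroachDB uses for retryable errors) in an attribute named sqlstate or pgcode; attribute names vary by driver, so treat them as assumptions. The run_with_retries helper and its parameters are hypothetical, and txn_fn is expected to run and commit one complete transaction each time it is called.

```python
import random
import time

RETRYABLE_SQLSTATE = "40001"  # serialization failure, signals "retry me"


def run_with_retries(txn_fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call txn_fn until it succeeds or max_attempts is reached.

    Only errors whose SQLSTATE is 40001 are retried; anything else is
    re-raised immediately. Attribute names are driver-specific assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except Exception as exc:
            code = getattr(exc, "sqlstate", None) or getattr(exc, "pgcode", None)
            if code != RETRYABLE_SQLSTATE or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing clients spread out.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

The backoff values here are placeholders; tune them against the contention your workload actually sees.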

SQL best practices

To ensure optimal SQL performance for your CockroachDB Cloud cluster, follow the best practices described in the SQL Performance Best Practices guide.

Network authorization

CockroachDB Cloud requires you to authorize the networks that can access the cluster in order to prevent denial-of-service and brute force password attacks. During the application development phase, you might have authorized only your local machine’s network. To move into production, you need to authorize the networks used by your application servers.

To verify that you have authorized an application server's network, navigate to the Networking page on the CockroachDB Cloud Console and verify that the application server network is listed under Authorized Networks. If the network is not listed, you can add it to authorize the network.

Warning:

Production clusters should not authorize 0.0.0.0/0, which allows all networks. While developing and testing your application on CockroachDB Advanced, you may have manually added 0.0.0.0/0 to the allowlist. CockroachDB Basic and Standard allowlist 0.0.0.0/0 by default. Before moving into production, make sure you delete the allowlist entry for the 0.0.0.0/0 network.
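If you keep a copy of your allowlist entries as CIDR strings (for example, exported from your own infrastructure configuration), a pre-production check along these lines can flag the open entry and confirm that an application server is covered. This is a sketch using only the Python standard library; the function names and sample addresses are hypothetical, and it does not call the Cloud Console or Cloud API.

```python
import ipaddress


def find_open_entries(allowlist):
    """Return allowlist entries that admit every address (prefix length 0)."""
    return [
        cidr for cidr in allowlist
        if ipaddress.ip_network(cidr, strict=False).prefixlen == 0
    ]


def is_authorized(client_ip, allowlist):
    """Return True if client_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(cidr, strict=False) for cidr in allowlist
    )


# Hypothetical allowlist: one leftover development entry, one app subnet.
allowlist = ["0.0.0.0/0", "203.0.113.0/24"]
print(find_open_entries(allowlist))
print(is_authorized("203.0.113.7", ["203.0.113.0/24"]))
```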

For enhanced network security and reduced network latency, you can set up private connectivity so that inbound connections to your cluster from your cloud tenant are made over the cloud provider's private network rather than over the public internet. For CockroachDB Advanced clusters deployed on GCP, refer to Google Cloud Platform (GCP) Virtual Private Cloud (VPC) peering. For CockroachDB Advanced clusters or multi-region CockroachDB Basic or Standard clusters deployed on AWS, refer to Amazon Web Services (AWS) PrivateLink.

SQL connection handling

The following guidelines can help you configure your cluster and application server to mitigate connection disruptions.

Keep connections current

After an application establishes a connection to CockroachDB Cloud, the connection may become invalid. This could be due to a variety of factors, such as a change in the cluster topology, a rolling upgrade, cluster or hardware maintenance, network disruption, or cloud infrastructure unavailability.

CockroachDB Basic

In your application server, set the maximum lifetime of a connection to between 5 and 30 minutes. Clients connected for a longer duration may be reset during maintenance, with the potential to disrupt applications.

CockroachDB Standard

In your application server, set the maximum lifetime of a connection to between 5 and 30 minutes, and set the server.shutdown.connections.timeout cluster setting equal to the maximum connection lifetime. When a node is shut down or restarted, clients that are still connected after server.shutdown.connections.timeout elapses may be reset, with the potential to disrupt applications.
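To keep the two values aligned, you could derive the cluster setting from the same number you give your connection pool. The sketch below validates a lifetime against the 5 to 30 minute range and renders a SET CLUSTER SETTING statement to run from a SQL client. The helper name is hypothetical, and how you set the pool's maximum lifetime depends on your driver or pooling library.

```python
MIN_LIFETIME_S = 5 * 60    # 5 minutes
MAX_LIFETIME_S = 30 * 60   # 30 minutes, also the setting's maximum


def shutdown_timeout_statement(max_conn_lifetime_s):
    """Render SQL that matches the drain timeout to the pool's max lifetime."""
    if not MIN_LIFETIME_S <= max_conn_lifetime_s <= MAX_LIFETIME_S:
        raise ValueError("connection lifetime should be between 300 and 1800 seconds")
    return (
        "SET CLUSTER SETTING server.shutdown.connections.timeout = "
        f"'{int(max_conn_lifetime_s)}s';"
    )


# Hypothetical pool configured with a 15-minute maximum connection lifetime.
print(shutdown_timeout_statement(15 * 60))
```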

The following cluster settings relate to node shutdown for maintenance, upgrades, or scaling. Depending on the requirements of your applications and workloads, you may need to modify them.

Cluster setting: server.shutdown.connections.timeout (alias: server.shutdown.connection_wait)

  • Default: 0 seconds
  • Maximum: 30 minutes (1800 seconds)
  • Details: How long to wait for client connections to drain before forcibly disconnecting them from the node. A connection with a lifetime that exceeds server.shutdown.connections.timeout may be interrupted during node restarts.

Cluster setting: server.shutdown.transactions.timeout (alias: server.shutdown.query_wait)

  • Default: 90 seconds
  • Maximum: 90 seconds
  • Details: The maximum duration after server.shutdown.connections.timeout elapses to wait for incomplete transactions to complete. Transactions lasting longer than server.shutdown.transactions.timeout may be canceled to allow the node to restart. Cockroach Labs recommends lowering server.shutdown.transactions.timeout if the duration of your workload's longest-running transaction is typically shorter than 90 seconds. A higher value results in slower cluster operations such as upgrades and scaling events. Decreasing this value reduces node shutdown time at the expense of running transactions being canceled during node restarts.

Connection pooling

Creating the appropriate size pool of connections is critical to gaining maximum performance in an application. The best pool size depends upon the workload and the resources available to the cluster. Too few connections in the pool can increase latency if an operation must wait for a connection to open up, while too many connections can increase latency if the system is overloaded running too many connections in parallel. It can take more time and resources for many connections to complete in parallel than for a smaller number of connections to run sequentially.
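As a starting point for sizing, the sketch below splits a cluster-wide connection budget evenly across application instances. The multiplier of four connections per vCPU is an assumption used here as a rule of thumb, not a value stated on this page; validate it against your own workload. The function and parameter names are hypothetical.

```python
def suggested_pool_size(total_vcpus, app_instances, connections_per_vcpu=4):
    """Split a cluster-wide connection budget across application instances.

    connections_per_vcpu=4 is an assumed rule of thumb; measure latency and
    throughput under load before settling on a final value.
    """
    if total_vcpus <= 0 or app_instances <= 0:
        raise ValueError("total_vcpus and app_instances must be positive")
    cluster_budget = total_vcpus * connections_per_vcpu
    # Every instance needs at least one connection to do any work.
    return max(1, cluster_budget // app_instances)


# Hypothetical deployment: 3 nodes x 4 vCPUs, served by 4 app instances.
print(suggested_pool_size(total_vcpus=12, app_instances=4))
```

Dividing the budget per instance keeps the total number of active connections bounded as you add application servers, which is the overload case described above.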

For guidance on sizing, validating, and using connection pools with CockroachDB, refer to Use Connection Pools.

Monitoring and alerting

Even with CockroachDB's various built-in safeguards against failure, it is critical to actively monitor the overall health and performance of a cluster running in production and to create alerting rules that promptly send notifications when there are events that require investigation or intervention.

To use the CockroachDB Cloud Console to monitor and set alerts on important events and metrics, refer to Monitoring and Alerting. You can also set up monitoring with Datadog or CloudWatch.

Backup and restore

For CockroachDB Basic and Standard clusters, Cockroach Labs takes full cluster backups hourly, and retains them for 30 days. Full backups for a deleted cluster are retained for 30 days after it is deleted.

For CockroachDB Advanced clusters, Cockroach Labs takes full cluster backups daily and incremental cluster backups hourly. Full backups are retained for 30 days, and incremental backups are retained for 7 days. After a cluster is deleted, Cockroach Labs will retain daily full backups for 30 days from when the backup was originally taken. There are no newly created backups after a cluster is deleted.
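The retention periods above can be expressed as a small lookup, which may help when reasoning about how far back a managed restore can reach. This is a sketch only: the plan and backup-kind labels are hypothetical strings, the inclusive boundary is an assumption, and the special handling of deleted clusters is not modeled.

```python
from datetime import datetime, timedelta, timezone

# Retention periods for managed backups, keyed by (plan, backup kind).
RETENTION = {
    ("basic", "full"): timedelta(days=30),
    ("standard", "full"): timedelta(days=30),
    ("advanced", "full"): timedelta(days=30),
    ("advanced", "incremental"): timedelta(days=7),
}


def is_retained(plan, kind, taken_at, now):
    """Return True if a managed backup taken at taken_at is still retained."""
    window = RETENTION.get((plan, kind))
    if window is None:
        raise ValueError(f"no managed {kind} backups for plan {plan!r}")
    return now - taken_at <= window


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
ten_days_ago = now - timedelta(days=10)
print(is_retained("advanced", "full", ten_days_ago, now))
print(is_retained("advanced", "incremental", ten_days_ago, now))
```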

Backups are stored in a single-region cluster's region or a multi-region cluster's primary region.

Cluster data can be restored to the current cluster or a different cluster in the same organization. A table or database can be selectively restored from the Backups tab.

Warning:

Restoring to a cluster will completely erase all data in the destination cluster. All cluster data will be replaced with the data from the backup. The destination cluster will be unavailable while this operation is in progress. This operation cannot be canceled, paused, or reversed.

You can manage your own backups, including incremental, database, and table-level backups. When you perform a manual backup, you must specify a storage location, which can be on your local system or in cloud storage.

Patches and upgrades

CockroachDB Cloud supports the latest major version of CockroachDB and the version immediately preceding it. Support for these versions includes patch version upgrades and security patches.

Major version upgrades

Major version upgrades are automatic for CockroachDB Basic and Standard clusters and opt-in for CockroachDB Advanced clusters. Cluster Operators must initiate major version upgrades for CockroachDB Advanced clusters. When a major version upgrade is initiated for a cluster, it subsequently will be upgraded to the latest patch version automatically.

Since upgrading a cluster can have a significant impact on your workload, make sure you review the release notes for the latest version for backward compatibility, cluster setting changes, deprecations, and known limitations. Cockroach Labs recommends initiating the upgrade during off-peak periods. After the upgrade, carefully monitor cluster and application health. If you notice functional or performance regression, you can roll back the changes for up to 72 hours before the upgrade is automatically finalized. After an upgrade, some features might be unavailable until the upgrade is finalized. For more information, refer to Major version upgrades.
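When scheduling post-upgrade checks, it can help to compute when the rollback window closes. The sketch below assumes the 72 hours are counted from the moment the upgrade is initiated; confirm the exact reference point in Major version upgrades. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FINALIZATION_DELAY = timedelta(hours=72)


def rollback_deadline(upgrade_started_at):
    """Return when the upgrade is expected to finalize automatically."""
    return upgrade_started_at + FINALIZATION_DELAY


def can_roll_back(upgrade_started_at, now):
    """Return True while the rollback window is still open."""
    return now < rollback_deadline(upgrade_started_at)


# Hypothetical upgrade started on a Monday at 02:00 UTC.
started = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
print(rollback_deadline(started).isoformat())
```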

Patch upgrades

For CockroachDB Advanced clusters, Organization Admins can set a weekly 6-hour maintenance window during which available maintenance and patch upgrades will be applied. Patch upgrades can also be deferred for 60 days. If no maintenance window is configured, CockroachDB Advanced clusters will be automatically upgraded to the latest supported patch version as soon as it becomes available.
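Application-side tooling (for example, a job scheduler that avoids heavy batch work during maintenance) may need to know whether a given moment falls inside the weekly window. The sketch below assumes the window is described by a start weekday and start hour in UTC; that representation and the function name are hypothetical and are not taken from the Cloud Console or Cloud API.

```python
from datetime import datetime, timedelta, timezone

WINDOW_LENGTH = timedelta(hours=6)
WEEK = timedelta(days=7)


def in_maintenance_window(now, start_weekday, start_hour):
    """Return True if now is inside the weekly 6-hour window.

    start_weekday uses Python's convention (0 = Monday ... 6 = Sunday);
    start_hour and now are both interpreted in UTC.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_back = (now.weekday() - start_weekday) % 7
    # Most recent window start on or before today's date.
    start = midnight - timedelta(days=days_back) + timedelta(hours=start_hour)
    if start > now:
        start -= WEEK  # today's window has not begun; use last week's
    return now - start < WINDOW_LENGTH


# Hypothetical window: Sundays at 22:00 UTC, so it spans into Monday.
monday_1am = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
print(in_maintenance_window(monday_1am, start_weekday=6, start_hour=22))
```

Computing the most recent start first handles windows that cross midnight or the week boundary without special cases.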

For more information, refer to Patch version upgrades.

PCI ready features

CockroachDB Advanced has access to all features required for PCI readiness. You must configure these settings to make your cluster PCI-ready:

You can check the status of these features on the PCI ready page of the CockroachDB Cloud Console.

