In this blog post, we provide some details on how CockroachDB uses system memory on each node, and what you can do to keep memory usage in CockroachDB under control.
To understand memory usage in CockroachDB, and specifically within a CockroachDB node, it is perhaps useful to imagine memory as a giant cake that is being split into pieces and handed out for CockroachDB’s various components to “eat”.
There are three main cake eaters in CockroachDB; in approximately decreasing order of appetite:
These consumers have in common that the amount of memory they need varies depending on run-time parameters. (There are other fixed-size uses of memory in CockroachDB, but they are negligible in comparison with those above.)
RocksDB’s cache is pretty straightforward: it will consume up to the --cache-size
parameter’s value worth of memory, by default 25% of total system memory, and no more.
The node metadata, in comparison with the other consumers, is negligible, even for clusters with dozens of nodes.
The other items – metadata for replicas and memory used for activity – need more attention and are the subject of this blog post.
On each node, store memory usage increases mainly with the number of stores times the number of replicas per store.
How do database parameters and status influence per-store memory consumption?
Therefore we can say that memory usage related to stores at rest evolves approximately with the following formula:
k1 × (DataSize × ReplicationFactor) / NumNodes + other
Where k1 is the metadata size per replica; and “other” is the rest of the memory usage per store, which is negligible in comparison. From a complexity theory perspective, we say that store memory usage is:
O( (1 / NumNodes) × ( DataSize × ReplicationFactor) ).
The main variables in this equation are available from CockroachDB’s admin UI: the number of nodes and replicas per node.
Although the precise value of the parameter k1 is not directly available, an operator can estimate the overall memory cost incurred by stores. This can be done by temporarily disabling SQL clients on one of the cluster nodes (or, alternatively, by adding a new node without clients and letting replicas migrate to it), ensuring that distributed SQL queries are not run on the isolated node, then observing how much system memory the node needs and subtracting the RocksDB cache size from that figure. The result gives the overall cost of replica metadata.
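To make this concrete, here is a minimal back-of-envelope sketch in Go that follows the estimation procedure above: subtract the RocksDB cache from the memory observed on the isolated node to approximate the replica metadata cost, then reuse the formula to project per-node usage as nodes are added. All input figures (memory sizes, replica counts) are hypothetical placeholders, not CockroachDB measurements.

```go
package main

import "fmt"

func main() {
	// Hypothetical observations from an isolated node (no SQL clients, no
	// distributed SQL queries). All figures are placeholders.
	const (
		observedNodeMemoryGiB = 3.0  // system memory in use on the node
		rocksDBCacheGiB       = 2.0  // --cache-size (25% of an 8 GiB machine)
		replicasOnNode        = 5000 // replica count, visible in the admin UI
	)

	// Per the procedure above: what is left after subtracting the cache
	// approximates the replica metadata cost; dividing by the replica count
	// gives an estimate of k1 (metadata size per replica).
	replicaMetadataGiB := observedNodeMemoryGiB - rocksDBCacheGiB
	k1MiB := replicaMetadataGiB * 1024 / replicasOnNode
	fmt.Printf("replica metadata ≈ %.1f GiB, k1 ≈ %.2f MiB per replica\n",
		replicaMetadataGiB, k1MiB)

	// The formula has the node count as a divisor. DataSize × ReplicationFactor
	// determines the total number of replicas in the cluster, so we can reason
	// directly in replicas: with data size and replication factor held constant,
	// doubling the number of nodes halves the per-node metadata cost.
	totalReplicas := replicasOnNode * 5 // e.g. the 5-node cluster we measured
	for _, numNodes := range []int{5, 10, 20} {
		perNodeGiB := k1MiB * float64(totalReplicas) / float64(numNodes) / 1024
		fmt.Printf("%2d nodes: ≈ %.1f GiB of replica metadata per node\n",
			numNodes, perNodeGiB)
	}
}
```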
Another characteristic of the at-rest memory usage incurred by stores is that it evolves slowly and predictably: the overall database size is typically an application-level metric that closely follows business growth, and the number of nodes is typically changed under operator supervision.
Finally, note how the formula above has the number of nodes as a divisor: adding more nodes, keeping all other parameters equal, will decrease memory usage per node. This property can be used advantageously when running on cloud resources, where memory is often not priced linearly: adding 10 new nodes with 2GB of RAM each is usually cheaper than switching five existing nodes from 2GB to 6GB each. Scale can lower your costs!
Under load, each node receives requests to update stores locally and across the network. These requests are queued until they are processed. Bursts of activity cause data to pile up in queues.
How do database parameters and status influence this memory consumption?
Therefore we can say that memory usage related to store activity evolves approximately with the following formula:
k2 × NumInFlightOperations / (NumNodes × NodeSpeed) + other
Where k2 is the memory cost of the data that accumulates in queues for each in-flight operation; and “other” is the rest of the memory usage due to activity load, which is negligible in comparison. From a complexity theory perspective, we say that memory usage related to store activity is:
O( (1 / NumNodes) × ( NumInFlightOperations / NodeSpeed) ).
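The same kind of back-of-envelope sketch applies here. Since neither k2 nor a node’s “speed” is directly measurable, the values below are purely hypothetical; the only point is to show that the node count and node speed both act as divisors.

```go
package main

import "fmt"

// queuedMemoryGiB evaluates k2 × NumInFlightOperations / (NumNodes × NodeSpeed)
// with made-up constants, to illustrate the trend rather than predict real usage.
func queuedMemoryGiB(inFlightOps, numNodes, nodeSpeed float64) float64 {
	const k2GiB = 0.0005 // hypothetical per-operation queueing cost, in GiB
	return k2GiB * inFlightOps / (numNodes * nodeSpeed)
}

func main() {
	// A contention spike multiplies the in-flight operations; adding nodes or
	// using faster nodes divides the resulting per-node queue memory.
	fmt.Println(queuedMemoryGiB(10000, 5, 1))  // baseline
	fmt.Println(queuedMemoryGiB(40000, 5, 1))  // 4× spike in in-flight operations
	fmt.Println(queuedMemoryGiB(40000, 10, 2)) // same spike, 2× nodes, 2× faster nodes
}
```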
Again, the admin UI provides some insight by showing the number of SQL clients and queries executed, which in a production database is likely to correspond more or less to the load on the stores.
This usage is less predictable than at-rest memory usage. Although the number of nodes is a known and relatively stable value, the number of in-flight operations can evolve unpredictably. For example, a sudden increase in contention (e.g. a single post going viral, leading to contention on transactions updating a hit counter) could cause large spikes in the number of in-flight operations.
This said, notice how the processing speed of nodes also appears as a divisor in this formula. As long as the nodes running CockroachDB are fast enough (this includes both CPU and disk speeds), the stores will be able to absorb spikes in activity more effectively. In a way, hardware performance dampens the unpredictability of store activity.
Each SQL client in CockroachDB that issues queries or statements will incur memory allocations on (at least) the node where the query was received.
There are three overarching concepts that you need to understand before analyzing SQL memory usage further.
Understanding which channels exist in CockroachDB helps answer the questions “Where are sessions coming from?” and thus “How many sessions are open on this node?”
Once you grok these three concepts, you can understand the ground formula that determines SQL-related memory usage in CockroachDB:
NumSessions × ( SQL-level memory usage + indirect costs)
NumSessions depends on activity over the various channels. We detail these variables in the following sub-sections.
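As a rough illustration of this formula, and assuming entirely hypothetical per-session figures, a back-of-envelope capacity check for a single node might look like this:

```go
package main

import "fmt"

func main() {
	// NumSessions × (SQL-level memory usage + indirect costs), evaluated with
	// purely hypothetical placeholder values.
	const (
		appInstances       = 20   // application servers connecting to this node
		poolSizePerApp     = 10   // pgwire connections (sessions) per instance
		sqlLevelPerSession = 4.0  // MiB: prepared statements, result sets, row storage
		indirectPerSession = 2.0  // MiB: other per-session costs
		sqlBudgetMiB       = 2048 // memory set aside for SQL on this node
	)
	numSessions := appInstances * poolSizePerApp
	estimateMiB := float64(numSessions) * (sqlLevelPerSession + indirectPerSession)
	fmt.Printf("≈ %.0f MiB of SQL memory for %d sessions (budget %d MiB)\n",
		estimateMiB, numSessions, sqlBudgetMiB)
}
```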
The most obvious channel through which SQL enters CockroachDB is client connections using the PostgreSQL wire protocol (“pgwire”).
In addition to this “main” channel, sessions are also created via:
- the admin RPC endpoint, which serves the admin UI;
- client commands such as cockroach user; here too there is at most one active session per running command;
- “internal” sessions opened by the node itself, for example to update table leases.

Notice how table lease updates are not performed on behalf of client sessions, and are instead performed using an internal session. This is mainly because lease cache updates require a different set of database permissions than those available in client sessions.
When a SQL session appears on a node, a couple of data structures of negligible size are allocated in memory, and then the session starts accepting statements (or queries) from clients.
The statements are organized in transactions: single-statement “implicit” transactions for statements issued outside of a BEGIN-COMMIT block, and “explicit” transactions where the BEGIN-COMMIT block is defined by the client.
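As a small illustration, here is a sketch in Go against a hypothetical orders table, assuming an already-open database handle; the first statement below runs in its own implicit transaction, while the Begin/Commit pair defines an explicit one.

```go
package example

import "database/sql"

// touchOrders contrasts the two transaction styles against a hypothetical
// "orders" table, given an already-open connection to a CockroachDB node.
func touchOrders(db *sql.DB) error {
	// Implicit transaction: a single statement issued outside a BEGIN-COMMIT block.
	if _, err := db.Exec(`UPDATE orders SET x = x + 1 WHERE id = $1`, 1); err != nil {
		return err
	}

	// Explicit transaction: the client defines the BEGIN-COMMIT block, and
	// transaction-specific allocations on the node live until Commit or Rollback.
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(`UPDATE orders SET x = x + 1 WHERE id = $1`, 2); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```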
Memory allocations on behalf of a session can thus be further divided into:
For now, the only session-wide allocations are for prepared statements and portals. The pgwire protocol allows clients to “prepare” statements once in the session, which can then be reused multiple times without spelling out their SQL in full. After a statement has been prepared, its corresponding SQL code and data structures remain in the node’s memory until the session ends or the client deallocates the prepared statement. Portals bind parameter values to prepared statements, also in the node’s memory, and have a similar lifetime.
Besides those, the only session-lingering allocations are result sets. After a query completes, and until its response is fully transmitted back to the client, the results of that query are retained in the node’s memory. We call these packets of results “buffered result sets”. These result sets can become arbitrarily large depending on the SQL code being executed. (Note that we have plans to reduce the amount of buffering for result sets, but this is not implemented at the time of this writing.)
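Here is a hedged client-side sketch of both kinds of allocation, using Go’s database/sql with the lib/pq driver (CockroachDB speaks the PostgreSQL wire protocol); the connection string, table, and column names are all hypothetical.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks this protocol
)

func main() {
	ctx := context.Background()

	// Hypothetical connection string; 26257 is CockroachDB's default SQL port.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection so everything below happens in one SQL session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Session-wide allocation: the prepared statement stays in the node's
	// memory until stmt.Close() is called or the session ends.
	stmt, err := conn.PrepareContext(ctx,
		`SELECT id, status FROM orders WHERE customer_id = $1 LIMIT 100`)
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close() // deallocates the prepared statement on the node

	// Session-lingering allocation: each execution's result set is buffered on
	// the node until it has been transmitted back to the client, so keeping
	// result sets small (e.g. with LIMIT) keeps that buffer small too.
	rows, err := stmt.QueryContext(ctx, 42)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var status string
		if err := rows.Scan(&id, &status); err != nil {
			log.Fatal(err)
		}
		// ... use id and status ...
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Closing the prepared statement (or ending the session) is what releases the corresponding memory on the node.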
The remainder of SQL memory usage is caused by transaction-specific allocations.
Of these, the bulk of memory usage is caused by in-memory row storage: for example, when a query reads from a virtual table (such as pg_catalog.schemata or information_schema.columns), its contents are populated in memory for the duration of the transaction. The memory cost of these virtual tables is approximately proportional to the complexity of the database schema visible to the user issuing the query.

We have some plans to support disk-based temporary storage for operations involving large tables; however, regardless of further work on that front, CockroachDB will continue to support in-memory row storage so as to keep responding quickly to common queries.
Beyond SQL-level row storage, there are two additional per-session indirect costs which can grow arbitrarily large depending on which SQL code is executed: the KV batches assembled to communicate with the storage layer, and the write intents laid down by mutating statements, which are tracked in the node’s memory until their transaction commits.
At this point we would like to bring your attention to the following additional observations:
Therefore, we would like to posit that the SQL-related memory usage incurred by non-pgwire channels is comparatively negligible. You can perhaps invalidate this postulate by connecting many web browsers to the admin UI of a single node, but this is unlikely to happen, or even be allowed, in production databases.
To enable operators to inspect SQL memory growth, we have recently added new metrics inside each CockroachDB node, which can also be viewed in the admin UI. These metrics separately track the approximate amount of memory used by pgwire clients, the admin RPC and “internal” sessions inside each node. You can now use these metrics to dimension system memory capacity as your database and application grow.
If you followed the previous analysis closely, you may have noticed that the amount of memory that can be allocated on behalf of pgwire-originating sessions is virtually unbounded!
There are many SQL queries, even seemingly simple ones, that a client can send which can cause a very large amount of memory to be allocated by the node processing the query. So much memory can be allocated, in fact, that it’s possible to exceed the total system capacity and cause the operating system to terminate the server process abnormally.
Here are a few examples:
- SELECT * FROM sometable when sometable contains many rows. This can blow up because the result set will be buffered in server memory.
- SELECT * FROM sometable ORDER BY somecolumn LIMIT 1. Although LIMIT restricts the size of the result set, this can blow up if somecolumn is not indexed and the table contains many rows.
- SELECT * FROM sometable, someothertable LIMIT 1. This syntax defines a cross join between sometable and someothertable. In the current implementation, sometable will first be loaded entirely in memory, and this can again blow up if it contains many rows.
- SELECT COUNT(DISTINCT *) FROM sometable. This requires the server to keep a copy of the unique rows in sometable, unless there is a unique index covering all columns in sometable.
- UPDATE sometable SET x = x + 1. This will lay as many write intents as there are rows in sometable, and keep track of all intents in the node’s memory until the enclosing transaction commits.

When we started working on a SQL interface for CockroachDB, we initially thought the risk of memory blow-up could be sidestepped merely by thoroughly documenting how CockroachDB uses memory, including e.g. this blog post and appropriate documentation yet to be written.
However, as time went by we also (re-)discovered an old truth: humans are fallible and developers make mistakes. When trying out SQL statements, whether as a CockroachDB developer or as an app developer, it is easy to forget about or misjudge the incurred memory costs and issue a query that will allocate a lot of memory in the node. This just happens!
Seeing a node crash because of an accidental user mistake is frustrating. Therefore, in addition to appropriate documentation and monitoring in the admin UI, we now also include a mechanism that provides a modicum of protection against accidental blow-ups.
The total amount of SQL-related allocations performed by pgwire sessions (i.e. not by sessions coming via other channels) is now limited by a global setting --max-sql-memory
, set to 25% of total system memory by default. This includes both session-wide (prepared statements, result sets) and transaction-specific (in-memory row storage) allocations. When a session causes allocations to reach the global maximum, it now fails with a client-side error, preventing the server’s effective memory usage from growing beyond the limit at which the operating system would kill the process.
The analysis up to this point makes a few assumptions that are design goals for CockroachDB, but which are currently neither tested nor enforced:
We pledge to increase visibility on these overheads over time, however until then please contact us whenever you suspect that some significant memory usage has been left unaccounted for.
In particular, because of the last caveat in the list above, do not set --max-sql-memory
too high. If you bring it so high that, in combination with the other items discussed in this blog post, it reaches the total memory available to the CockroachDB process, you run the risk that the Go runtime’s invisible allocations are not accounted for and will cause the process to tip over the limit and crash (or become very slow by going to disk swap). If you wish to tune this setting, try different values and monitor in the admin UI how it influences your effective total memory usage under client load. You should probably avoid increasing the value further as soon as the total memory usage under load grows beyond 80% of the overall capacity available to the process.
Also, at the time of this writing, pgwire-related memory usage is not fully tracked: KV batches and write intents are not accounted for yet. We plan to address this, too.
Finally, we acknowledge that the information in this blog post is more qualitative than quantitative. We have introduced formulas for which some parameters are still largely unmeasured. It may be hard for the reader to step back and create a concrete picture in her mind: which is the largest memory consumer in practice? Store metadata at rest or in-flight store activity? Or SQL activity? To provide this intuition, we intend to run and publish some additional measurements in a later blog post; of course we welcome external validations of this as well.
This is our memory story: a CockroachDB node mainly consumes memory for the RocksDB cache, replica metadata, in-flight storage operations, and the data structures needed for SQL processing. Most of this memory usage is relatively stable and/or predictable, following the general principles we have outlined above. We aim to improve our admin UI as the main interface available to operators to inspect, detail, and monitor memory usage.
Additionally, to guard against unpredictable memory usage, which should be caused mostly by accident in external SQL client applications, CockroachDB now implements a simple but effective node-wide soft limit: runaway SQL sessions now simply fail with a client error before they incur a memory blow-up in the node.