To update multiple rows in a table, you can use a single UPDATE
statement, with a WHERE
clause that filters the rows you want to update.
To update a large number of rows (i.e., tens of thousands of rows or more), we recommend iteratively updating subsets of the rows that you want to update, until all of the rows have been updated. You can write a script to do this, or you can write a loop into your application.
This page provides guidance on writing batch-update loops with a pattern that executes SELECT
and UPDATE
statements at different levels of a nested loop.
Exercise caution when batch-updating rows from tables with foreign key constraints and explicit ON UPDATE
foreign key actions. To preserve UPDATE
performance on tables with foreign key actions, we recommend using smaller batch sizes, as additional rows updated due to ON UPDATE
actions can make batch loops significantly slower.
Before you begin
Before reading this page, do the following:
- Create a CockroachDB Standard cluster or start a local cluster.
Install a Driver or ORM Framework.
For the example on this page, we use the
psycopg2
Python driver.Insert data that you now want to update.
For the example on this page, we load a cluster with the
movr
database and data fromcockroach workload
.
Write a batch-update loop
At the top level of a loop in your application, or in a script, execute a
SELECT
query that returns a large batch of primary key values for the rows that you want to update. When defining theSELECT
query:- Use a
WHERE
clause to filter on columns that identify the rows that you want to update. This clause should also filter out the rows that have been updated by previous iterations of the nestedUPDATE
loop:- For optimal performance, the first condition of the filter should evaluate the last primary key value returned by the last
UPDATE
query that was executed. This narrows eachSELECT
query's scan to the fewest rows possible, and preserves the performance of the row updates over time. - Another condition of the filter should evaluate column values persisted to the database that signal whether or not a row has been updated. This prevents rows from being updated more than once, in the event that the application or script crashes and needs to be restarted. If there is no way to distinguish between an updated row and a row that has not yet been updated, you might need to add a new column to the table (e.g.,
ALTER TABLE ... ADD COLUMN updated BOOL;
).
- For optimal performance, the first condition of the filter should evaluate the last primary key value returned by the last
- Add an
AS OF SYSTEM TIME
clause to the end of the selection subquery, or run the selection query in a separate, read-only transaction withSET TRANSACTION AS OF SYSTEM TIME
. This helps to reduce transaction contention. - Use a
LIMIT
clause to limit the number of rows queried to a subset of the rows that you want to update. To determine the optimalSELECT
batch size, try out different sizes (10,000 rows, 20,000 rows, etc.), and monitor the change in performance. Note that thisSELECT
batch size can be much larger than the batch size of rows that are updated in the subsequentUPDATE
query. - To ensure that rows are efficiently scanned in the subsequent
UPDATE
query, include anORDER BY
clause on the primary key.
- Use a
Under the
SELECT
query, write a nested loop that executesUPDATE
queries over the primary key values returned by theSELECT
query, in batches smaller than the initialSELECT
batch size. When defining theUPDATE
query:- Use a
WHERE
clause that filters on a subset of the primary key values returned by the top-levelSELECT
query. To determine the optimalUPDATE
batch size, try out different sizes (1,000 rows, 2,000 rows, etc.), and monitor the change in performance. - Make sure that the
UPDATE
query updates a column that signals whether or not the row has been updated. This column might be different from the column whose values you want to update. - Add a
RETURNING
clause to the end of the query that returns the primary key values of the rows being updated. TheWHERE
clause in the top-levelSELECT
query should filter out the primary key value of the last row that was updated, using the values returned by the lastUPDATE
query executed. - Where possible, we recommend executing each
UPDATE
in a separate transaction.
- Use a
Example
Suppose that over the past year, you've recorded hundreds of thousands of MovR rides in a cluster loaded with the movr
database. And suppose that, for the last week of December, you applied a 10% discount to all ride charges billed to users, but you didn't update the rides
table to reflect the discounts.
To get the rides
table up-to-date, you can create a loop that updates the relevant rows of the rides
table in batches, following the query guidance provided above.
In this case, you will also need to add a new column to the rides
table that signals whether or not a row has been updated. Using this column, the top-level SELECT
query can filter out rows that have already been updated, which will prevent rows from being updated more than once if the script crashes.
For example, you could create a column named discounted
, of data type BOOL
:
ALTER TABLE rides ADD COLUMN discounted BOOL DEFAULT false;
The Bulk-update Data
statement performs a schema change. For more information about how online schema changes work in CockroachDB, see Online Schema Changes.
In Python, a batch-update script might look similar to the following:
#!/usr/bin/env python3
import psycopg2
import os
import time
def main():
conn = psycopg2.connect(os.environ.get('DB_URI'))
lastid = None
while True:
with conn:
with conn.cursor() as cur:
cur.execute("SET TRANSACTION AS OF SYSTEM TIME '-5s'")
if lastid:
cur.execute("SELECT id FROM rides WHERE id > %s AND discounted != true AND extract('month', start_time) = 12 AND extract('day', start_time) > 23 ORDER BY id LIMIT 10000", (lastid,))
else:
cur.execute("SELECT id FROM rides WHERE discounted != true AND extract('month', start_time) = 12 AND extract('day', start_time) > 23 ORDER BY id LIMIT 10000")
pkvals = list(cur)
if not pkvals:
return
while pkvals:
batch = pkvals[:2000]
pkvals = pkvals[2000:]
with conn:
with conn.cursor() as cur:
cur.execute("UPDATE rides SET discounted = true, revenue = revenue*.9 WHERE id = ANY %s RETURNING id", (batch,))
print(cur.statusmessage)
if not pkvals:
lastid = cur.fetchone()[0]
del batch
del pkvals
time.sleep(5)
conn.close()
if __name__ == '__main__':
main()
At each iteration, the SELECT
query returns the primary key values of up to 10,000 rows of matching historical data from 5 seconds in the past, in a read-only transaction. Then, a nested loop iterates over the returned primary key values in smaller batches of 2,000 rows. At each iteration of the nested UPDATE
loop, a batch of rows is updated. After the nested UPDATE
loop updates all of the rows from the initial selection query, a time delay ensures that the next selection query reads historical data from the table after the last iteration's UPDATE
final update.
Note that the last iteration of the nested loop assigns the primary key value of the last row updated to the lastid
variable. The next SELECT
query uses this variable to decrease the number of rows scanned by the number of rows updated in the last iteration of the loop.