This page provides best practices for optimizing import performance in CockroachDB.
IMPORT INTO
is the fastest method to ingest data into CockroachDB but it requires taking the target table offline for the duration of the import. IMPORT INTO
is a good choice for initial data migrations and data migrations that can tolerate table downtime. If you cannot tolerate table unavailability, we recommend using COPY FROM
instead.
Import performance primarily depends on the amount of data that you want to import. However, there are three actions you should take before importing that have a significant impact on the amount of time it takes to run an import:
If the import size is less than 100 GiB, then you do not need to do anything to optimize performance. For such small datasets, the import should run quickly, regardless of the settings.
Choose a performant import format
Different import file formats do not have the same performance due to the way they are processed by CockroachDB. The fastest import file formats are:
CSV
orDELIMITED DATA
(both have about the same import performance).AVRO
.
We recommend formatting your import files as CSV
or AVRO
. These formats can be processed in parallel by all the nodes in the cluster, which increases performance. To import in these formats, use IMPORT INTO
.
As of v22.1, certain IMPORT TABLE
statements that defined the table schema inline are not supported. See Import — Considerations for more details. To import data into a new table, use CREATE TABLE
followed by IMPORT INTO
. For an example, read Import into a new table from a CSV file.
Import the schema separately from the data
When importing into a new table, split your dump data into two files:
- A SQL file containing the table schema.
- A CSV, delimited, or AVRO file containing the table data.
Convert the schema-only file using the Schema Conversion Tool. The Schema Conversion Tool automatically creates a new CockroachDB Cloud database with the converted schema.
If you are migrating to a CockroachDB self-hosted database, you can export the converted schema from the Schema Conversion Tool and execute the statements in cockroach sql
, or use a third-party schema migration tool such as Alembic, Flyway, or Liquibase.
Then use the IMPORT INTO
statement to import the CSV data into the newly created table:
> IMPORT INTO customers (id, name)
CSV DATA (
'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv'
);
This method has the added benefit of alerting on potential issues with the import sooner; that is, you will not have to wait for the file to load both the schema and data just to find an error in the schema.
Import into a schema with secondary indexes
When importing data into a table with secondary indexes, the import job will ingest the table data and required secondary index data concurrently. This may result in a longer import time compared to a table without secondary indexes. However, this typically adds less time to the initial import than following it with a separate pass to add the indexes. As a result, importing tables with their secondary indexes is the default workflow, an effective strategy for most migrations.
However, in large imports (that is, datasets larger than 100 GiB in total size), it may be preferable to temporarily remove the secondary indexes from the schema, perform the import, and then re-create the indexes separately. Write operations on tables with many secondary indexes take longer to complete, and importing large datasets have a greater risk of timeouts. Removing the table's secondary indexes allows you to separate the initial data import from the secondary index creation operation, and the total import time is lower. Separating these operations also provides increased visibility into each operation's progress, and the ability to retry each operation independently if you encounter errors or timeouts.
When recreating the secondary indexes, execute the CREATE INDEX
statements one at a time for best performance. For example:
CREATE INDEX idx1 ON table1;
CREATE INDEX idx2 ON table1;
CREATE INDEX idx3 ON table1;
Temporarily remove foreign keys
When importing data into a table with foreign keys, temporarily remove the foreign keys, and add them after the initial migration with an ALTER TABLE
statement.
Data type sizes
Above a certain size, many data types such as STRING
s, DECIMAL
s, ARRAY
, BYTES
, and JSONB
may run into performance issues due to write amplification. See each data type's documentation for its recommended size limits.
Split your data into multiple files
Splitting the import data into multiple files can have a significant impact on the import performance. The following formats support multi-file import using IMPORT INTO
:
CSV
DELIMITED DATA
AVRO
For these formats, we recommend splitting your data into at least as many files as there are nodes.
For example, if you have a 3-node cluster, split your data into at least 3 files, create your table schema, and import into that table:
CREATE TABLE customers (id UUID PRIMARY KEY, name TEXT, INDEX name_idx(name));
IMPORT INTO customers (id, name)
CSV DATA (
's3://{BUCKET NAME}/{customers.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
's3://{BUCKET NAME}/{customers2.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
's3://{BUCKET NAME}/{customers3.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
's3://{BUCKET NAME}/{customers4.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
);
CockroachDB ingests the files as provided and does not automatically split them further. For example, if you import one large file containing all of your data, CockroachDB will process that file on one node, even if you have more nodes available. However, if you import three files (and your cluster has at least three nodes), each node will process a file in parallel. This is why splitting your data into at least as many files as you have nodes will dramatically decrease the time it takes to import data.
You can split the data into more files than you have nodes. CockroachDB will process the files in parallel across the cluster. When splitting the data Cockroach Labs recommends splitting it into a multiple of the number of nodes in your cluster, if possible. For example, if you have a 3 node cluster, split the dataset into 9, 27, or 300 files.
Cockroach Labs recommends keeping the files to a maximum file size of 4 GB unless the files are streamed (for example, cloud storage locations will stream the data to CockroachDB), and to keep each file size similar across the dataset. For example, if you are importing a 9 GB dataset that was split into 3 files into a 3 node cluster, keep each file around 3 GB in size if possible. Don't split the data into two 4 GB files, and one 1 GB file.
For maximum performance each split file should be sorted within and across all files, meaning that if you were to sort the rows in each file there would be no overlapping data in any other file. For example, suppose your table has an alphabetic string primary key and you were importing the data into a 3 node cluster.
CREATE TABLE contacts (email STRING PRIMARY KEY, first_name TEXT, last_name TEXT);
You should split the files so that the first file contains rows where the primary key begins with the letters "a" to "i," the second file "j" to "r," and the third file "s" to "z". No file should contain duplicate rows, and within each file the data is sorted alphabetically by the primary key.
SELECT * FROM contacts ORDER BY email;
email | first_name | last_name
--------------------------------------+--------------+--------------------
aallabartonb6@google.co.uk | Artemas | Allabarton
aantonijevicae@simplemachines.org | Anselma | Antonijevic
abaakeo2@washington.edu | Adolphus | Baake
abamellg0@edublogs.org | Abey | Bamell
abasson3w@yelp.com | Anastasie | Basson
abilverstone1v@omniture.com | Ajay | Bilverstone
ablackleyfe@craigslist.org | Alberik | Blackley
aboutwelln4@msn.com | Anabella | Boutwell
abremner7j@narod.ru | Aveline | Bremner
abumpusby@opensource.org | Ansel | Bumpus
acaesaref@fda.gov | Allyn | Caesar
acottamav@nymag.com | Alair | Cottam
adackei2@symantec.com | Andrea | Dacke
adaenen@answers.com | Alessandra | Daen
adalgety1x@shinystat.com | Aura | Dalgety
adenyukhinh2@examiner.com | Arvy | Denyukhin
...
File storage during import
During migration, all of the features of IMPORT INTO
that interact with external file storage assume that every node has the exact same view of that storage. In other words, in order to import from a file, every node needs to have the same access to that file.
If a node runs out of disk space while processing an import job, CockroachDB will pause the import job. Make sure each node has enough free disk space to process the file and store the data.
Sort your data
Within each split file with your import data, sort the data based on how it will be stored in CockroachDB. For example, in the previous example where a dataset was partitioned by an alphabetic string primary key, within each split file the data should be sorted by the alphabetic primary key.
In tests performed by Cockroach Labs, the best import performance, measured in both throughput and overall time, was observed when data was split into multiple files and sorted within each file. Imports using an unsplit and unsorted dataset were about 16 times slower than imports using the same dataset with split and sorted files.
If you cannot both split and sort your dataset, the performance of either split or sorted data was similar, around 2 times slower than both split and sorted data, but still much faster than both unsplit and unsorted data.