Loading CSV files into Redshift and BigQuery is hard — our Records Mover tool fixes that
TL;DR: BlueLabs is thrilled to announce the 1.0 release of Records Mover, a tool to make loading CSV files into databases (and unloading and more) as easy as it ought to be: https://github.com/bluelabsio/records-mover
Cloud-based analytics databases like Amazon Redshift and Google BigQuery have become the de facto standard for many types of analytics work in the political and commercial data world, and there are good reasons why. They easily scale to the multiple terabytes needed to store voter and consumer information. They automate away pesky tasks like high availability, backups and server maintenance. And best of all, they let folks write queries in SQL, a language that a ton of people from different backgrounds know! This has been revolutionary for many types of data analysis – small technology teams can support skilled groups of analysts and data scientists from specialties relevant to the projects at hand.
So what’s the problem?
Getting data into these databases is still stupid hard
If you’ve ever tried to load a non-trivial amount of data into one of these databases, you know this problem. In my experience, it goes something like this:
- Get a copy of the data, almost always in CSV form.
- Figure out that you need to create a table.
- Realize that you are living the denormalized life and have 150 columns per row.
- Stare quietly into your coffee cup for a bit.
- Sit down and write 150 lines of CREATE TABLE statement.
- Cobble together the SQL to load the file after some intensive Googling.
- If you’re using Redshift, remember that the file needs to be in S3 before it can be loaded, and figure out how to get it there from whatever random place it currently lives.
- Figure out the Redshift syntax to feed in credentials, regions and who knows what else is needed to make the load work.
- With some sense of accomplishment, issue your LOAD/COPY command.
- Almost immediately receive a plethora of load errors that you can only decode via a super obscure incantation. Hope you wrote it down from last time!
- Figure out how to specify the right delimiter, line ending, quoting behavior, escaping behavior and phase of the moon.
- Repeat, and find out (at least in Redshift) that your string columns are too short for some of your data.
- Field a few questions on why this critical file isn’t loaded yet…
- …
Can we make this better?
Announcing Records Mover
We at BlueLabs load a lot of data into Redshift and BigQuery to create insights for our clients. It became clear over time that we needed a better approach to this problem. We built a tool and we’re thrilled to announce its public 1.0 release: https://github.com/bluelabsio/records-mover
Records Mover automates the following:
- Constructing a CREATE TABLE statement matching the types of data in the file (including string lengths!).
- For Redshift, copying files to a temporary S3 location where they can be loaded from.
- Determining exactly what subformat of CSV file we’re dealing with by sampling the file, and constructing a command to send to the database to load the file.
It works with the following databases, using the most efficient loading technique on each:
- MySQL
- PostgreSQL
- Amazon Redshift
- Micro Focus Vertica
- Google BigQuery
Here’s an example of loading a file using Records Mover and its CLI (“mvrec”):
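A minimal sketch, assuming the argument order follows the table2table example further down; the database name, schema, table and filename here are all hypothetical, and the CLI’s built-in help has the authoritative usage.

mvrec file2table ./voters.csv my_redshift my_schema voters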
But there’s more
In addition to loading files into databases, Records Mover also knows how to “unload” data by exporting it from database tables back out to files.
In fact, Records Mover can move table-like data (“records”) between lots of different places:
- Individual files on disk/S3/GCS (CSV and in some cases Parquet)
- Database tables
- Tabs in Google Sheets
- Redshift Spectrum-compatible S3 buckets
- Directories on disk/S3/GCS (including some metadata to save off what Records Mover knows about the original source)
For instance, here’s how you’d use Records Mover to copy a table between two different databases (of different types!):
mvrec table2table my_redshift my_schema my_table my_bigquery my_project.my_dataset my_table
Here’s the full list of modes supported by Records Mover:
- table2gsheet: Copy from table to gsheet
- table2table: Copy from table to table
- table2recordsdir: Copy from table to recordsdir
- table2url: Copy from table to url
- table2file: Copy from table to file
- table2spectrum: Copy from table to spectrum
- gsheet2gsheet: Copy from gsheet to gsheet
- gsheet2table: Copy from gsheet to table
- gsheet2recordsdir: Copy from gsheet to recordsdir
- gsheet2url: Copy from gsheet to url
- gsheet2file: Copy from gsheet to file
- gsheet2spectrum: Copy from gsheet to spectrum
- recordsdir2gsheet: Copy from recordsdir to gsheet
- recordsdir2table: Copy from recordsdir to table
- recordsdir2recordsdir: Copy from recordsdir to recordsdir
- recordsdir2url: Copy from recordsdir to url
- recordsdir2file: Copy from recordsdir to file
- recordsdir2spectrum: Copy from recordsdir to spectrum
- url2gsheet: Copy from url to gsheet
- url2table: Copy from url to table
- url2recordsdir: Copy from url to recordsdir
- url2url: Copy from url to url
- url2file: Copy from url to file
- url2spectrum: Copy from url to spectrum
- file2gsheet: Copy from file to gsheet
- file2table: Copy from file to table
- file2recordsdir: Copy from file to recordsdir
- file2url: Copy from file to url
- file2file: Copy from file to file
- file2spectrum: Copy from file to spectrum
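As another sketch, unloading a table into a records directory on S3 might look like the line below. The table2recordsdir mode comes from the list above, but the argument order shown (database name, schema and table, then a destination URL) is an assumption based on the table2table example, and the bucket path is hypothetical; the CLI’s built-in help has the authoritative usage.

mvrec table2recordsdir my_redshift my_schema voters s3://my-bucket/voters/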
Use it!
Installation and usage instructions for Records Mover are here: https://github.com/bluelabsio/records-mover
We at BlueLabs know the pain of ingesting data, and we hope this is helpful. We’d encourage you to reach out with your experience with Records Mover — either via email to opensource@bluelabs.com or at https://github.com/bluelabsio/records-mover/issues