Loading CSV files into Redshift and BigQuery is hard — our Records Mover tool fixes that
TL;DR: BlueLabs is thrilled to announce the 1.0 release of Records Mover, a tool to make loading CSV files into databases (and unloading and more) as easy as it ought to be: https://github.com/bluelabsio/records-mover
Cloud-based analytics databases like Amazon Redshift and Google BigQuery have become the de facto standard for many types of analytics work in the political and commercial data world, and there are good reasons why. They easily scale to the multiple terabytes needed to store voter and consumer information. They automate away pesky tasks like high availability, backups and server maintenance. And best of all, they let folks write queries in SQL, a language that a ton of people from different backgrounds know! This has been revolutionary for many types of data analysis – small technology teams can support skilled groups of analysts and data scientists from specialties relevant to the projects at hand.
So what’s the problem?
Getting data into these databases is still stupid hard
If you’ve ever tried to load a non-trivial amount of data into one of these databases, you know this problem. In my experience, it goes something like this:
- Get a copy of the data, almost always in CSV form.
- Figure out that you need to create a table.
- Realize that you are living the denormalized life and have 150 columns per row.
- Stare quietly into your coffee cup for a bit.
- Sit down and write 150 lines of CREATE TABLE statement.
- Cobble together the SQL to load the file after some intensive Googling.
- If you’re using Redshift, remember that the file needs to be in S3 before it can be loaded, and figure out how to get it there from whatever random place it currently lives.
- Figure out the Redshift syntax to feed in credentials, regions and who knows what else is needed to make the load work.
- With some sense of accomplishment, issue your LOAD/COPY command.
- Almost immediately receive a plethora of load errors that you can only decode via a super obscure incantation. Hope you wrote it down from last time!
- Figure out how to specify the right delimiter, line ending, quoting behavior, escaping behavior and phase of the moon.
- Repeat, and find out (at least in Redshift) that your string columns are too short for some of your data.
- Field a few questions on why this critical file isn’t loaded yet…
- …
Can we make this better?
Announcing Records Mover
We at BlueLabs load a lot of data into Redshift and BigQuery to create insights for our clients. It became clear over time that we needed a better approach to this problem. We built a tool and we’re thrilled to announce its public 1.0 release: https://github.com/bluelabsio/records-mover
Records Mover automates the following:
- Constructing a CREATE TABLE statement matching the types of data in the file (including string lengths!).
- For Redshift, copying files to a temporary S3 location where they can be loaded from.
- Determining exactly what subformat of CSV file we’re dealing with by sampling the file, and constructing a command to send to the database to load the file.
It works with the following databases, using the most efficient loading technique on each:
- MySQL
- PostgreSQL
- Amazon Redshift
- Micro Focus Vertica
- Google BigQuery
Here’s an example of loading a file using Records Mover and its CLI (“mvrec”):
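A minimal sketch, assuming the argument order follows the table2table example further down; the database name, schema, table and filename here are all hypothetical, and the CLI’s built-in help has the authoritative usage.

mvrec file2table ./voters.csv my_redshift my_schema voters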
But there’s more
In addition to loading files into databases, Records Mover also knows how to “unload” data by exporting it from database tables back out to files.
In fact, Records Mover can move table-like data (“records”) between lots of different places:
- Individual files on disk/S3/GCS (CSV and in some cases Parquet)
- Database tables
- Tabs in Google Sheets
- Redshift Spectrum-compatible S3 buckets
- Directories on disk/S3/GCS (including some metadata to save off what Records Mover knows about the original source)
For instance, here’s how you’d use Records Mover to copy a table between two different databases (of different types!):
mvrec table2table my_redshift my_schema my_table my_bigquery my_project.my_dataset my_table
Here’s the full list of modes supported by Records Mover:
- table2gsheet: Copy from table to gsheet
- table2table: Copy from table to table
- table2recordsdir: Copy from table to recordsdir
- table2url: Copy from table to url
- table2file: Copy from table to file
- table2spectrum: Copy from table to spectrum
- gsheet2gsheet: Copy from gsheet to gsheet
- gsheet2table: Copy from gsheet to table
- gsheet2recordsdir: Copy from gsheet to recordsdir
- gsheet2url: Copy from gsheet to url
- gsheet2file: Copy from gsheet to file
- gsheet2spectrum: Copy from gsheet to spectrum
- recordsdir2gsheet: Copy from recordsdir to gsheet
- recordsdir2table: Copy from recordsdir to table
- recordsdir2recordsdir: Copy from recordsdir to recordsdir
- recordsdir2url: Copy from recordsdir to url
- recordsdir2file: Copy from recordsdir to file
- recordsdir2spectrum: Copy from recordsdir to spectrum
- url2gsheet: Copy from url to gsheet
- url2table: Copy from url to table
- url2recordsdir: Copy from url to recordsdir
- url2url: Copy from url to url
- url2file: Copy from url to file
- url2spectrum: Copy from url to spectrum
- file2gsheet: Copy from file to gsheet
- file2table: Copy from file to table
- file2recordsdir: Copy from file to recordsdir
- file2url: Copy from file to url
- file2file: Copy from file to file
- file2spectrum: Copy from file to spectrum
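As another sketch, unloading a table into a records directory on S3 might look like the line below. The table2recordsdir mode comes from the list above, but the argument order shown (database name, schema and table, then a destination URL) is an assumption based on the table2table example, and the bucket path is hypothetical; the CLI’s built-in help has the authoritative usage.

mvrec table2recordsdir my_redshift my_schema voters s3://my-bucket/voters/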
Use it!
Installation and usage instructions for Records Mover are here: https://github.com/bluelabsio/records-mover
We at BlueLabs know the pain of ingesting data, and we hope this is helpful. We’d encourage you to reach out with your experience with Records Mover — either via email to opensource@bluelabs.com or at https://github.com/bluelabsio/records-mover/issues