Recently I had to create a scheduled task to export the result of a SELECT query against an Amazon Redshift table as a CSV file and load it into a third-party business intelligence service. I was expecting the SELECT query to return a few million rows. I decided to implement this in Ruby, since that is the default language at the company.
First, I tried to select the data in chunks of 100,000 rows using multiple SELECT queries, appending each query result to a CSV file. That approach turned out to be too slow, so I decided to look for an alternative.
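The Ruby code isn't important here, but the chunked queries were essentially pagination queries along the lines of the sketch below (the column names and chunk boundaries are only for illustration). OFFSET-based pagination like this also tends to get slower as the offset grows, which doesn't help at this scale.

-- Hypothetical shape of a single 100,000-row chunk query;
-- the next chunk simply increases the OFFSET by 100,000.
SELECT id, email
FROM users
ORDER BY id
LIMIT 100000 OFFSET 200000;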
In the Redshift docs I found the UNLOAD command, which allows you to unload the result of a query to one or more files on S3. The command accepts a SQL query, an S3 object path prefix, and a few other options. Its syntax looks like this:
UNLOAD ('select_statement')
TO 's3://object_path_prefix'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]
where option is
{ MANIFEST
| DELIMITER [ AS ] 'delimiter_char'
| FIXEDWIDTH [ AS ] 'fixedwidth_spec' }
| ENCRYPTED
| GZIP
| ADDQUOTES
| NULL [ AS ] 'null_string'
| ESCAPE
| ALLOWOVERWRITE
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
A simple UNLOAD query would look like this:
UNLOAD ('
SELECT
id,
email
FROM users
')
TO 's3://bucket_name/file_prefix_'
CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'
ESCAPE
PARALLEL OFF;
Some of the command options are really interesting, but in this post I'll only talk about the ones relevant to my task.
UNLOAD may create more than one data file: it splits the exported data into files of up to 6.2 GB each, and with the default PARALLEL ON it also writes to multiple files according to the number of node slices (which is why the examples here use PARALLEL OFF). The command has the MANIFEST option to also create a JSON file with the list of data file names. Even though I was exporting a few million rows, I was not expecting the data to take more than 6.2 GB of space, so I didn't use this option.
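For reference, the manifest is a small JSON file that lists the URLs of the unloaded data files. It looks roughly like this (the bucket and file names below are placeholders):

{
  "entries": [
    {"url": "s3://bucket_name/file_prefix_0000_part_00"},
    {"url": "s3://bucket_name/file_prefix_0001_part_00"}
  ]
}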
If the data file already exists at the specified location, the UNLOAD command fails. The ALLOWOVERWRITE option makes it overwrite the data and manifest files instead.
The data files use the | (pipe) character as the column separator by default. That was fine for the purpose of my task: the service supported pipes as separators. If you need commas as separators, use the DELIMITER AS ',' option.
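As a variation on the simple example above, a comma-delimited unload could look like the sketch below. ADDQUOTES (from the option list above) wraps every field in quotation marks, which is worth considering when the values themselves may contain commas:

UNLOAD ('
SELECT
id,
email
FROM users
')
TO 's3://bucket_name/file_prefix_'
CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'
DELIMITER AS ','
ADDQUOTES
PARALLEL OFF;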
The data files don't have column headers, but sometimes they are required. And that was the tricky part for me.
I added the column headers as the result of another query that was merged with the result of the main query using UNION.
UNION in Redshift fails if the queries return rows with different data types for the same columns. Because of that I had to type cast all the values in the result to VARCHAR.
In addition, UNION doesn't guarantee the row order of the result. The result has to be sorted by some column in an order that ensures the column names come first.
In my case I sorted the rows by the ID column in descending order: the header string starts with a letter, and letters compare greater than digits, so in descending order the name always goes before any numerical ID.
The final query looked like this:
UNLOAD ('
(
SELECT
''User ID'',
''E-mail'',
''Order Count'',
''Signup Date'',
''Last Visit Date'',
''Last Visit Date Group''
)
UNION
(
SELECT
CAST(id AS varchar),
email,
CAST(order_count AS varchar),
signup_date,
CAST(last_visit AS varchar),
last_visit_date_group
FROM user_details
WHERE email <> ''''
)
ORDER BY 1 DESC
')
TO 's3://bucket_name/file_prefix_'
CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'
ALLOWOVERWRITE
MANIFEST
ESCAPE
PARALLEL OFF;
I used the s3cmd utility to download the data file and then imported it into the service.
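The download itself was just an s3cmd get of the unloaded object. The object name below is only an example, since the actual name depends on the prefix and on how UNLOAD numbers the output files (with PARALLEL OFF the first file gets a 000 suffix); the local file name is likewise made up:

s3cmd get s3://bucket_name/file_prefix_000 users.csv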