How to download data from a Redshift table as a CSV file

Recently I needed to create a scheduled task that stores the result of a SELECT query against an Amazon Redshift table as a CSV file and loads it into a third-party business intelligence service. I expected the query to return a few million rows, and I had to implement the task in Ruby.

First, I tried to select the data in chunks of 100,000 rows with multiple SELECT queries, appending each query result to a CSV file. That solution was too slow, so I decided to look for an alternative.
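
For reference, my first, slow attempt looked roughly like the following sketch. It assumes the pg and csv gems; the connection settings and file name are placeholders.

require "pg"
require "csv"

# Sketch of the first, slow approach: page through the table with
# LIMIT/OFFSET and append each chunk to a local CSV file.
# Connection settings are placeholders.
conn = PG.connect(host: "redshift-host", port: 5439, dbname: "analytics",
                  user: "user", password: "secret")

CHUNK_SIZE = 100_000

CSV.open("users.csv", "w") do |csv|
  offset = 0
  loop do
    result = conn.exec_params(
      "SELECT id, email FROM users ORDER BY id LIMIT $1 OFFSET $2",
      [CHUNK_SIZE, offset]
    )
    break if result.ntuples.zero?

    result.each_row { |row| csv << row }
    offset += CHUNK_SIZE
  end
end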

I immediately found the UNLOAD command in the Redshift documentation, which unloads the result of a query to one or more files on S3. The command accepts a SQL query, an S3 object path prefix, and a few options. Its syntax looks like this:

UNLOAD ('select_statement')  
TO 's3://object_path_prefix'  
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials' 
[ option [ ... ] ]

where option is
{ MANIFEST
| DELIMITER [ AS ] 'delimiter_char' 
| FIXEDWIDTH [ AS ] 'fixedwidth_spec' }  
| ENCRYPTED
| GZIP     
| ADDQUOTES 
| NULL [ AS ] 'null_string'
| ESCAPE
| ALLOWOVERWRITE
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]

A simple UNLOAD query would look like this:

UNLOAD ('  
  SELECT
    id,
    email
  FROM users
')  
TO 's3://bucket_name/file_prefix_'  
CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'  
ESCAPE  
PARALLEL OFF;  
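
Since the task itself was written in Ruby, the UNLOAD statement can be run like any other query. Here's a minimal sketch with the pg gem; the connection settings are placeholders, and the bucket name and credentials are the same dummy values as above.

require "pg"

conn = PG.connect(host: "redshift-host", port: 5439, dbname: "analytics",
                  user: "user", password: "secret")

# UNLOAD is executed like any other statement; the query result is
# written to S3 rather than returned to the client.
unload_sql = <<~SQL
  UNLOAD ('SELECT id, email FROM users')
  TO 's3://bucket_name/file_prefix_'
  CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'
  ESCAPE
  PARALLEL OFF;
SQL

conn.exec(unload_sql)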

However, there are several details of this command that may need extra configuration. In this post I want to cover only a few of them.

UNLOAD may create more than one data file, since it splits the output into files of at most 6.2 GB each. The MANIFEST option creates a JSON file with the list of data file names. I was not expecting that much data, so I didn't use this option.
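
If you do use MANIFEST, the manifest is a small JSON document listing the generated data files, so reading the file names from Ruby is straightforward. A minimal sketch, assuming the manifest has already been downloaded locally (the local file name is a placeholder):

require "json"

# The manifest looks roughly like:
#   { "entries": [ { "url": "s3://bucket_name/file_prefix_0000_part_00" }, ... ] }
manifest = JSON.parse(File.read("file_prefix_manifest"))
data_file_urls = manifest["entries"].map { |entry| entry["url"] }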

If a data file already exists at the specified location, the command fails. The ALLOWOVERWRITE option avoids that by simply overwriting the existing data and manifest files.

The data files use the | (pipe) character as the column separator by default. That didn't matter in my case because the service accepted pipes as separators. If you need commas as separators, use the DELIMITER AS ',' option.

The data files don't have column headers, but sometimes they are required.

I added the column headers as the result of another query and merged it with the result of the main query using UNION. Since UNION fails if the queries return different data types in the same columns, I had to cast all the values in the result to VARCHAR. In addition, UNION doesn't guarantee the row order of the result, so the result has to be sorted by some column in a way that puts the column names first. In my case I sorted the rows by the first column in descending order: the header string 'User ID' sorts after any numeric ID cast to VARCHAR, so in descending order it always comes first.

The final query looked like this:

UNLOAD ('  
  (
    SELECT
      ''User ID'',
      ''E-mail'',
      ''Order Count'',
      ''Signup Date'',
      ''Last Visit Date'',
      ''Last Visit Date Group''
  )
  UNION
  (
    SELECT
      CAST(id AS varchar),
      email,
      CAST(order_count AS varchar),
      signup_date,
      CAST(last_visit AS varchar),
      last_visit_date_group
    FROM user_details
    WHERE email <> ''''
  )
  ORDER BY 1 DESC
')  
TO 's3://bucket_name/file_prefix_'  
CREDENTIALS 'aws_access_key_id=KEY1;aws_secret_access_key=KEY2'  
ALLOWOVERWRITE  
MANIFEST  
ESCAPE  
PARALLEL OFF;  

I used the s3cmd utility to download the data file and then sent it to the service.
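
That last step looked roughly like the sketch below. The data file name, the local path, and the upload helper are placeholders; the actual upload depends on the third-party service's API.

# Fetch the unloaded file with s3cmd and hand it over to the BI service.
# With PARALLEL OFF the data file name is the prefix followed by a
# numeric suffix such as 000 (placeholder here).
local_path = "users.csv"
unless system("s3cmd", "get", "s3://bucket_name/file_prefix_000", local_path)
  raise "s3cmd download failed"
end

upload_to_bi_service(local_path)  # hypothetical helper for the BI service upload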