Have you thought of trying out AWS Athena to query your CSV files in S3? This post outlines some steps you would need to do to get Athena parsing your files correctly. Let's walk through it step by step.
Pet data
Let's start with a simple data about our pets.$ cat pet.csv date_of_birth,pet_type,pet_name,weight,age "July 10, 2016",cat,"Bella, The Great",10.5,1 "January 1, 2015",cat,Tiger,4.2,2 "January 10, 2007",cat,Chole,6.5,3 "March 10, 2007",dog,"Max, The Big",20.1,2 "June 2, 2007",dog,Molly,22.8,3 "August 1, 2011",dog,Lulu,40.4,2 "September 11, 2007",dog,Coco,35.6,5
Our pet data has some dates, doubles, ints, and strings with commas in it. It is a good representation of a typical CSV file data you might have. I uploaded this pet.csv into an S3 location:
S3://test-athena-linh/pet/
, then headed to AWS Athena to create the table. (NOTE: run any required Drop table statement before running create statement).CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data ( `date_of_birth` date, `pet_type` string, `pet_name` string, `weight` double, `age` int ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',' ) LOCATION 's3://test-athena-linh/pet/' TBLPROPERTIES ('has_encrypted_data'='false');
Our DDL is quite ambitious with double, int, and date as the data types. Let's see if Athena can parse our data correctly. Viewing the data is interesting because with the above table definition Athena doesn't parse the comma in quote correctly using LazySimpleSerDe. Because of having a comma in the middle of a field, columns are shifted.
Columns are shifted with quoted comma |
Quirk #1: with all default setting, Athena doesn't deal with quoted commas in CSV very well.
LazySimpleSerDe doesn't handle quoted comma very well. I've discovered OpenCSVSerde can work with quoted commas by specifying quoteChar = '"'. Let's give it a try:
CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data ( `date_of_birth` date, `pet_type` string, `pet_name` string, `weight` double, `age` int ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'quoteChar' = '"', 'field.delim' = ',' ) LOCATION 's3://test-athena-linh/pet/' TBLPROPERTIES ('has_encrypted_data'='false');
Create table ran fine, but I hit an error when viewing the data. The error message wasn't very helpful, it just mentioned the column rather than the actual value causing the error.
Athena error message is not very detailed |
Quirk #2: When there is an error, there is little extra information about why it occurs.
I am guessing error relates to parsing data type. Let's change all types to 'string' and see what the data look like.
CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data ( `date_of_birth` string, `pet_type` string, `pet_name` string, `weight` string, `age` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'quoteChar' = '"', 'field.delim' = ',' ) LOCATION 's3://test-athena-linh/pet/' TBLPROPERTIES ('has_encrypted_data'='false');
Now with all string data types, our CSV can be read correctly with quoted commas.
No error when parsing String data type |
From the output, we can see header row is included and break type parsing. Searching on the Internet suggested OpenCSVSerde has a config in TBLPROPERTIES 'skip.header.line.count'='1' which could be useful. However, trying it out in Athena didn't lead to the expected outcome. AWS Support confirmed: "it's a known issue in Athena that property "skip.header.line.count"="1" does not work because of the following Presto issue: https://github.com/prestodb/presto/issues/1848. Hive does honour the skip.header.line property and skips header while querying the table. However, Presto displays the header record when querying the same table. The Athena Product team is aware of this issue and is planning to fix it."
Quirk #3: header row is included in the result set when using OpenCSVSerde.
We can certainly exclude header using query condition, but we can't do arithmetic operations (SUM, AVG) on strings.
Arithmetic on Strings doesn't work |
To move forward with our data and accomodating all Athena quirks so far, we will need to run CREATE table as strings and do type conversion on the fly.
SELECT SUM(weight) FROM ( SELECT date_of_birth, pet_type, pet_name, cast(weight AS DOUBLE) as weight, cast(age AS INTEGER) as age FROM athena_test."pet_data" WHERE date_of_birth <> 'date_of_birth' )
Typecasting in an inner query allows an outer query to do arithmetic |
Yayyy, we got the expected result.
Let's attempt to parse the date in the format "MMMM dd, yyyy" (e.g. "July 10, 2017") as we might want to do arithmetic on dates rather than leaving it as a string. Presto has some
date_parse
function taking in a string and a pattern, date patterns are defined here: https://prestodb.io/docs/current/functions/datetime.html. The above date pattern is represented as '%M %d, %Y'
.select date_parse('July 10, 2017', '%M %d, %Y'); -- result: 2017-07-10 00:00:00.000
Our complete casting of pet_data to its right data types looks like this:
SELECT date_parse(date_of_birth, '%M %d, %Y') as date_of_birth, pet_type, pet_name, cast(weight AS DOUBLE) as weight, cast(age AS INTEGER) as age FROM athena_test."pet_data" WHERE date_of_birth <> 'date_of_birth'
Unfortunately, we will need to wrap this as a subquery for any query we want to do on our pet_data table. It would be better to be able to create a view on this. Potentially we could create a view to encapsulate the inner query, but as far as I know, Athena doesn't support View. I have tried a version of Presto Create Table AS but hit an error. I think Athena SQL is not the full set of Presto SQL: https://prestodb.io/docs/current/sql/create-table-as.html
-- This query has error CREATE TABLE athena_test.pet_data_typed(date_of_birth, pet_type, pet_name, weight, age) AS SELECT date_of_birth, pet_type, pet_name, cast(weight AS DOUBLE) as weight, cast(age AS INTEGER) as age FROM athena_test."pet_data" WHERE date_of_birth <> 'date_of_birth'
Athena doesn't support CREATE TABLE AS |
Quirk #4: Athena doesn't support View
From my trial with Athena so far, I am quite disappointed in how Athena handles CSV files. There is a lot of fiddling around with typecasting. Not sure what I did wrong there, please point out how I could improve on the above if you have a better way, and thanks in advance. I've heard Athena might work better with JSON or Avro files, I will try this out next. I will also write about my experience with Google BigQuery in future blog posts. Stay tuned!