You may want to use the table property ('skip.header.line.count'='1') as part of your table DDL; this works for both Redshift Spectrum and Athena.
The question mentions issues with the first column; skip.header.line.count is for skipping rows.
Thank you, but unfortunately this drops the entire first row. The first row is valid data, and when I run the select in Redshift, all of the columns in the first row are correct except the first column, which is the filename plus some characters, as stated.
@AWS-User-3359539 I was able to read it fine. I did not understand why you are trying to read an archive, but in any case, Redshift Spectrum handles gzip out of the box without any special handling.
To prove it, I took the data at s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv, which is publicly available and part of the Amazon Redshift Immersion Day labs.
Copy the data to your own S3 bucket, say my-bucket. You can use AWS CloudShell to create a gzip and a tar file and move them to your bucket (a sketch of the commands follows the listing below), producing a layout like:
[cloudshell-user@ip-XX-XX-XX-XX ~]$ aws s3 ls my-bucket/
PRE csv/
PRE gz/
PRE tar/
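For illustration, the CloudShell commands to produce that layout might look roughly like the following. This is only a sketch, assuming the public source file above and a bucket named my-bucket; adjust names and paths for your own setup.
# download the public sample file
aws s3 cp s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv .
# plain CSV copy
aws s3 cp green_tripdata_2016-01.csv s3://my-bucket/csv/
# gzip copy (-k keeps the original file)
gzip -k green_tripdata_2016-01.csv
aws s3 cp green_tripdata_2016-01.csv.gz s3://my-bucket/gz/
# tar.gz copy
tar -czf green_tripdata_2016-01.tar.gz green_tripdata_2016-01.csv
aws s3 cp green_tripdata_2016-01.tar.gz s3://my-bucket/tar/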
- Create a Spectrum Schema
create external schema myspectrum_schema
from data catalog
database 'myspectrum_db'
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/XXXXXX-RedshiftIAMRole-XXXXXX'
create external database if not exists;
- Create table for CSV (skipping 2 lines as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv
(
vendorid VARCHAR(4),
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1),
ratecode INT,
pickup_longitude FLOAT4,
pickup_latitude FLOAT4,
dropoff_longitude FLOAT4,
dropoff_latitude FLOAT4,
passenger_count INT,
trip_distance FLOAT4,
fare_amount FLOAT4,
extra FLOAT4,
mta_tax FLOAT4,
tip_amount FLOAT4,
tolls_amount FLOAT4,
ehail_fee FLOAT4,
improvement_surcharge FLOAT4,
total_amount FLOAT4,
payment_type VARCHAR(4),
trip_type VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/csv/'
table properties ('skip.header.line.count'='2');
- Create table for gz (skipping 2 lines as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_gz
(
vendorid VARCHAR(4),
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1),
ratecode INT,
pickup_longitude FLOAT4,
pickup_latitude FLOAT4,
dropoff_longitude FLOAT4,
dropoff_latitude FLOAT4,
passenger_count INT,
trip_distance FLOAT4,
fare_amount FLOAT4,
extra FLOAT4,
mta_tax FLOAT4,
tip_amount FLOAT4,
tolls_amount FLOAT4,
ehail_fee FLOAT4,
improvement_surcharge FLOAT4,
total_amount FLOAT4,
payment_type VARCHAR(4),
trip_type VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/gz/'
table properties ('skip.header.line.count'='2');
- Create table for tar.gz (skipping 2 lines as the file has a blank line after the header)
create external table myspectrum_schema.green_201601_csv_tar_gz
(
vendorid VARCHAR(4),
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1),
ratecode INT,
pickup_longitude FLOAT4,
pickup_latitude FLOAT4,
dropoff_longitude FLOAT4,
dropoff_latitude FLOAT4,
passenger_count INT,
trip_distance FLOAT4,
fare_amount FLOAT4,
extra FLOAT4,
mta_tax FLOAT4,
tip_amount FLOAT4,
tolls_amount FLOAT4,
ehail_fee FLOAT4,
improvement_surcharge FLOAT4,
total_amount FLOAT4,
payment_type VARCHAR(4),
trip_type VARCHAR(4)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/tar/'
table properties ('skip.header.line.count'='2');
- Verify results
SELECT * FROM "myspectrum_schema"."green_201601_csv";
SELECT * FROM "myspectrum_schema"."green_201601_csv_gz";
SELECT * FROM "myspectrum_schema"."green_201601_csv_tar_gz";
Hope this helps!
Please add the link to the documentation you followed.