Error while running UNLOAD to PARQUET query using column names with spaces in
Introduction
I have a table in Athena with the following column names ["column space 1", "column space 2"]. I cannot run an UNLOAD to Parquet query on a table that contains column names with spaces in.
The documentation states that one can use non-standard column names by wrapping them in "" (backticks no longer supported). This works for standard queries, but not UNLOAD or CTAS queries.
I cannot change the column names -- I must be able to support column names with spaces in.
The table was created using Glue with the following parameters:
{'Name': 'dummy_table', 'Parameters': {'classification': 'parquet'}, 'StorageDescriptor': {'Columns': [{'Name': 'column space 1', 'Type': 'string'}, {'Name': 'column space 2', 'Type': 'string'}], 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'Location': '<input_parquet_location>', 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'SerdeInfo': {'Parameters': {'serialization.format': '1'}, 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'}}, 'TableType': 'EXTERNAL_TABLE'}
The Problem
I can query this table fine using the Athena Query Editor e.g. SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table" limit 10;
returns the correct result.
However, I receive an error when I try to run the following UNLOAD query to dump the table into a PARQUET file:
UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'PARQUET', compression = 'SNAPPY')
Error: GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1: optional binary column space
Things I've Tried
I can successfully run the same UNLOAD query using {ORC, CSV, TEXTFILE} formats
e.g. UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'ORC', compression = 'SNAPPY')
I've tried a CREATE TABLE AS
query, but am faced with the same error - GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1
CREATE TABLE "{database_name}"."{table_name}_unload"
WITH (
format = 'Parquet',
table_type = 'HIVE',
external_location = '{output_parquet_location}',
is_external = true,
write_compression = 'SNAPPY'
) AS
SELECT "column space 1" FROM "{database_name}"."{table_name}"
My question
It seems there is a problem specifically with trying to run an UNLOAD to Parquet query while using column names with spaces in.
Is this a bug or expected behaviour? I can see no reference in the documentation, and would be surprised if non-standard column names are not supported for the specific case of Unloading to Parquet, but is supported with ORC, CSV, TEXTFILE
Is there any way that I can UNLOAD a table that contains non-standard column names to a PARQUET file?
I'd be very grateful for any help or guidance,
Toby
Many thanks for your answer.
I am not sure it is correct that the Parquet file format has strict requirements around non-standard column names, and believe it to be a limitation enforced by Athena rather than the Parquet file format.
As evidence to back this up, I can use pandas to create a parquet file that contains column names with spaces
df.to_parquet("file.parquet")
. This can be read by other tools such as Dask and Glue.