Athena error while running UNLOAD to PARQUET query using column names with spaces in -- GENERIC_INTERNAL_ERROR: field ended by ';': expected ';'

0

Error while running UNLOAD to PARQUET query using column names with spaces in

Introduction

I have a table in Athena with the following column names ["column space 1", "column space 2"]. I cannot run an UNLOAD to Parquet query on a table that contains column names with spaces in.

The documentation states that one can use non-standard column names by wrapping them in "" (backticks no longer supported). This works for standard queries, but not UNLOAD or CTAS queries.

I cannot change the column names -- I must be able to support column names with spaces in.

The table was created using Glue with the following parameters:

{'Name': 'dummy_table', 'Parameters': {'classification': 'parquet'}, 'StorageDescriptor': {'Columns': [{'Name': 'column space 1', 'Type': 'string'}, {'Name': 'column space 2', 'Type': 'string'}], 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'Location': '<input_parquet_location>', 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'SerdeInfo': {'Parameters': {'serialization.format': '1'}, 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'}}, 'TableType': 'EXTERNAL_TABLE'}

The Problem

I can query this table fine using the Athena Query Editor e.g. SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table" limit 10; returns the correct result.

However, I receive an error when I try to run the following UNLOAD query to dump the table into a PARQUET file:

UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'PARQUET', compression = 'SNAPPY')

Error: GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1: optional binary column space

Things I've Tried

I can successfully run the same UNLOAD query using {ORC, CSV, TEXTFILE} formats e.g. UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'ORC', compression = 'SNAPPY')

I've tried a CREATE TABLE AS query, but am faced with the same error - GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1

CREATE TABLE "{database_name}"."{table_name}_unload"
WITH (
    format = 'Parquet',
    table_type = 'HIVE',
    external_location = '{output_parquet_location}',
    is_external = true,
    write_compression = 'SNAPPY'
) AS
SELECT "column space 1" FROM "{database_name}"."{table_name}"

My question

It seems there is a problem specifically with trying to run an UNLOAD to Parquet query while using column names with spaces in.

Is this a bug or expected behaviour? I can see no reference in the documentation, and would be surprised if non-standard column names are not supported for the specific case of Unloading to Parquet, but is supported with ORC, CSV, TEXTFILE

Is there any way that I can UNLOAD a table that contains non-standard column names to a PARQUET file?

I'd be very grateful for any help or guidance, Toby

toby
gefragt vor einem Monat591 Aufrufe
1 Antwort
0

The Parquet file format has more strict requirements around column names compared to formats like ORC, CSV or text files.

You can try the following:

  • Rename the columns in your table to remove any spaces before unloading to Parquet.
  • Use a different file format like ORC, CSV or text files that may support non-standard column names.
  • You could unload the data to S3 in a non-Parquet format, then use another tool like Spark to read from S3 and write the data back out as a new Parquet file after renaming the columns appropriately.

Unfortunately there doesn't seem to be a way currently to directly unload a table with non-standard column names to Parquet format from Athena. The column names need to adhere to the stricter naming requirements of the Parquet schema. Renaming columns is likely required as a workaround.

profile picture
EXPERTE
beantwortet vor einem Monat
  • Many thanks for your answer.

    I am not sure it is correct that the Parquet file format has strict requirements around non-standard column names, and believe it to be a limitation enforced by Athena rather than the Parquet file format.

    As evidence to back this up, I can use pandas to create a parquet file that contains column names with spaces df.to_parquet("file.parquet"). This can be read by other tools such as Dask and Glue.

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen