How do I resolve data incompatibility errors in Redshift Spectrum?

6 minute read
0

I'm trying to use an external schema, object, or file format in Amazon Redshift Spectrum. However, I receive an error. How do I troubleshoot this error?

Resolution

Incompatible data format error

To resolve your incompatible data format error in Redshift Spectrum, perform the following steps:

1.    Retrieve the complete error message from the SVL_S3LOG system view:

select * from SVL_S3LOG where query = '<query_ID_of_the_Spectrum_query>';

A mismatch in incompatible Parquet schema produces the following error message:

File 'https://s3bucket/location/file has an incompatible Parquet schema for column ‘s3://s3bucket/location.col1'. Column type: VARCHAR, Par...

2.    Check the Message column to view the error description. The error description explains the data incompatibility between Redshift Spectrum and the external file.

3.    FSPCheck the schema of your external file, and then compare it with the column definition in the CREATE EXTERNAL TABLE definition.

4.    (Optional) If the column definition in the Apache Parquet file differs from the external table, modify the column definition in the external table. The column definition must match the columnar file format of the Apache Parquet file.

5.    Run the following query for the SVV_EXTERNAL_COLUMNS view:

select * from SVV_EXTERNAL_COLUMNS where schemaname = '<ext_schema_name>' and tablename = '<ext_table_name>';

This query checks the data type of the column in the CREATE EXTERNAL TABLE definition.

Note: For columnar file formats such as Apache Parquet, the column type is embedded with the data. The column type in the CREATE EXTERNAL TABLE definition must match the column type of the data file. Mismatched column definitions result in a data incompatibility error.

Invalid type length error

If you select a Redshift Spectrum table with columns that are of DECIMAL data type, you might encounter the following error:

S3 Query Exception (Fetch). Task failed due to an internal error. File ‘https://s3.amazonaws.com/…/<file_name>’ has an incompatible Parquet schema for column ‘<column_name>’
column ‘<column_name>’ has an invalid type length

To resolve the invalid type length error in Redshift Spectrum, use an external table definition. The table definition must match the "precision" and "scale" values defined in the external file. For example:

create external table ext_schema.tablename (c1 int, c2 decimal (6,2)) stored as PARQUET location 's3://.../.../';

In this example, the updated values (in the c2 decimal column) for "precision" and "scale" values are set to 6 and 2, respectively. Therefore, the CREATE EXTERNAL TABLE definition values listed in the c2 column must match the values defined in the Apache Parquet file.

Internal error

If you select an external schema from an Amazon Athena catalog, you might receive the following error in Redshift Spectrum:

Task failed due to an internal error. File 'https://s3...snappy.parquet  has an incompatible Parquet schema for column 's3://.../tbl.a'. Column type: BOOLEAN, Parquet schema:\noptional int32 b [i:26 d:1 r:0]

In Redshift Spectrum, the column ordering in the CREATE EXTERNAL TABLE must match the ordering of the fields in the Parquet file. For Apache Parquet files, all files must have the same field orderings as in the external table definition. If you skip this ordering or rearrange any data type column, you receive an internal error.

Note: Although you can import Amazon Athena data catalogs into Redshift Spectrum, running a query might not work in Redshift Spectrum. In Redshift Spectrum, column names are matched to Apache Parquet file fields. Meanwhile, Amazon Athena uses the names of columns to map to fields in the Apache Parquet file.

To resolve the internal error, specify the following column names in the SELECT statement:

select col_1, col_2, col_3, .... col_n from athena_schema.tablename;

Also, be sure that the AWS Identity and Access Management (IAM) role allows access to Amazon Athena. For more information, see IAM policies for Amazon Redshift Spectrum.

Invalid column type error

If you use Redshift Spectrum to query VARCHAR data type columns from an AWS Glue Data Catalog table, you might receive the following error:

<column_name> - Invalid column type for column <column_name>. Type: varchar"

Both AWS Glue and Redshift Spectrum support the VARCHAR data type. However, the VARCHAR data type defined by AWS Glue Catalog doesn't include a size parameter (such as VARCHAR (256)). When Redshift Spectrum queries a VARCHAR column defined without a size parameter, the result is an invalid column type error.

To resolve the invalid column type error, perform the following steps:

1.    Run the following AWS Command Line Interface (AWS CLI) syntax to retrieve and store the AWS Glue table data in a local file:

aws glue get-table --region us-east-1 --database gluedb --name click_data_json > click-data-table.json

Note: If you receive an error while running an AWS CLI command, be sure to use the most recent version of the AWS CLI.

2.    Open the click-data-table.json file using any text editor and remove the outer {"Table": ...} envelope. For example, the updated configuration should now read like this:

{"Name": "my-table", ...}

3.    Remove any fields that aren't allowed in the UpdateTable action. For example, you can remove the following fields:
DatabaseName
Owner
CreateTime
UpdateTime
LastAccessTime
CreatedBy
IsRegisteredWithLakeFormation.

4.    Modify the STRING column types to "varchar" with the desired size parameter. For example:

"Type": "varchar(1000)"

5.    Use the following command syntax to update your AWS Glue table:

aws glue update-table --region us-east-1 --database gluedb --table-input "$(cat click-data-table.json)"

6.    Check your table definition in AWS Glue and verify that the data types have been modified.

7.    Query the AWS Glue table for the external schema in Amazon Redshift. For example:

create external schema glue_schema from data catalog database ‘gluedb’ iam_role 'arn:aws:iam::111111111111:role/myRedshiftRole' region 'us-east-1';

8.    Run the following query for click_data_json:

select * from glue_schema.click_data_json;

Invalid range error

Redshift Spectrum expects that files in Amazon Simple Storage Service (Amazon S3) that belong to an external table aren't overwritten during a query. If this happens, it can result in an error similar to the following:

Error: Spectrum Scan Error. Error: HTTP response error code: 416 Message: InvalidRange The requested range is not satisfiable

To prevent the preceding error, be sure Amazon S3 files aren't overwritten while they're queried with Redshift Spectrum.

Invalid Parquet version number

Redshift Spectrum verifies the metadata of each Apache Parquet file that it accesses. If the verification fails, it might result in an error similar to the following:

File 'https://s3.region.amazonaws.com/s3bucket/location/file has an invalid version number

The verification can fail for the following reasons:

  • The Parquet file was overwritten during the query
  • The Parquet file is corrupt

Related information

Troubleshooting queries in Amazon Redshift Spectrum

AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago