Hi,
The "tempformat" option in Spark's Redshift connector supports CSV, CSV GZIP, and Parquet as values when writing complex types to Redshift. AVRO isn't supported as a tempformat for this case, so attempting to use it will throw an exception. To address the issue, double-check that the schema used to load the Parquet files into Redshift is accurate and matches the structure of the stored data; mismatches between the specified schema and the actual data can result in parsing errors. A sketch of a Parquet-tempformat write is shown below.
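Here is a minimal, hedged sketch of writing a DataFrame containing complex types to Redshift with the Parquet tempformat. The connector format string, JDBC URL, S3 paths, table name, and IAM role below are placeholders and assume the spark-redshift-community connector; exact option names and supported tempformat values can vary by connector version.

```python
# Sketch only: writing complex types to Redshift using the Parquet tempformat.
# All identifiers (bucket, cluster, table, role) are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-write-example").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical source data

(df.write
   .format("io.github.spark_redshift_community.spark.redshift")  # connector format (version-dependent)
   .option("url", "jdbc:redshift://my-cluster:5439/dev?user=USER&password=PASS")  # placeholder JDBC URL
   .option("dbtable", "public.target_table")                      # placeholder target table
   .option("tempdir", "s3://my-bucket/redshift-temp/")            # S3 staging location used for COPY
   .option("tempformat", "PARQUET")                               # Parquet tempformat for complex types
   .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")  # placeholder IAM role
   .mode("append")
   .save())
```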
If there's no apparent schema discrepancy, parsing errors when loading Parquet files into Redshift can come from the data itself, such as null values, data types different from what the table expects, or structural issues. Make sure the Parquet files don't contain unexpected or incompatible data types or null values that conflict with the Redshift table's schema; a quick way to inspect the staged files is sketched below.
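This is a small sketch, assuming a hypothetical staging path, for inspecting the staged Parquet files before Redshift tries to COPY them: print the inferred schema to compare against the Redshift table definition, and count nulls per column to spot values that might violate NOT NULL columns.

```python
# Sketch only: inspect staged Parquet files for schema and null issues.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-inspection").getOrCreate()

parquet_df = spark.read.parquet("s3://my-bucket/redshift-temp/")  # hypothetical staging path

# Compare this against the Redshift table's column types.
parquet_df.printSchema()

# Count nulls per column to find values that may conflict with the table schema.
null_counts = parquet_df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in parquet_df.columns]
)
null_counts.show(truncate=False)
```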
As you mentioned, when a schema isn't explicitly specified during read operations, Spark falls back to string types for columns whose schema it can't infer, which can cause issues with complex data types like arrays. Specifying an accurate schema is therefore crucial for correctly interpreting complex structures such as arrays in your DataFrame; see the sketch below. For additional reading, see the DataFrame and DynamicFrame class documentation.
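Here is a minimal sketch of supplying an explicit schema so an array column isn't inferred as a string. The column names and the JSON source path are hypothetical; the same `.schema(...)` call applies to other self-describing or schema-less sources.

```python
# Sketch only: read with an explicit schema so array columns keep their type.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder.appName("explicit-schema-read").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("tags", ArrayType(StringType()), nullable=True),  # complex array column
])

df = spark.read.schema(schema).json("s3://my-bucket/raw-json/")  # hypothetical source path
df.printSchema()  # confirms "tags" is array<string>, not string
```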
Thanks for the response. What version of the connector are you using? I can only write with Avro; if I use Parquet I get an error. I am using EMR Serverless 6.12.