Read/write complex types to Redshift from Spark

I am having difficulty reading and writing complex types between Spark and Redshift. I am using the EMR 6.12 image. One point of confusion is that the docs here state that Parquet should be used as the temp file format. That does not seem to be supported; see these docs. I also get an error saying Parquet is not supported.

Here is a minimal code example:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Schema with a complex (array) column
schema = StructType([
    StructField("name", StringType(), True),
    StructField("hobbies", ArrayType(StringType(), True))])

# Sample data
data = [
    ("Alice", ["reading", "swimming"]),
    ("Bob", ["hiking", "cooking"]),
    ("Charlie", ["gaming", "coding"])
]

df = spark.createDataFrame(data, schema=schema)

# Write 
(
    df.write.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", url)
    .option("dbtable", table_name)
    .option("tempdir", temp_s3)
    .option("tempformat", "AVRO")  # Throws an error if I use PARQUET
    .option("forward_spark_s3_credentials", True)
    .mode("overwrite")
    .save()
)

# read it back
df_new = (
        sql_context.read.format("io.github.spark_redshift_community.spark.redshift")
        .option("url", url)
        .option("dbtable",  table_name)
        .option("tempdir", temp_s3)
        .option("forward_spark_s3_credentials", True)
        .schema(schema)
        .load()
)

When I try to do anything with df_new, I get an error about parsing the hobbies column. If I don't specify the schema, the column is read as a string.
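
For what it's worth, the fallback I have in mind is to drop the .schema(schema) from the read and parse the string column afterwards. This is only a sketch, assuming the column really comes back as a JSON-style string (df_str below is hypothetical: the same read as above, just without the schema):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StringType

# df_str: same read as above without .schema(schema), so hobbies arrives as a string
df_parsed = df_str.withColumn("hobbies", from_json(col("hobbies"), ArrayType(StringType())))
df_parsed.printSchema()  # hobbies should now be array<string>; nulls mean the string wasn't parseable JSON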

Asked 7 months ago · Viewed 452 times

1 Answer

Hi,

The "tempformat" option of Spark's Redshift connector supports CSV, GZIP, and Parquet as values when writing complex types to Redshift; AVRO isn't supported as a "tempformat" for writes, so attempting to use it will throw an exception. To address the read issue, double-check that the schema used to read the data back from Redshift is accurate and matches the structure of the stored data. Incompatibilities between the specified schema and the actual data can result in parsing errors.
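
One way to check for such a mismatch (a sketch only, reusing the options and the schema variable from your question) is to read the table back without passing a schema and compare what the connector infers with what was written:

# Read back without an explicit schema and compare against the write-side schema
df_check = (
    sql_context.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", url)
    .option("dbtable", table_name)
    .option("tempdir", temp_s3)
    .option("forward_spark_s3_credentials", True)
    .load()
)
print(df_check.schema.simpleString())  # what the connector reports for the stored data
print(schema.simpleString())           # what the write side intended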

If there is no apparent schema discrepancy, the parsing errors could be related to the data in the Parquet files themselves, such as null values, data types other than the ones expected, or structural issues. Make sure the files don't contain data types or null values that conflict with the Redshift table's schema.
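
If you want to rule the temp files in or out, you can also load them straight from S3 and inspect them. This is a sketch: the exact sub-paths under temp_s3 depend on the run, and the "avro" reader assumes spark-avro is available (it is on EMR); swap in spark.read.parquet for a Parquet tempformat.

# Inspect the temp files the connector wrote, looking for schema surprises or nulls
df_tmp = (
    spark.read.format("avro")
    .option("recursiveFileLookup", "true")  # temp files sit in job-specific sub-folders
    .load(temp_s3)
)
df_tmp.printSchema()
df_tmp.filter("hobbies is null").show()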

As you mentioned, when a schema is not explicitly specified during the read, Spark assumes string types for columns it cannot infer, which can cause problems with complex data types like arrays. Specifying an accurate schema is therefore crucial for correctly interpreting and working with complex structures such as arrays in your DataFrame. Additional reading: DataFrame, DynamicFrame class.

AWS
BezuW
Answered 6 months ago
  • Thanks for the response. What version of the connector are you using? I can only write with Avro. If I use Parquet I get an error. I am using EMR Serverless 6.12.
