Read/write complex types to Redshift from Spark


I am having difficulty reading and writing complex types between Spark and Redshift. I am using the EMR-6.12 image. One point of confusion is that the docs here state that Parquet should be used as the temp file format. That does not seem to be supported (see these docs), and I also get an error saying Parquet is not supported.

Here is a minimal code example:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("hobbies", ArrayType(StringType(), True), True)])

# Sample data
data = [
    ("Alice", ["reading", "swimming"]),
    ("Bob", ["hiking", "cooking"]),
    ("Charlie", ["gaming", "coding"])
]

df = spark.createDataFrame(data, schema=schema)

# Write 
(
    df.write.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", url)
    .option("dbtable", table_name)
    .option("tempdir", temp_s3)
    .option("tempformat", "AVRO")  # Throws an error if I use PARQUET
    .option("forward_spark_s3_credentials", True)
    .mode("overwrite")
    .save()
)

# read it back
df_new = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", url)
    .option("dbtable", table_name)
    .option("tempdir", temp_s3)
    .option("forward_spark_s3_credentials", True)
    .schema(schema)
    .load()
)

When I try to do anything with df_new, I get an error about parsing the hobbies column. If I don't specify the schema, the column is read as a string.

asked a year ago · 774 views
1 Answer

Hi,

When writing to Redshift, the connector’s “tempformat” option supports CSV, GZIP, and Parquet as values for writing complex types. AVRO isn’t supported as a “tempformat” for writing to Redshift from Spark, so attempting to use it will throw an exception. To address the read issue, double-check that the schema used to read the Parquet files back is accurate and matches the structure of the stored data. Incompatibilities between the specified schema and the actual data can result in parsing errors.
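A quick sanity check along these lines (a minimal sketch, assuming the df and df_new variables from the question are in scope) is to print and compare both schemas, since a mismatch there usually explains the parsing error:

# Minimal sketch: compare the schema used for the write with the schema
# seen when reading back through the connector.
df.printSchema()      # schema of the DataFrame that was written
df_new.printSchema()  # schema on read; a mismatch here points at the parsing error

if df.schema != df_new.schema:
    print("Schema mismatch:")
    print(df.schema.simpleString())
    print(df_new.schema.simpleString())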

If there’s no apparent schema discrepancy, parsing errors when reading Parquet files into Redshift could be related to the specific data within the Parquet files, such as null values, different data types than expected, or structural issues. Make sure the Parquet files don’t contain unexpected or incompatible data types or null values that might conflict with the Redshift table’s schema.
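One way to check the staged data directly (a sketch, assuming the intermediate files are still present under the tempdir from the question and that the write used AVRO as shown there; the connector may clean these files up) is to load them back with Spark and inspect the types and nulls:

# Sketch: inspect the intermediate files the connector staged in S3.
staged = (
    spark.read.format("avro")               # or "parquet", matching the tempformat used
    .option("recursiveFileLookup", "true")  # the connector writes into a subdirectory
    .load(temp_s3)
)
staged.printSchema()
staged.show(truncate=False)

# Count rows where the complex column is null
staged.filter(staged["hobbies"].isNull()).count()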

As you mentioned, when a schema is not explicitly specified during the read, the Spark DataFrame falls back to string types for columns whose schema it can’t infer, which can cause issues with complex data types like arrays. Specifying an accurate schema is therefore crucial for correctly interpreting and working with complex structures like arrays in your DataFrame. Additional reading: DataFrame, DynamicFrame class.
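If the complex column still comes back as a plain string when no schema is given (as described in the question), one possible workaround, which is an assumption on my part rather than part of this answer, is to parse it back into an array with from_json after the read, provided the stored text is valid JSON:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StringType

# Sketch: read without a schema (hobbies arrives as a string) and parse it
# back into array<string>. Assumes the stored text looks like '["reading","swimming"]'.
df_raw = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", url)
    .option("dbtable", table_name)
    .option("tempdir", temp_s3)
    .option("forward_spark_s3_credentials", True)
    .load()
)

df_parsed = df_raw.withColumn(
    "hobbies", from_json(col("hobbies"), ArrayType(StringType()))
)
df_parsed.printSchema()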

BezuW (AWS)
answered a year ago
  • Thanks for the response. What version of the connector are you using? I can only write with Avro; if I use Parquet I get an error. I am using EMR Serverless 6.12.
