Hello,
As you mentioned, this is expected behaviour with Parquet files, because AWS Glue and Apache Spark use the same Parquet readers. Since each Parquet file carries its own schema, when a Glue DynamicFrame reads a Parquet table it picks up the schema from the first file in the S3 location; it does not scan through all the files to define the schema. If the files are stored in multiple folders, it picks the first file from the first folder. Generally, adding the mergeSchema option to your read should resolve this issue. Can you please verify that the syntax is correct:
======
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="tsetdb",
    table_name="testtable",
    transformation_ctx="datasource0",
    additional_options={"mergeSchema": "true"}
)
======
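To make the difference concrete, here is a minimal plain-Python sketch (the file schemas and helper names are hypothetical, just standing in for per-file Parquet footers) contrasting the default first-file behaviour with what mergeSchema does:

```python
# Hypothetical per-file schemas, represented as {column: type} dicts.
file_schemas = [
    {"id": "long", "name": "string"},                # first (oldest) file
    {"id": "long", "name": "string", "age": "int"},  # newer file, extra column
]

def first_file_schema(schemas):
    """Default behaviour: only the first file's schema is used."""
    return schemas[0]

def merged_schema(schemas):
    """With mergeSchema: union the columns across all files."""
    merged = {}
    for schema in schemas:
        merged.update(schema)
    return merged

print(first_file_schema(file_schemas))  # 'age' is missing
print(merged_schema(file_schemas))      # 'age' is present
```

Without merging, any column that first appears in a later file (here, `age`) silently drops out of the table definition.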
Also try Spark SQL and check whether it gives the expected result. You can enable schema merging for SQL queries in Glue by adding this line to your script:
======
glueContext.sql("set spark.sql.parquet.mergeSchema=true")
======
I hope these help. If not, you could also try renaming the new file (the one with the new schema) so that it sorts first in the S3 listing. Not a very clean approach, but it can be tried.
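Since S3 lists object keys in lexicographic (UTF-8 byte) order, you can check locally which file would come first in the listing. A small sketch with hypothetical key names:

```python
# S3 returns keys in lexicographic byte order, so the "first file"
# is simply the smallest key in the prefix.
keys = [
    "data/part-00001.parquet",       # old schema
    "data/part-00002.parquet",       # old schema
    "data/aaa-new-schema.parquet",   # renamed so it sorts first
]

first_key = min(keys)  # same as sorted(keys)[0]
print(first_key)  # data/aaa-new-schema.parquet
```

A prefix like `aaa-` or `000-` sorts before the usual `part-*` names, which is why renaming can force the new schema to be the one picked up.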