Hello,
As you mentioned, this is expected behaviour with Parquet files, because AWS Glue and Apache Spark use the same Parquet readers. Since each Parquet file carries its own schema, when a Glue DynamicFrame reads a Parquet table it picks up the schema from the first file in the S3 location; it does not scan through all the files to define the schema. If the files are stored in multiple folders, it picks the first file from the first folder. Generally, adding the mergeSchema option to your read should resolve this issue. Can you please verify that the syntax is correct:
======
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="tsetdb",
    table_name="testtable",
    transformation_ctx="datasource0",
    additional_options={"mergeSchema": "true"}
)
======
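To make the difference concrete, here is a minimal plain-Python sketch (the file schemas and helper names are hypothetical, just standing in for per-file Parquet footers) contrasting the default first-file behaviour with what mergeSchema does:

```python
# Hypothetical per-file schemas, represented as {column: type} dicts.
file_schemas = [
    {"id": "long", "name": "string"},                # first (oldest) file
    {"id": "long", "name": "string", "age": "int"},  # newer file, extra column
]

def first_file_schema(schemas):
    """Default behaviour: only the first file's schema is used."""
    return schemas[0]

def merged_schema(schemas):
    """With mergeSchema: union the columns across all files."""
    merged = {}
    for schema in schemas:
        merged.update(schema)
    return merged

print(first_file_schema(file_schemas))  # 'age' is missing
print(merged_schema(file_schemas))      # 'age' is present
```

Without merging, any column that first appears in a later file (here, `age`) silently drops out of the table definition.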
Also try Spark SQL and check whether it gives the expected result. You can enable schema merging for SQL queries in Glue by adding this line to your script:
======
glueContext.sql("set spark.sql.parquet.mergeSchema=true")
======
I hope these help. If not, you could also try renaming the new file (the one with the new schema) so that it sorts first in the S3 listing. Not a very clean approach, but it can be tried.
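Since S3 lists object keys in lexicographic (UTF-8 byte) order, you can check locally which file would come first in the listing. A small sketch with hypothetical key names:

```python
# S3 returns keys in lexicographic byte order, so the "first file"
# is simply the smallest key in the prefix.
keys = [
    "data/part-00001.parquet",       # old schema
    "data/part-00002.parquet",       # old schema
    "data/aaa-new-schema.parquet",   # renamed so it sorts first
]

first_key = min(keys)  # same as sorted(keys)[0]
print(first_key)  # data/aaa-new-schema.parquet
```

A prefix like `aaa-` or `000-` sorts before the usual `part-*` names, which is why renaming can force the new schema to be the one picked up.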