Hello,
Thanks for reaching out.
As you already know, the issue is related to the change in Parquet timestamp/date handling between Spark 2.x and Spark 3.x.
Regarding the "LEGACY" setting, you can apply that in Glue as well; the setting is accessible at "Job details" -> "Advanced properties" -> "Job parameters", as documented at [1].
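In the Glue console, the parameter is entered as a key/value pair. Per [1], it would look like this (the value below assumes you want the LEGACY rebase behavior):

```
Key:   --conf
Value: spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY
```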
Also, as documented at [2], Glue 2.0 uses Spark 2.4.3, while Glue 3.0+ uses Spark 3.x, so as a workaround you can run the job on Glue 2.0 to stay on Spark 2.4.3.
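If you'd rather pin the version programmatically, here is a minimal boto3 sketch; the job name, role, and script location are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Switch an existing job to Glue 2.0 so it runs Spark 2.4.3.
# "my-parquet-job", the role, and the script path are placeholders.
glue.update_job(
    JobName="my-parquet-job",
    JobUpdate={
        "Role": "MyGlueServiceRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/job.py",
        },
        "GlueVersion": "2.0",
    },
)
```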
Hope the above information is helpful.
References:
[1] - https://docs.aws.amazon.com/glue/latest/dg/migrating-version-40.html#migrating-version-40-from-20
[2] - https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
The issue you're experiencing stems from the upgrade to Spark 3.0, which uses the Proleptic Gregorian calendar, whereas Spark 2.x and legacy versions of Hive use a hybrid Julian+Gregorian calendar. This causes ambiguity when reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files written by those older systems. The error message you've provided suggests two solutions: set spark.sql.legacy.parquet.datetimeRebaseModeInRead to LEGACY to rebase the datetime values across the calendar difference during reading, or set it to CORRECTED to read the datetime values as they are stored.
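As a minimal sketch in plain PySpark (the S3 path is a placeholder), the two modes look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# LEGACY: rebase ancient dates/timestamps from the hybrid Julian+Gregorian
# calendar into the Proleptic Gregorian calendar while reading.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

# CORRECTED: read the stored values as-is, with no rebasing.
# spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")

df = spark.read.parquet("s3://my-bucket/legacy-data/")
```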
In AWS Glue, there are multiple ways to set Spark properties, but none of them seems to directly expose Spark SQL properties. However, you could potentially work around that by setting these properties from within a Glue ETL script (see the sketch below). Unfortunately, due to time constraints, I was unable to confirm the exact steps.
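For what it's worth, in a regular Glue ETL script the property can be set on the Spark session that the GlueContext wraps. A sketch, untested against Glue DQ rulesets specifically:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Set the rebase mode before any Parquet reads happen in the job.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
```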
One way you could avoid the issue entirely is to transform your data so that it doesn't include dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z. This might involve adding a preprocessing step before your AWS Glue DQ checks, as sketched below.
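For example, such a step could clamp the ambiguous values to the safe boundary before the DQ checks run. In this sketch, "event_ts" and the S3 paths are hypothetical placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw-data/")  # hypothetical source

# Timestamps at or after 1900-01-01T00:00:00Z are unambiguous across the
# two calendars, so clamp anything older to that boundary.
cutoff = F.to_timestamp(F.lit("1900-01-01 00:00:00"))
cleaned = df.withColumn(
    "event_ts",  # hypothetical timestamp column
    F.when(F.col("event_ts") < cutoff, cutoff).otherwise(F.col("event_ts")),
)

cleaned.write.mode("overwrite").parquet("s3://my-bucket/dq-ready/")  # hypothetical target
```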
Hi,
Thanks for the responses.
I'm specifically using AWS Glue DQ to run data quality rulesets, and it doesn't expose configuration for these parameters.