AWS Glue catalogPartitionPredicate: Unable to handle the change-of-year scenario


I am planning to utilize catalogPartitionPredicate in one of my projects. I am unable to handle one of the scenarios. Below are the details:

  1. Partition columns: year, month, and day
  2. catalogPartitionPredicate: year>='2021' and month>='12'

If the year changes to 2022 (for example, today is 2022-01-01) and I want to read data from 2021-12-01 onward, the expression can't handle it: month>='12' filters out the 2022 partitions for January through November. I tried concatenating the partition keys, but that didn't work.
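To make the failure concrete, here is a small local sketch (plain Python, no Glue involved) that applies both conditions to a made-up list of partitions. It assumes year and month are string partition values and that month is zero-padded to two digits; the partition list is purely illustrative.

```python
# Hypothetical (year, month) partition values, stored as zero-padded strings.
partitions = [("2021", "11"), ("2021", "12"), ("2022", "01"), ("2022", "02")]

# Same logic as: year>='2021' and month>='12'
# Each column is compared on its own, so January 2022 fails the month test.
independent = [p for p in partitions if p[0] >= "2021" and p[1] >= "12"]

# Comparing the concatenated key orders partitions chronologically.
concatenated = [p for p in partitions if p[0] + p[1] >= "202112"]

print(independent)   # [('2021', '12')]
print(concatenated)  # [('2021', '12'), ('2022', '01'), ('2022', '02')]
```

The string comparison on the concatenated key only sorts chronologically because the month is zero-padded; with values such as '1' instead of '01' it would not.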

Is there any way to implement to_date-style functionality, or any other workaround, to handle this scenario?

Asked a year ago, 299 views
1 Answer

Yes, for that kind of query you need to concatenate the partition keys.
However, catalogPartitionPredicate is a server-side filter and has limited capabilities.
Instead, you can use push_down_predicate. It accepts Spark SQL syntax, so you can do this in several ways; the simplest is probably:

year || month >= '202112'

You can also keep catalogPartitionPredicate: year>='2021' so that fewer partitions are listed on the server side.
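As a rough sketch of how the two predicates could be combined in a Glue job: the helper below builds both strings from a start date. The function name build_predicates, the database name my_db, and the table name my_table are hypothetical, and the sketch assumes string partition columns named year and month with month zero-padded to two digits. The Glue call is shown only as a comment because it needs the Glue runtime.

```python
from datetime import date


def build_predicates(start: date) -> dict:
    """Build both predicates for reading partitions from `start` onward.

    Assumes string partition columns `year` and `month`, with month
    zero-padded to two digits ('01'..'12').
    """
    year = f"{start.year:04d}"
    yyyymm = f"{year}{start.month:02d}"
    return {
        # Spark SQL expression; this is the filter that does the real pruning.
        "push_down_predicate": f"year || month >= '{yyyymm}'",
        # Optional catalog-side filter; it only shortens the partition listing.
        "catalogPartitionPredicate": f"year >= '{year}'",
    }


preds = build_predicates(date(2021, 12, 1))
print(preds["push_down_predicate"])
print(preds["catalogPartitionPredicate"])

# Inside a Glue job, the strings would be passed roughly like this
# (database and table names are placeholders):
#
# dyf = glue_context.create_dynamic_frame.from_catalog(
#     database="my_db",
#     table_name="my_table",
#     push_down_predicate=preds["push_down_predicate"],
#     additional_options={
#         "catalogPartitionPredicate": preds["catalogPartitionPredicate"]
#     },
# )
```

Deriving both strings from one date keeps the two filters consistent when the start of the window moves across a year boundary.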

AWS
answered a year ago
  • I am currently using this approach, but it doesn't make full use of catalogPartitionPredicate. For example, if the current month is 7 and we apply catalogPartitionPredicate: year>='2021', it will still bring in the last 6 months of data.

  • You do need the push_down_predicate property to do the filtering and avoid reading the data. If you also add catalogPartitionPredicate: year>='2021', you reduce the list of partitions retrieved from the catalog (partitions, not data), but that is optional; the important part is push_down_predicate.
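One possible way to tighten the catalog-side filter without concatenation is to spell the date range out with plain comparisons. The sketch below only builds such a string and checks the logic locally. Whether catalogPartitionPredicate accepts OR and parentheses in exactly this form is an assumption that should be verified against the Glue documentation; the helper names are hypothetical, and month is again assumed to be a zero-padded string.

```python
from datetime import date


def catalog_predicate_from(start: date) -> str:
    """Range condition using only comparisons, AND, and OR (no concatenation)."""
    y, m = f"{start.year:04d}", f"{start.month:02d}"
    return f"year > '{y}' OR (year = '{y}' AND month >= '{m}')"


def matches(year: str, month: str, start: date) -> bool:
    """Local stand-in for the predicate, used only to sanity-check the logic."""
    y, m = f"{start.year:04d}", f"{start.month:02d}"
    return year > y or (year == y and month >= m)


start = date(2021, 12, 1)
predicate = catalog_predicate_from(start)

# Made-up partitions around the year boundary.
sample = [("2021", "11"), ("2021", "12"), ("2022", "01")]
kept = [p for p in sample if matches(p[0], p[1], start)]

print(predicate)  # year > '2021' OR (year = '2021' AND month >= '12')
print(kept)       # [('2021', '12'), ('2022', '01')]
```

Even if the catalog accepts this expression, keeping push_down_predicate alongside it follows the advice above, since that is the filter the answer relies on.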
