AWS Glue catalogPartitionPredicate: unable to handle the change-of-year scenario


I plan to use catalogPartitionPredicate in one of my projects, but there is one scenario I cannot handle. Details:

  1. Partition columns: year, month and day
  2. catalogPartitionPredicate: year>='2021' and month>='12'

If the year changes to 2022 (for example, 2022-01-01) and I want to read data from 2021-12-01 onward, this expression cannot handle it: month>='12' excludes every 2022 partition before December, so the 2022 data is not read. I tried concatenating the partition keys, but that didn't work.

Is there any way to implement to_date functionality or any other workaround to handle this scenario?
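
To make the failure concrete, here is a minimal sketch that applies the same logic as the predicate above to a few hypothetical (year, month) partition values in plain Python. It assumes the partition values are strings and that months are zero-padded to two digits; the sample values are illustrative only.

```python
# Hypothetical partition values, assumed to be zero-padded strings.
partitions = [("2021", "11"), ("2021", "12"), ("2022", "01"), ("2022", "02")]


def matches_naive(year: str, month: str) -> bool:
    # Same logic as: year>='2021' and month>='12' (string comparison).
    return year >= "2021" and month >= "12"


selected = [p for p in partitions if matches_naive(*p)]
print(selected)  # only ('2021', '12') survives; both 2022 partitions are dropped
```

The month condition is evaluated independently of the year, which is why January and February 2022 are filtered out.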

asked a year ago, 264 views
1 Answer

Yes, for that kind of query you need to concatenate the partition keys.
However, catalogPartitionPredicate is a server-side filter with limited capabilities.
Instead, you can use push_down_predicate. It accepts Spark SQL syntax, so there are several ways to do this; the simplest is probably:

year || month >= '202112'

You can also keep catalogPartitionPredicate: year>='2021' so that fewer partitions are listed on the server side.
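
As a rough sketch of how the two predicates could be combined in a Glue job, the helper below builds both strings from a start date and passes them to create_dynamic_frame.from_catalog. The function names build_predicates and read_from, and the database and table arguments, are hypothetical. It also assumes year and month are string partitions with zero-padded months, since the concatenated comparison relies on that.

```python
from datetime import date
from typing import Tuple


def build_predicates(start: date) -> Tuple[str, str]:
    """Return (push_down_predicate, catalogPartitionPredicate) for a start month."""
    # Spark SQL: concatenate the keys, then compare as one string (assumes 'MM' months).
    push_down = f"year || month >= '{start:%Y%m}'"
    # Catalog side: coarse filter on year only, to shorten the partition listing.
    catalog = f"year>='{start:%Y}'"
    return push_down, catalog


def read_from(glue_context, database: str, table_name: str, start: date):
    # glue_context is expected to be an awsglue GlueContext created by the job.
    push_down, catalog = build_predicates(start)
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=push_down,
        additional_options={"catalogPartitionPredicate": catalog},
    )


push_down, catalog = build_predicates(date(2021, 12, 1))
print(push_down)
print(catalog)
```

Only build_predicates runs outside of Glue; read_from needs a real GlueContext and an existing catalog table.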

AWS EXPERT
answered a year ago
  • I am currently using this approach, but it doesn't make full use of catalogPartitionPredicate. For example, if the current month is 7 and we apply catalogPartitionPredicate: year>='2021', it still brings in the last 6 months of data.

  • You do need the push_down_predicate property to do the filtering and avoid reading the data. If you also add catalogPartitionPredicate: year>='2021', you reduce the list of partitions retrieved from the catalog (partitions, not data), but that is optional; the important part is push_down_predicate.
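
One possible way to narrow the catalog-side filter further, offered only as an untested sketch: express the year boundary with OR instead of concatenation. This assumes the catalog expression syntax accepts OR and parentheses, which has not been confirmed in this thread, and it again assumes zero-padded string months. The function name catalog_predicate_from is hypothetical.

```python
from datetime import date


def catalog_predicate_from(start: date) -> str:
    """Build a year-boundary-safe filter without concatenating keys.

    Assumption: the catalog expression parser accepts OR and parentheses.
    """
    year, month = f"{start:%Y}", f"{start:%m}"
    # Later months of the start year, or any month of a later year.
    return f"(year='{year}' and month>='{month}') or year>'{year}'"


def matches(year: str, month: str, start: date) -> bool:
    # Plain-Python equivalent of the expression above, for checking the logic.
    y, m = f"{start:%Y}", f"{start:%m}"
    return (year == y and month >= m) or year > y


predicate = catalog_predicate_from(date(2021, 12, 1))
print(predicate)
```

If the catalog rejects this expression, the same string can be tried as the push_down_predicate instead, since the conditions are ordinary SQL comparisons.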
