AWS Glue catalogPartitionPredicate: unable to handle the change-of-year scenario

I am planning to use catalogPartitionPredicate in one of my projects, but there is one scenario I cannot handle. Details:

  1. Partition columns: year, month and day
  2. catalogPartitionPredicate: year>='2021' and month>='12'

If the year changes to 2022 (for example, 2022-01-01) and I want to read data from 2021-12-01 onward, this expression cannot handle it: month>='12' excludes the 2022 partitions for months 01 through 11. I tried concatenating the partition keys, but that didn't work.

Is there any way to implement to_date functionality or any other workaround to handle this scenario?
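
To make the failure concrete, here is a small pure-Python sketch that mimics the predicate from the question. It assumes the partition values are zero-padded strings; the partition list is made up for illustration and no Glue call is involved.

```python
# Hypothetical (year, month) partition values, stored as zero-padded strings.
partitions = [("2021", "11"), ("2021", "12"), ("2022", "01"), ("2022", "02")]


def naive_filter(year: str, month: str) -> bool:
    """Mirror of: year>='2021' and month>='12' (plain string comparison)."""
    return year >= "2021" and month >= "12"


selected = [p for p in partitions if naive_filter(*p)]
print(selected)  # [('2021', '12')]
```

The two conditions are evaluated independently, so January and February 2022 are dropped even though they are later than 2021-12.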

Asked 1 year ago, 298 views
1 answer

Yes, for that kind of query you need to concatenate.
However, catalogPartitionPredicate is a server-side filter with limited capabilities.
Instead, you can use push_down_predicate. It accepts Spark SQL syntax, so there are several ways to do this; the simplest is probably:

year || month >= '202112'

This is a string comparison, so it relies on the month values being zero-padded ('01' to '12').

You can also keep catalogPartitionPredicate: year>='2021' so that fewer partitions are listed on the server side.
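
As a rough sketch of how the two predicates could be combined in a Glue job: the helper names `build_predicates` and `read_from` are hypothetical, and the code assumes year, month and day are zero-padded string partition columns. It uses Spark SQL's `concat()` in place of `||`, extended to include the day. The Glue call sits inside a function that is not invoked here, so the block runs without the awsglue library.

```python
from datetime import date


def build_predicates(start: date) -> dict:
    """Build both predicates for reading partitions on or after `start`.

    Assumption: partition values are zero-padded strings, so a
    lexicographic comparison matches chronological order.
    """
    return {
        # Evaluated with Spark SQL syntax; decides which partitions are read.
        "push_down_predicate": (
            f"concat(year, month, day) >= '{start.strftime('%Y%m%d')}'"
        ),
        # Optional coarse server-side filter on the partition listing.
        "catalogPartitionPredicate": f"year >= '{start.year:04d}'",
    }


def read_from(glue_context, database: str, table_name: str, start: date):
    """Hypothetical wrapper; glue_context is an awsglue GlueContext."""
    preds = build_predicates(start)
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=preds["push_down_predicate"],
        additional_options={
            "catalogPartitionPredicate": preds["catalogPartitionPredicate"]
        },
    )


preds = build_predicates(date(2021, 12, 1))
print(preds["push_down_predicate"])
print(preds["catalogPartitionPredicate"])
```

Inside a job, `read_from(glueContext, "my_db", "my_table", date(2021, 12, 1))` would be the call, where the database and table names are placeholders.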

AWS (Expert)
Answered 1 year ago
  • I am currently using this approach, but it doesn't make full use of catalogPartitionPredicate. For example, if the current month is 7 and we apply catalogPartitionPredicate: year>='2021', it will still bring in the last six months of data.

  • You do need the push_down_predicate property to do the filtering and prevent the data from being read. If you also add catalogPartitionPredicate: year>='2021', you reduce the list of partitions retrieved from the catalog (partitions, not data), but that is optional; the important part is push_down_predicate.
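
If a tighter server-side filter is wanted, one possible sketch is to spell out the year boundary with OR instead of concatenating. This assumes the catalog expression syntax accepts OR and parentheses, which should be checked against the Glue documentation and your own table before relying on it. The helper names are hypothetical, and `same_logic` is only a pure-Python mirror used to check the logic locally.

```python
def catalog_predicate_from(year: int, month: int) -> str:
    """Build a catalog expression for partitions on or after year-month.

    Assumption: year and month are zero-padded string partition keys.
    """
    y, m = f"{year:04d}", f"{month:02d}"
    return f"(year = '{y}' and month >= '{m}') or year > '{y}'"


def same_logic(p_year: str, p_month: str, year: int, month: int) -> bool:
    """Pure-Python mirror of the expression above, for a local sanity check."""
    y, m = f"{year:04d}", f"{month:02d}"
    return (p_year == y and p_month >= m) or p_year > y


# Hypothetical partition values for illustration.
partitions = [("2021", "11"), ("2021", "12"), ("2022", "01"), ("2022", "07")]
kept = [p for p in partitions if same_logic(*p, 2021, 12)]
expression = catalog_predicate_from(2021, 12)
print(expression)
print(kept)
```

Even with such an expression in catalogPartitionPredicate, keeping push_down_predicate as the authoritative filter seems the safer design, in line with the reply above.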
