Can I verify that Glue is using push_down_predicate?

Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:

  • event_date_year
    • event_date_month
      • event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3? I am using this code to create the dynamic frame:

dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname', table_name='tablename', 
    push_down_predicate="(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
)

When I run dyf1.toDF().explain(), I only see "Scan ExistingRDD" and nothing about predicate pushdown. Then again, that may be because I specified the pushdown on the Glue dynamic frame, so Spark might not know anything about it...

I also tried to use .filter() and .where() on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like

  • Filter ...
    • Scan ExistingRDD...

which makes me think it's scanning the full data and not doing predicate pushdown...

How can I verify that a pushdown is really happening? Can I assume it is if glueContext.create_dynamic_frame.from_catalog() with the push_down_predicate parameter does not throw an error?

Thanks, Mark

Mark
asked 12 days ago, 126 views
1 Answer
That's a good question. The only way I can think of is to use a filter that doesn't return any data and compare how long the job takes with and without the pushdown. This assumes the table has enough partitions for the difference to be noticeable; if it doesn't, the pushdown doesn't really matter.
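A minimal sketch of that comparison, in case it helps. The helper names (partition_predicate, time_with_pushdown, time_without_pushdown) are made up for illustration, and it assumes no partition exists for the probe date 1900-01-01. The glue_context argument is the glueContext object your job already has. Timing is only indirect evidence, and cluster warm-up or caching can skew a single run, so repeat it a few times.

```python
import time
from datetime import date


def partition_predicate(dates):
    """Build an OR-ed predicate over the three partition columns from the question."""
    return " OR ".join(
        f"(event_date_year = '{d:%Y}' AND event_date_month = '{d:%m}' "
        f"AND event_date_day = '{d:%d}')"
        for d in dates
    )


def time_with_pushdown(glue_context, database, table_name, predicate):
    """Load with push_down_predicate and return (row_count, seconds)."""
    start = time.perf_counter()
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=predicate,
    )
    rows = dyf.count()  # forces the read
    return rows, time.perf_counter() - start


def time_without_pushdown(glue_context, database, table_name, predicate):
    """Load the whole table, filter afterwards in Spark, return (row_count, seconds)."""
    start = time.perf_counter()
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )
    rows = dyf.toDF().where(predicate).count()  # filter applied after the scan
    return rows, time.perf_counter() - start


# Inside a Glue job (glueContext as in the question):
# empty = partition_predicate([date(1900, 1, 1)])  # assumed to match no partition
# print(time_with_pushdown(glueContext, "dbname", "tablename", empty))
# print(time_without_pushdown(glueContext, "dbname", "tablename", empty))
```

Both calls should report 0 rows. If the first one returns much faster than the second on a table with many partitions, that suggests the partitions are being pruned before any data is read.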

AWS
EXPERT
answered 9 days ago