Avoid select * from table in toDF() for big tables - Glue

0

Hi. When I call DynamicFrame.toDF() in Glue, it runs a "select * from table", which is a problem when the table is very big. How can I add a filter to the query so that it doesn't read all the table data?

DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database=dataSourceCatalogDataBase,
    table_name=dataSourceCatalogTableName,
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="DataSource0",
)

df1 = DataSource0.toDF()

Asked 1 year ago, viewed 250 times
2 Answers
1

@RobertoH,

If you are reading from a relational database, you can push a query down to the source by using the sampleQuery connection option, as described here.

Hope this helps,

AWS
Expert
Answered 1 year ago
  • Thanks. I tried it, but it didn't work. It still runs a select * from table.

    DataSource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "url": "jdbc:postgresql://ip:5432/db",
            "user": "xxx",
            "password": "xxx",
            "dbtable": "pushed_checkpoints",
            "query": "SELECT * FROM pushed_checkpoints where pushed_at>'2022-12-01'",
        },
    )
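    Note that the snippet in this comment passes the filter under the key "query", while the answer names the option sampleQuery. A minimal sketch of the same read with that key swapped is shown here. The helper names build_connection_options and read_filtered are hypothetical and only for illustration; the URL, credentials, table, and filter are the placeholders from the comment. It assumes the job already has a GlueContext, and whether the filter is actually pushed down may depend on the Glue version and the database.

```python
def build_connection_options(url, user, password, table, where_clause):
    """Build JDBC connection options that carry a filtered query.

    Hypothetical helper: it only assembles the dictionary that is
    passed to Glue as connection_options.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "dbtable": table,
        # The answer names "sampleQuery"; the comment used "query" instead.
        "sampleQuery": f"SELECT * FROM {table} WHERE {where_clause}",
    }


def read_filtered(glue_context, options):
    """Read through an existing GlueContext and convert to a Spark DataFrame.

    glue_context is assumed to be the GlueContext created by the job script.
    """
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options=options,
    )
    return dyf.toDF()


# Placeholder values copied from the comment above.
options = build_connection_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    table="pushed_checkpoints",
    where_clause="pushed_at > '2022-12-01'",
)
```

    Inside a Glue job this would be used as df1 = read_filtered(glueContext, options). Keeping the dictionary in its own function makes it easy to print the options and confirm which query string is being sent before running the job.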

0

toDF() returns a DataFrame that has a show method, which displays only a limited number of rows. You can use that if you want to see a subset of the data. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

Answered 1 year ago
