Avoid select * from table in toDF() for big tables - Glue

0

Hi. When I call DynamicFrame.toDF() in Glue, it runs a "select * from table", which is a problem when the table is very big. How can I add a filter to the query so it doesn't read all the table data?

DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = dataSourceCatalogDataBase, table_name = dataSourceCatalogTableName,redshift_tmp_dir = args["TempDir"], transformation_ctx = "DataSource0")

df1 = DataSource0.toDF()

asked a year ago · 239 views
2 Answers
1

@RobertoH,

If you are reading from a relational database, you can use the connection options to push a query down to the database with the sampleQuery option, as described here.
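
A minimal sketch of the sampleQuery approach, assuming a PostgreSQL source read with from_options inside a Glue job. The helper names build_jdbc_options and read_filtered are hypothetical, and the URL, credentials, table and filter are placeholders. Whether the filter actually reaches the database is worth checking in the database's query log.

```python
def build_jdbc_options(url, user, password, table, sample_query):
    """Build Glue JDBC connection options (hypothetical helper).

    The filter goes under the "sampleQuery" key, the option named above.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "dbtable": table,
        "sampleQuery": sample_query,
    }


def read_filtered(glue_context, options, connection_type="postgresql"):
    """Read with an existing GlueContext and convert to a Spark DataFrame.

    Assumes glue_context was created in the Glue job as usual.
    """
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type=connection_type,
        connection_options=options,
    )
    return dyf.toDF()


# Placeholder values; replace with the real connection details.
options = build_jdbc_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    table="pushed_checkpoints",
    sample_query="SELECT * FROM pushed_checkpoints WHERE pushed_at > '2022-12-01'",
)
```

Inside the job, df1 = read_filtered(glueContext, options) would then replace the plain toDF() call. Note that the option key is "sampleQuery"; a differently named key may not be picked up as the pushed-down query.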

hope this helps,

AWS
EXPERT
answered a year ago
  • Thanks. I tried it, but it didn't work. It still runs a select * from table.

    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "postgresql", connection_options = {"url": "jdbc:postgresql://ip:5432/db" ,"user": "xxx", "password":"xxx" ,"dbtable": "pushed_checkpoints", "query":"SELECT * FROM pushed_checkpoints where pushed_at>'2022-12-01'"} )

0

The DataFrame returned by toDF() has a show method that displays only a given number of rows. You can use it if you only want to see a subset of the data. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

answered a year ago
