Avoid SELECT * FROM table in toDF() for big tables - Glue


Hi. When I call DynamicFrame.toDF() in Glue, it runs a "select * from table", which is a problem when the table is very big. How can I add a filter to the query so it doesn't read all of the table's data?

DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = dataSourceCatalogDataBase, table_name = dataSourceCatalogTableName,redshift_tmp_dir = args["TempDir"], transformation_ctx = "DataSource0")

df1 = DataSource0.toDF()

asked a year ago, 251 views
2 Answers

@RobertoH,

if you are reading from a relational database, you can use the connection options to push down a query with the sampleQuery option, as described here.
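
A minimal sketch of what that could look like, reusing the table and filter that appear elsewhere in this thread. The helper names build_sample_query_options and read_filtered are made up for illustration, and the exact pushdown behavior of sampleQuery should be checked against the Glue JDBC documentation for your Glue version. Note that the key is "sampleQuery", not "query".

```python
def build_sample_query_options(url, user, password, table, where_clause):
    """Return JDBC connection options that carry a filtered sampleQuery.

    Hypothetical helper: it only assembles the dictionary that is passed
    to Glue as connection_options.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "dbtable": table,
        # Assumption: Glue sends this statement to the database instead of
        # reading the whole table.
        "sampleQuery": f"SELECT * FROM {table} WHERE {where_clause}",
    }


def read_filtered(glue_context, options, connection_type="postgresql"):
    """Read through Glue with the options above; needs a live GlueContext."""
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type=connection_type,
        connection_options=options,
    )
    return dyf.toDF()


options = build_sample_query_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    table="pushed_checkpoints",
    where_clause="pushed_at > '2022-12-01'",
)
```

Inside a Glue job you would then call read_filtered(glueContext, options) to get the filtered Spark DataFrame.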

hope this helps,

AWS
EXPERT
answered a year ago
  • Thanks. I tried it, but it didn't work. It still runs a select * from table.

    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "postgresql", connection_options = {"url": "jdbc:postgresql://ip:5432/db" ,"user": "xxx", "password":"xxx" ,"dbtable": "pushed_checkpoints", "query":"SELECT * FROM pushed_checkpoints where pushed_at>'2022-12-01'"} )
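
    One possible alternative, offered as a sketch rather than a confirmed fix: bypass the DynamicFrame and use Spark's own JDBC reader, which has had a "query" option since Spark 2.4. Spark does not allow "dbtable" and "query" to be set together, so only "query" is passed here. Whether Glue's from_options honors a "query" key at all is not something this thread confirms. The helper names below are hypothetical.

```python
def build_spark_jdbc_options(url, user, password, query):
    """Return options for spark.read.format("jdbc").

    Hypothetical helper. Spark rejects "dbtable" and "query" together,
    so only "query" is set.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        # Spark wraps this statement as a subquery and sends it to the database.
        "query": query,
    }


def read_with_spark(spark, options):
    """Load the query result as a Spark DataFrame; needs a live SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


spark_options = build_spark_jdbc_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    query="SELECT * FROM pushed_checkpoints WHERE pushed_at > '2022-12-01'",
)
```

    In a Glue job the session is available as glueContext.spark_session, so the call would be read_with_spark(glueContext.spark_session, spark_options).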


The DataFrame returned by toDF() has a show() method that displays only a limited number of rows. You can use that if you only want to see a subset of the data. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

answered a year ago
