Avoid select * from table in toDF() for big tables - Glue

Hi. When I call DynamicFrame.toDF() in Glue, it runs a "select * from table" query, which is a problem when the table is very big. How can I add a filter to the query so that it doesn't read all of the table's data?

DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = dataSourceCatalogDataBase, table_name = dataSourceCatalogTableName,redshift_tmp_dir = args["TempDir"], transformation_ctx = "DataSource0")

df1 = DataSource0.toDF()

Asked a year ago, 251 views
2 Answers
1

@RobertoH,

If you are reading from a relational database, you can use the connection options to push down a query with the sampleQuery option, as described here.
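As a minimal sketch of that suggestion, the options could be built like this. The helper names build_jdbc_options and read_filtered are hypothetical, the URL, credentials, and table are the placeholders from this thread, and whether the filter is actually pushed down depends on the Glue version and connector in use, so treat this as an assumption to verify rather than a confirmed fix.

```python
def build_jdbc_options(url, user, password, table, where_clause):
    """Build Glue JDBC connection options that carry a filtered query.

    Hypothetical helper: only the option keys are Glue's, the rest is illustrative.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "dbtable": table,
        # sampleQuery is the option named in the answer; it is meant to be
        # run in place of a full-table read (assumption: supported by your
        # Glue version and connection type).
        "sampleQuery": f"SELECT * FROM {table} WHERE {where_clause}",
    }


def read_filtered(glue_context, options):
    """Read a DynamicFrame with the given options.

    glue_context is the awsglue.context.GlueContext created by the job.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options=options,
    )


options = build_jdbc_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    table="pushed_checkpoints",
    where_clause="pushed_at > '2022-12-01'",
)
```

Inside a Glue job, read_filtered(glueContext, options).toDF() would then give the Spark DataFrame. Note that the key is sampleQuery, not query.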

Hope this helps,

AWS
Expert
Answered a year ago
  • Thanks. I tried it, but it didn't work. It still runs a select * from table.

    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "postgresql", connection_options = {"url": "jdbc:postgresql://ip:5432/db" ,"user": "xxx", "password":"xxx" ,"dbtable": "pushed_checkpoints", "query":"SELECT * FROM pushed_checkpoints where pushed_at>'2022-12-01'"} )
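The attempt above passes a "query" key, which is not the sampleQuery key named in the answer. If the Glue option still does not help, one possible fallback (an assumption, not something confirmed in this thread) is to skip the DynamicFrame and read through Spark's own JDBC source, which accepts a "query" option as long as "dbtable" is not set at the same time. The helper names below are hypothetical, and the PostgreSQL JDBC driver is assumed to be available to the job.

```python
def build_spark_jdbc_options(url, user, password, query):
    """Build options for Spark's JDBC data source (hypothetical helper)."""
    # Spark's JDBC source accepts either "dbtable" or "query", not both,
    # so only "query" is set here.
    return {
        "url": url,
        "user": user,
        "password": password,
        "query": query,
    }


def read_with_spark(spark, options):
    """Load a Spark DataFrame; the database runs the query, filter included."""
    return spark.read.format("jdbc").options(**options).load()


spark_options = build_spark_jdbc_options(
    url="jdbc:postgresql://ip:5432/db",
    user="xxx",
    password="xxx",
    query="SELECT * FROM pushed_checkpoints WHERE pushed_at > '2022-12-01'",
)
```

In a Glue job the Spark session is available as glueContext.spark_session, so read_with_spark(glueContext.spark_session, spark_options) would return the filtered DataFrame directly, with no toDF() call needed.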

0

The DataFrame returned by toDF() has a show method that displays only a certain number of rows. You can use that if you want to see a subset of the data. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

Answered a year ago
