Glue job reading a Hudi dataset ignores the bookmark


I have a Glue job that reads from a Glue Catalog table in Hudi format. After every read the response contains the whole dataset, while I expect only the first run to return data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).

My Hudi config is the following:

hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': 'insert'
    'hoodie.datasource.write.operation': 'upsert'
}
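
For context, this dict is consumed on the write side in the usual Hudi-on-Spark way. A minimal sketch, assuming a DataFrame named output_df produced by the job's transforms (the variable name is a placeholder, the mode and target come from the config above):

# Sketch: how a hudi_config dict like the one above is typically consumed.
# output_df is an assumed placeholder for the DataFrame the job produces.
(
    output_df.write.format("hudi")
    .options(**hudi_config)
    .mode("append")               # upsert semantics come from the operation key above
    .save(hudi_config["path"])
)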

The reading part looks like this:

# transformation_ctx is the key Glue uses to track bookmark state for this read
Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")

The result is written to an S3 bucket, and every run produces the same full file.
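
For reference, bookmarks also need the standard job plumbing around the read: bookmarks enabled in the job properties, a transformation_ctx on the read (as above), and job.init/job.commit bracketing the run. A minimal sketch of that boilerplate:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... reads with transformation_ctx, transforms, write to S3 ...

job.commit()  # bookmark state is only persisted on commit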

Thank you

Denys
Asked a year ago · 343 views
1 Answer

Bookmarks only work on plain files. When the documentation says they support Parquet, it means plain Parquet files; Hudi is a format layered on top of Parquet (the same applies to Iceberg and Delta).
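
If you need change-only reads from a Hudi table, Hudi's own incremental query type can play the role the bookmark would. A sketch, not a drop-in answer: the job has to persist the last consumed commit instant itself between runs, and begin_instant and the S3 path below are placeholders:

# Sketch: Hudi incremental query as a bookmark substitute.
# begin_instant must be persisted by the job between runs
# (e.g. in DynamoDB or an SSM parameter); the value below is a placeholder.
begin_instant = "20240101000000"

incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": begin_instant,
}

# spark is the SparkSession, e.g. glueContext.spark_session
changes_df = (
    spark.read.format("hudi")
    .options(**incremental_opts)
    .load("s3://target-bucket/hudi/table/")  # placeholder path
)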

AWS
EXPERT
Answered a year ago
