Glue reading the hudi dataset ignores bookmark

1

I have a glue job which reads from Glue Catalog table which is in hudi format and after every read the reponse contains the whole dateset while I expect only the first run to contain the data but all subsequent runs to return the empty dataset (given there were no changes to the source hudi dataset)

My hudi config is following:

   hudi_config = {
        'className': 'org.apache.hudi',
        'hoodie.datasource.hive_sync.use_jdbc': 'false',
        'hoodie.datasource.write.precombine.field': config['sort_key'],
        'hoodie.datasource.write.partitionpath.field': config['partition_field'],
        'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
        'hoodie.datasource.write.recordkey.field': config['primary_key'],
        'hoodie.table.name': config['hudi_table'],
        'hoodie.datasource.hive_sync.database': config['target_database'],
        'hoodie.datasource.hive_sync.table': config['hudi_table'],
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.consistency.check.enabled': 'true',
        'hoodie.cleaner.commits.retained': 10,
        'path': f"s3://{config['target_bucket']}{config['s3_path']}",
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
        'hoodie.bulkinsert.shuffle.parallelism': 200,
        'hoodie.upsert.shuffle.parallelism': 200,
        'hoodie.insert.shuffle.parallelism': 200,
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
        # 'hoodie.datasource.write.operation': "insert"
        'hoodie.datasource.write.operation': "upsert"
    }

The reading part looks like this:

Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")

The result is being written to s3 bucket and always generates the same file.

Thank you

Denys
已提问 1 年前343 查看次数
1 回答
0

Bookmarks only works on plain files, when the documentation says it supports parquet it means plain parquet files, Hudi is a format on top of parquet (same for Iceberg and Delta)

profile pictureAWS
专家
已回答 1 年前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则