Glue job reading a Hudi dataset ignores bookmarks


I have a Glue job which reads from a Glue Catalog table in Hudi format. After every read the response contains the whole dataset, while I expect only the first run to return data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).

My Hudi config is the following:

hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': "insert"
    'hoodie.datasource.write.operation': "upsert"
}
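
This config is applied when the Hudi table itself is written. As a rough sketch of that step (using the plain Spark DataFrame writer and a placeholder DataFrame df; the actual job may use the Glue Hudi connector instead):

# Sketch only: `df` is a placeholder for the DataFrame being persisted.
# Hudi performs the upsert itself based on the record key, so mode('append') is used.
(
    df.write.format('hudi')
    .options(**hudi_config)
    .mode('append')
    .save(hudi_config['path'])
)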

The reading part looks like this:

# Read the Hudi-backed catalog table into a Spark DataFrame;
# transformation_ctx is what ties this read to the job bookmark.
Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)

# Convert to a DynamicFrame for the downstream Glue transforms and sink.
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")
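
For reference, bookmarks only take effect when the job bookmark option is enabled on the job and the script initializes and commits the bookmark state; the standard boilerplate around the read looks roughly like this (a sketch, not my exact script):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)   # bookmark state is loaded here

# ... reads and writes with transformation_ctx values ...

job.commit()                       # bookmark state is persisted here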

The result is written to an S3 bucket and always produces the same file.

Thank you

Denys
1 Answer

Bookmarks only work on plain files: when the documentation says it supports Parquet, it means plain Parquet files. Hudi is a table format layered on top of Parquet, so bookmarks are not applied to it (the same holds for Iceberg and Delta).
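
If you only want the records added or changed since the last run, one option is to use Hudi's own incremental query instead of relying on bookmarks. A rough sketch (you need to persist the last processed commit instant yourself, e.g. in DynamoDB or S3; last_commit_time below is a placeholder for it):

# Sketch of a Hudi incremental read; `last_commit_time` is a placeholder
# for the last commit instant your job has already processed.
incremental_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': last_commit_time,
}

incremental_df = (
    spark.read.format('hudi')
    .options(**incremental_options)
    .load(f"s3://{config['target_bucket']}{config['s3_path']}")
)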

