Glue job reading a Hudi dataset ignores the bookmark


I have a Glue job that reads from a Glue Catalog table stored in Hudi format. After every read the response contains the whole dataset, while I expect only the first run to return data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).

My Hudi config is the following:

hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': "insert"
    'hoodie.datasource.write.operation': "upsert",
}
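(For context, the write call that consumes this config is not shown in the question. A config shaped like this, with 'className' and 'path' keys, is typically handed to the Glue Hudi connector roughly as in the sketch below; output_dyf, the connection_type value and the transformation_ctx name are assumptions:)

# Hypothetical write step: pass the Hudi options to the connector so the
# DynamicFrame is upserted into the table backing the Glue Catalog entry.
glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,                     # hypothetical DynamicFrame to persist
    connection_type="marketplace.spark",  # or "custom.spark", depending on how the Hudi connector is registered
    connection_options=hudi_config,
    transformation_ctx="Write_hudi_node"
)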

The reading part looks like this:

# DynamicFrame is needed to convert the Spark DataFrame returned by the catalog read
from awsglue.dynamicframe import DynamicFrame

Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")

The result is written to an S3 bucket and always produces the same file.
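For reference, the job around these snippets would typically look like the sketch below (the output path and node names are placeholders). Glue only tracks and advances a bookmark when job.init()/job.commit() are called and the job runs with --job-bookmark-option=job-bookmark-enable:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmark state is loaded here

# ... catalog read and transformations as shown above ...

# Write the result as plain Parquet to S3
glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/prefix/"},  # placeholder output path
    format="parquet",
    transformation_ctx="Write_node1"
)

job.commit()  # bookmark state is only persisted after commit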

Thank you

Denys
asked a year ago · 343 views
1 Answer

Job bookmarks only work on plain files. When the documentation says bookmarks support Parquet, it means plain Parquet files; Hudi is a table format layered on top of Parquet, so bookmarks cannot track it (the same applies to Iceberg and Delta).
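If the goal is to process only new commits from the Hudi table, Hudi's own incremental query can be used instead of job bookmarks. A minimal sketch, assuming the job persists the last processed commit timestamp itself (the last_instant handling and the path are hypothetical):

# Read only records committed after a given Hudi instant time.
# last_instant must be stored by the job between runs (e.g. in S3 or DynamoDB);
# "000" means "from the beginning of the timeline" on the very first run.
last_instant = "000"  # hypothetical checkpoint loaded from the previous run

incremental_df = (glueContext.spark_session.read
    .format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_instant)
    .load(f"s3://{config['target_bucket']}{config['s3_path']}"))

# incremental_df now contains only rows written after last_instant and is
# empty when nothing has changed since the previous run.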

AWS
EXPERT
answered a year ago
