Glue job reading a Hudi dataset ignores job bookmarks

I have a Glue job that reads from a Glue Catalog table in Hudi format. Every run returns the whole dataset. I expect only the first run to return the data, and all subsequent runs to return an empty dataset (given that there were no changes to the source Hudi dataset).

My Hudi config is as follows:

    hudi_config = {
        'className': 'org.apache.hudi',
        'hoodie.datasource.hive_sync.use_jdbc': 'false',
        'hoodie.datasource.write.precombine.field': config['sort_key'],
        'hoodie.datasource.write.partitionpath.field': config['partition_field'],
        'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
        'hoodie.datasource.write.recordkey.field': config['primary_key'],
        'hoodie.table.name': config['hudi_table'],
        'hoodie.datasource.hive_sync.database': config['target_database'],
        'hoodie.datasource.hive_sync.table': config['hudi_table'],
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.consistency.check.enabled': 'true',
        'hoodie.cleaner.commits.retained': 10,
        'path': f"s3://{config['target_bucket']}{config['s3_path']}",
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
        'hoodie.bulkinsert.shuffle.parallelism': 200,
        'hoodie.upsert.shuffle.parallelism': 200,
        'hoodie.insert.shuffle.parallelism': 200,
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
        # 'hoodie.datasource.write.operation': "insert"
        'hoodie.datasource.write.operation': "upsert"
    }

The reading part looks like this:

    Read_node1 = glueContext.create_data_frame.from_catalog(
        database="gluedatabase",
        table_name="table",
        transformation_ctx="Read_node1"
    )
    AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")

The result is written to an S3 bucket, and every run generates the same file.

Thank you

Denys
asked a year ago · 343 views
1 Answer

Bookmarks only work on plain files. When the documentation says bookmarks support Parquet, it means plain Parquet files. Hudi is a table format layered on top of Parquet (the same goes for Iceberg and Delta), so bookmarks do not apply to it.
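
If the goal is to pick up only new data on each run, one possible alternative is Hudi's own incremental query, which returns records committed after a given instant time. The sketch below is not from the original answer and has not been tested in Glue. The helper names (`incremental_read_options`, `read_new_commits`) are hypothetical, and it assumes you persist the last processed commit time yourself between runs, since bookmarks will not do it for you.

```python
from typing import Dict, Optional


def incremental_read_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build Hudi read options for an incremental query.

    begin_instant: only commits after this instant time are returned.
                   "000" is commonly used to mean "from the first commit".
    end_instant:   optional upper bound on the commit time.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def read_new_commits(spark, base_path: str, last_instant: str):
    """Read records committed after last_instant (hypothetical helper).

    spark is an existing SparkSession with the Hudi connector available;
    base_path is the S3 path of the Hudi table.
    Returns the DataFrame and the newest commit time it contains (or None).
    """
    df = (
        spark.read.format("hudi")
        .options(**incremental_read_options(last_instant))
        .load(base_path)
    )
    # _hoodie_commit_time is a metadata column Hudi adds to every record.
    newest = df.agg({"_hoodie_commit_time": "max"}).collect()[0][0]
    return df, newest
```

On the first run you would pass `"000"` as `last_instant`, then store the returned `newest` value (for example in a job parameter or a small S3 object, whichever suits you) and pass it on the next run. When nothing has changed, the incremental read should come back empty.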

AWS
EXPERT
answered a year ago
