I have a Glue job which reads from a Glue Catalog table in Hudi format. Every read returns the whole dataset, while I expect only the first run to return data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).
My Hudi config is the following:
hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': 'insert'
    'hoodie.datasource.write.operation': 'upsert',
}
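For completeness, this config is applied on write roughly like this (a sketch assuming a Spark DataFrame `df`; my actual job may use the Glue Hudi connector instead, but the effect is the same):

# Sketch: how hudi_config is applied on write; 'path' is taken from the config
df.write.format('hudi') \
    .options(**hudi_config) \
    .mode('append') \
    .save()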
The reading part looks like this:
from awsglue.dynamicframe import DynamicFrame

Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")
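To illustrate what I expected: something equivalent to a Hudi incremental query, where only commits after a checkpoint are returned. A sketch with a plain Spark read (`last_commit_time` is a hypothetical placeholder for the last instant already processed; it is not tracked anywhere in my job yet):

# Sketch of the incremental behavior I expected (plain Spark read).
# 'last_commit_time' is a hypothetical checkpoint from the previous run.
incremental_df = spark.read.format('hudi') \
    .option('hoodie.datasource.query.type', 'incremental') \
    .option('hoodie.datasource.read.begin.instanttime', last_commit_time) \
    .load(f"s3://{config['target_bucket']}{config['s3_path']}")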
The result is written to an S3 bucket and always produces the same file containing the full dataset.
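In case it's relevant, the job has the usual bookmark scaffolding; as far as I understand, transformation_ctx only suppresses already-processed data when job bookmarks are enabled (--job-bookmark-option job-bookmark-enable) and the job commits. A sketch of what my setup looks like:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # bookmark state is tracked per transformation_ctx

# ... read / transform / write ...

job.commit()  # without this, the bookmark state is never advanced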
Thank you