AWS Glue Data Catalog Table Version comparison

I'm asking whether it's possible to access the previous version of the catalog table from within the ETL job, so that I can examine a specific column's contents. Currently, as I update the table from the raw bucket to the processed one, a record's older values are overwritten by the new values. The code for this step is:

df.write.format("hudi") \
    .options(**combinedConf_Upserts) \
    .mode("append") \
    .save()

Within the Spark-based ETL job that writes the data to Redshift, my goal is to compare the previous and current versions of each record and check whether any values have changed.
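The version comparison itself can be sketched independently of where the two versions come from. A minimal sketch of that logic, with a hypothetical record shape and column names:

```python
# Hedged sketch of the comparison step: given the previous and the current
# version of a record (however they are fetched), flag the columns that changed.
# The record shape and column names here are hypothetical.
def changed_columns(old: dict, new: dict) -> list:
    """Return the columns whose values differ between two record versions."""
    return [col for col in new if old.get(col) != new[col]]

old = {"id": 1, "status": "pending", "amount": 10}
new = {"id": 1, "status": "shipped", "amount": 10}
# changed_columns(old, new) -> ["status"]
```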

Deeins
Asked 7 months ago · 363 views
1 Answer
Accepted Answer

The catalog table versions don't tell you anything about the data values. It's also not clear how Redshift relates to the Hudi table timeline. It sounds like you are looking for something like the "MERGE INTO" command, which gives you more control over the upsert.
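For reference, Hudi supports MERGE INTO through Spark SQL (with the Hudi Spark session extension enabled). A rough sketch of such a statement; the table and column names here are hypothetical:

```python
# Hedged sketch: a Spark SQL MERGE INTO against a Hudi table gives row-level
# control over the upsert, e.g. only updating when a value actually changed.
# Table/column names (processed_table, staged_updates, record_id, value)
# are placeholders, not from the original question.
merge_sql = """
MERGE INTO processed_table AS target
USING staged_updates AS source
ON target.record_id = source.record_id
WHEN MATCHED AND target.value <> source.value THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
# In a Glue/Spark job this would run as: spark.sql(merge_sql)
# (requires spark.sql.extensions =
#  org.apache.spark.sql.hudi.HoodieSparkSessionExtension)
```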

AWS
EXPERT
Answered 7 months ago
  • Thank you for your response! Sorry, I think my question needs a bit more clarity!

    I already have different parquet files in the S3 bucket, which I saved using
    'hoodie.cleaner.commits.retained': 5, 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'

    but when I load the same table, I get only the latest commit

    df = spark.read.format("org.apache.hudi") \
        .option("hoodie.datasource.read.begin.instanttime", "_some_early_commits_time") \
        .load("s3://bucket/path-to-hudi-table/")

    but I was hoping to get all the commits when loading!

  • You normally only see one commit: on COW the file is overwritten, and on MOR you have the base commit plus deltas. To view the history you need to use the timeline: https://hudi.apache.org/docs/timeline/. This is purely a Hudi question.

  • Great! Thank you for your answer! Timeline was exactly what I was looking for :)
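As a note on the read snippet above: setting only `hoodie.datasource.read.begin.instanttime` is not enough to see earlier commits; the read must also be marked as an incremental query. A hedged sketch of the options (the instant time and S3 path are placeholders):

```python
# Hedged sketch: an incremental read pulls the changes committed after a given
# instant, rather than only the latest snapshot. Without
# hoodie.datasource.query.type = "incremental", the begin-instanttime option
# has no effect. Path and instant time below are placeholders.
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "00000000000000",  # from the start
    # "hoodie.datasource.read.end.instanttime": "...",  # optional upper bound
}
# In the Spark job this would be used as:
# df = (spark.read.format("hudi")
#       .options(**incremental_opts)
#       .load("s3://bucket/path-to-hudi-table/"))
```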
