AWS Glue Data Catalog Table Version comparison

I'm asking whether it's possible to access the previous version of the catalog table from within the ETL job, so that I can examine a specific column's contents. Currently, as I update the table from the raw bucket to the processed one, a record's older values are overwritten by the new values. The code for this step is:

df.write.format("hudi") \
    .options(**combinedConf_Upserts) \
    .mode("append") \
    .save()

Within the Spark-based ETL job that writes the data to Redshift, my goal is to compare the previous and current versions of each record and check whether any values have changed.
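The version comparison itself can be sketched independently of where the two versions come from. A minimal sketch of that logic, with a hypothetical record shape and column names:

```python
# Hedged sketch of the comparison step: given the previous and the current
# version of a record (however they are fetched), flag the columns that changed.
# The record shape and column names here are hypothetical.
def changed_columns(old: dict, new: dict) -> list:
    """Return the columns whose values differ between two record versions."""
    return [col for col in new if old.get(col) != new[col]]

old = {"id": 1, "status": "pending", "amount": 10}
new = {"id": 1, "status": "shipped", "amount": 10}
# changed_columns(old, new) -> ["status"]
```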

Deeins
Asked 7 months ago · 363 views
1 Answer
Accepted Answer

The catalog table versions don't tell you anything about the data values. It's also not clear how Redshift relates to the Hudi table timeline. It sounds like you are looking for something like the "MERGE INTO" command, which gives you more control over the upsert.
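For reference, Hudi supports MERGE INTO through Spark SQL (with the Hudi Spark session extension enabled). A rough sketch of such a statement; the table and column names here are hypothetical:

```python
# Hedged sketch: a Spark SQL MERGE INTO against a Hudi table gives row-level
# control over the upsert, e.g. only updating when a value actually changed.
# Table/column names (processed_table, staged_updates, record_id, value)
# are placeholders, not from the original question.
merge_sql = """
MERGE INTO processed_table AS target
USING staged_updates AS source
ON target.record_id = source.record_id
WHEN MATCHED AND target.value <> source.value THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
# In a Glue/Spark job this would run as: spark.sql(merge_sql)
# (requires spark.sql.extensions =
#  org.apache.spark.sql.hudi.HoodieSparkSessionExtension)
```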

AWS
EXPERT
Answered 7 months ago
  • Thank you for your response! Sorry, I think my question needs a bit more clarity!

    I already have different parquet files in the S3 bucket, which I saved using
    'hoodie.cleaner.commits.retained': 5, 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'

    but when I load the same table, I get only the latest commit

    df = spark.read.format("org.apache.hudi") \
        .option("hoodie.datasource.read.begin.instanttime", "_some_early_commits_time") \
        .load("s3://bucket/path-to-hudi-table/")

    but I was hoping to get all the commits when loading!

  • You normally only see one commit: on COW the file is overwritten, and on MOR you have the base commit plus deltas. To view the history you need to use the timeline: https://hudi.apache.org/docs/timeline/. This is purely a Hudi question.

  • Great! Thank you for your answer! Timeline was exactly what I was looking for :)
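As a note on the read snippet above: setting only `hoodie.datasource.read.begin.instanttime` is not enough to see earlier commits; the read must also be marked as an incremental query. A hedged sketch of the options (the instant time and S3 path are placeholders):

```python
# Hedged sketch: an incremental read pulls the changes committed after a given
# instant, rather than only the latest snapshot. Without
# hoodie.datasource.query.type = "incremental", the begin-instanttime option
# has no effect. Path and instant time below are placeholders.
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "00000000000000",  # from the start
    # "hoodie.datasource.read.end.instanttime": "...",  # optional upper bound
}
# In the Spark job this would be used as:
# df = (spark.read.format("hudi")
#       .options(**incremental_opts)
#       .load("s3://bucket/path-to-hudi-table/"))
```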
