1 Answer
The catalog versions don't tell you anything about the data values. It's also not clear how Redshift relates to the Hudi table timeline. It sounds like you are looking for something like the MERGE INTO command to get more control over the upsert.
Thank you for your response! Sorry, I think my question needs a bit more clarity.
I already have different parquet files in the S3 bucket, which I saved using
'hoodie.cleaner.commits.retained': 5, 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'
but when I load the same table, I am getting only the latest commit:

df = (spark.read.format('org.apache.hudi')
          .option("hoodie.datasource.read.begin.instanttime", "_some_early_commits_time")
          .load("s3://bucket/path-to-hudi-table/"))
but I was hoping to get all the commits when loading!
You normally only read one commit: with COW (copy-on-write) each commit rewrites the file, and with MOR (merge-on-read) you have a base file plus deltas. To view the history you need to use the timeline: https://hudi.apache.org/docs/timeline/. This is purely a Hudi question.
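As a side note on the snippet above: setting only `hoodie.datasource.read.begin.instanttime` is not enough, because the default query type is a snapshot query, which always returns the latest commit. Here is a minimal sketch of the options for a Hudi incremental query, assuming the standard Hudi Spark datasource option names; the Spark session and the "000" begin instant are illustrative placeholders, not values from this thread.

```python
# Sketch of Hudi incremental-read options (assumption: standard Hudi
# Spark datasource option keys). Plain dict so the options are easy
# to inspect and reuse.
incremental_opts = {
    # Switch from the default snapshot query (latest commit only)
    # to an incremental query over a range of commits.
    "hoodie.datasource.query.type": "incremental",
    # Return changes after this instant; "000" means "from the beginning".
    "hoodie.datasource.read.begin.instanttime": "000",
}

# Usage (requires a SparkSession with the Hudi bundle on the classpath):
# df = (spark.read.format("hudi")
#           .options(**incremental_opts)
#           .load("s3://bucket/path-to-hudi-table/"))
```

Note that this only reaches commits still present on the timeline: with `hoodie.cleaner.policy = KEEP_LATEST_COMMITS` and `hoodie.cleaner.commits.retained = 5` as in the question, commits older than the last five may already have been cleaned and are no longer readable.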
Great! Thank you for your answer! Timeline was exactly what I was looking for :)