Glue Streaming to Iceberg not reflecting in changelog


I have a Kinesis stream whose data is inserted into an Iceberg table by a Glue streaming job. I'm following the Glue streaming pattern as published here.

After that process completes, another Glue job uses the --extra-jars parameter to load the v1.4.3 release of Iceberg so that I can use the changelog feature. However, when I create the changelog, the activity from the Kinesis stream is NOT reflected in it. Is that because Kinesis is writing directly to S3 and bypassing the actual INSERT into the table?

I have confirmed that an INSERT run via Athena, or as part of a MERGE INTO in a Glue job or in Athena, is picked up by the changelog.

Do I need to use a different method to write to the Iceberg table? For example, one of the methods shown here?

asked 2 months ago, 888 views
2 Answers

You cannot write data files directly to S3 and end up with a proper Iceberg table; writes have to go through the Iceberg format so that the table metadata and snapshots are updated.
It doesn't matter how you make changes to the table: the changelog view is built from the snapshot history up to the point where you create it. Make sure the streaming job creates new snapshots and that the range you pass to the view includes them.
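A minimal sketch of that check, assuming a Spark catalog named `glue_catalog` and a table `db.events` (both hypothetical names, as are the helper functions). The first helper builds a query against the table's `snapshots` metadata table, so you can see whether the streaming job committed anything; the second builds a `create_changelog_view` call bounded by snapshot IDs. According to the Iceberg docs, `start-snapshot-id` is exclusive and `end-snapshot-id` is inclusive.

```python
def snapshots_query(catalog: str, table: str) -> str:
    """Build SQL that lists the table's snapshots, oldest first."""
    return (
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {catalog}.{table}.snapshots ORDER BY committed_at"
    )


def changelog_view_call(
    catalog: str, table: str, start_snapshot_id: int, end_snapshot_id: int
) -> str:
    """Build a create_changelog_view call bounded by two snapshot IDs."""
    return (
        f"CALL {catalog}.system.create_changelog_view("
        f"table => '{table}', "
        f"options => map('start-snapshot-id', '{start_snapshot_id}', "
        f"'end-snapshot-id', '{end_snapshot_id}'))"
    )


if __name__ == "__main__":
    # Hypothetical catalog, table, and snapshot IDs, for illustration only.
    print(snapshots_query("glue_catalog", "db.events"))
    print(changelog_view_call("glue_catalog", "db.events", 1, 2))
```

In a Glue job each string would be passed to `spark.sql(...)`. If the first query shows no `append` snapshots from the streaming job, the changelog view has nothing to report for that period.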

AWS EXPERT
answered 2 months ago
  • To add some more detail: when creating the view, I am filtering with "start-timestamp" instead of "start-snapshot-id", according to these docs: https://iceberg.apache.org/docs/latest/spark-procedures/#usage_17.

    I wonder if using a timestamp is the problem?

  • I don't see a problem with that. The procedure is going to get the history of snapshots and filter it by timestamp. The question is whether the history actually has new snapshots.

  • Yes, agreed, but I can now confirm that when I use the snapshot-id as part of the procedure, it works as intended. But when I pass in a timestamp, I get the following error: "Cannot find snapshot older than 1970-01-20T20:07:01.200+00:00"

    So I believe this must be a bug in the Iceberg code for the changelog procedure.
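One possible explanation for that error, offered as an assumption rather than a confirmed diagnosis: the examples in the Iceberg docs pass `start-timestamp` as epoch milliseconds, and the date in the message is what an epoch-seconds value looks like when it is read as milliseconds. 1970-01-20T20:07:01.200Z is 1714021200 ms after the epoch, and 1714021200 read as seconds is 2024-04-25T05:00:00Z. The sketch below (helper names are hypothetical) shows the conversion in both directions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms: int) -> datetime:
    """Interpret an integer as epoch milliseconds (UTC)."""
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    seconds_value = 1714021200  # epoch seconds for 2024-04-25T05:00:00Z

    # Read as milliseconds, the same number lands in January 1970.
    print(from_epoch_millis(seconds_value).isoformat())

    # The value to pass as 'start-timestamp' if milliseconds are expected.
    print(to_epoch_millis(datetime(2024, 4, 25, 5, 0, tzinfo=timezone.utc)))
```

If the timestamp passed to the procedure was in seconds, multiplying it by 1000 would be worth trying before treating this as an Iceberg bug.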


It seems like you're facing an issue where the activity from the Kinesis stream is not reflected in the Iceberg changelog when using a Glue streaming job. Let's break down your setup and possible reasons for the behavior you're experiencing.

  1. Kinesis Stream to Iceberg via Glue Streaming Job: You're using a Glue streaming job to consume data from a Kinesis stream and write it to an Iceberg table.

  2. Changelog Generation with Extra Jars: You're using another Glue job to generate the changelog using the Iceberg version 1.4.3 with the --extra-jars parameter.

  3. Observation: Activity from the Kinesis stream is not reflected in the changelog, whereas manual inserts or merges via Athena or Glue jobs are reflected.

Possible Reasons:

Direct S3 Writes: If the Kinesis stream is directly writing data to S3 and bypassing the Glue catalog or Iceberg's transaction mechanism, it might not be captured in the changelog. This could be the case if the Glue streaming job is not properly configured to interact with Iceberg or if the writing mechanism bypasses Iceberg altogether.

Transaction Commit: Iceberg captures changes through transaction commits. If the Kinesis stream data isn't being committed within a transaction context, it might not be reflected in the changelog. Ensure that the Glue streaming job is properly committing transactions after writing data to the Iceberg table.

Configuration Issue: There might be a configuration issue in how the Glue streaming job interacts with Iceberg or how Iceberg is configured to capture changes. Check your Glue job settings, Iceberg configuration, and ensure compatibility between the Iceberg version used for writing and reading.

Compatibility Issue: Ensure compatibility between the Iceberg version used for writing data (via Glue streaming job) and generating the changelog (via Glue job with --extra-jars). Incompatibility between versions could lead to issues in capturing changes properly.

Glue Job Execution: Verify that the Glue streaming job and the job for generating the changelog are executed properly without errors. Check logs and monitoring metrics to ensure there are no issues during execution.

To troubleshoot, you may need to dive deeper into the Glue streaming job's configuration, Iceberg setup, and how data is being written from the Kinesis stream to the Iceberg table.
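As a rough illustration of a write path that goes through Iceberg's commit mechanism, the sketch below appends each micro-batch with Spark's `DataFrame.writeTo(...).append()`, which commits a new snapshot on the target table. The table identifier `glue_catalog.db.events` and the function names are hypothetical, and this is not necessarily how the pattern referenced in the question writes its data.

```python
TARGET_TABLE = "glue_catalog.db.events"  # hypothetical catalog.database.table


def make_batch_writer(table: str):
    """Return a foreachBatch callback that appends each micro-batch to `table`."""

    def write_batch(batch_df, batch_id: int) -> None:
        # writeTo(...).append() uses the DataFrameWriterV2 API, so the write
        # is committed through the table's catalog rather than as raw files.
        batch_df.writeTo(table).append()

    return write_batch


# Usage inside a streaming job, assuming `stream_df` is the streaming
# DataFrame read from Kinesis (not shown here):
#
#   stream_df.writeStream \
#       .foreachBatch(make_batch_writer(TARGET_TABLE)) \
#       .start()
```

Every successful append of this kind should appear as a new row in the table's `snapshots` metadata table, which is what the changelog view reads.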

answered 2 months ago
