Update row parquet S3

Hello, we created a job in AWS Glue that extracts a table over a JDBC connection and writes the data to an S3 bucket in Parquet format. Our problem is this: if a row is changed at the source (JDBC), is it possible to change it in the S3 file as well? For example, the Customers table has a Status column. When we first loaded the data, Status was 1, but after an update at the source it became 2. I would like the corresponding row in S3 to be updated so that it matches the source.

asked 2 months ago, 139 views
2 Answers
Accepted Answer

You cannot update an object in place on S3, and a Parquet file in particular cannot be edited row by row. To change a row you have to recreate the file and upload it again, either by reading the old file, applying the changes, and writing it back, or by regenerating it from the source.
Table formats such as Apache Iceberg and Apache Hudi were created for use cases like this. They let you choose between writing the updates to new files that are merged at read time (Merge on Read) and rewriting the affected data files when the update is applied (Copy on Write).
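
As an illustration of the table-format route, here is a minimal sketch that builds a Spark SQL `MERGE INTO` statement of the kind Iceberg supports for upserts. The table name `glue_catalog.sales.customers`, the view name `customers_updates`, and the key column `customer_id` are assumptions made up for this example, and the helper `build_merge_statement` is hypothetical, not part of any library.

```python
def build_merge_statement(target_table, source_view, key_columns):
    """Return a Spark SQL MERGE INTO statement that upserts source rows into target."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Rows are matched between target (t) and source (s) on every key column.
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target_table} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names, chosen only for illustration.
statement = build_merge_statement(
    "glue_catalog.sales.customers", "customers_updates", ["customer_id"]
)
print(statement)

# In a Glue Spark job configured for Iceberg, the statement could be run
# roughly like this (jdbc_df being the DataFrame read from the JDBC source):
#   jdbc_df.createOrReplaceTempView("customers_updates")
#   spark.sql(statement)
```

With a statement like this, a customer whose Status changed from 1 to 2 is updated in the table, and customers that do not exist yet are inserted. Whether the engine rewrites data files or writes separate delete/update files depends on the table's write mode.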

AWS EXPERT
answered 2 months ago
EXPERT reviewed a month ago

Yes, it is possible to achieve the behavior you described using AWS Glue along with some additional AWS services. One common approach to accomplish this is by using AWS Glue with AWS Lambda and Amazon S3 event notifications.

Here's a high-level overview of how you could set this up:

  1. Configure your AWS Glue job to extract data from the JDBC connection and write it to S3 in Parquet format, as you've already done.
  2. Create a Lambda function that triggers on S3 events, specifically on the ObjectCreated event. This Lambda function will be responsible for updating the S3 object whenever changes are detected in the source data.
  3. Configure S3 event notifications to trigger the Lambda function whenever new objects are created in the S3 bucket where your Glue job outputs the data.
  4. Within the Lambda function, implement logic to compare the updated data in the source (JDBC) with the data in the corresponding S3 object. If any changes are detected, update the S3 object accordingly.
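
The steps above could be sketched roughly as follows. This is only an outline under stated assumptions: reading the Parquet object and querying the JDBC source are left out, rows are represented as plain dictionaries, and the helpers `extract_s3_objects` and `merge_rows` are hypothetical names introduced for this example. Note that an ObjectCreated event fires when the Glue job writes a file, so the comparison would run after each load.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def merge_rows(existing_rows, source_rows, key):
    """Overlay source rows on existing rows by key.

    Returns the merged rows and the list of keys that were added or changed.
    """
    merged = {row[key]: row for row in existing_rows}
    changed = []
    for row in source_rows:
        if merged.get(row[key]) != row:
            merged[row[key]] = row
            changed.append(row[key])
    return list(merged.values()), changed


def lambda_handler(event, context):
    # Assumption: the real function would read each object, fetch the current
    # source rows over JDBC, call merge_rows, and rewrite the object if needed.
    targets = extract_s3_objects(event)
    return {"objects": [{"bucket": b, "key": k} for b, k in targets]}
```

For example, merging `[{"id": 1, "status": 1}]` with a source row `{"id": 1, "status": 2}` yields the updated row and reports key `1` as changed; the whole Parquet object would then be rewritten with the merged rows.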
EXPERT
answered 2 months ago
