Update a row in a Parquet file on S3


Hello, we created a job in AWS Glue that extracts a table over a JDBC connection and writes the data to an S3 bucket in Parquet format. Our problem is this: if a row is changed at the source (JDBC), is it possible to have it changed in the S3 file as well? For example, the Customers table has a Status column. When we first loaded the table, Status was 1 for a given customer, but after an update at the source it became 2. I would like the corresponding row in S3 to be updated so that it matches the source.

asked 3 months ago, 177 views
2 Answers
Accepted Answer

You cannot update an object on S3 in place, and Parquet does not let you modify individual rows inside a file either. To change a row you have to recreate the file and upload it again, either by reading the old file, applying the changes, and writing it back, or by regenerating it from the source.
Table formats such as Iceberg and Hudi were created for exactly this kind of use case: you can choose to keep the updates in new files that are merged at read time (Merge on Read) or to have the affected data files rewritten for you (Copy on Write).
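As a rough illustration of the "rewrite the file" idea, here is a minimal sketch of the merge step in plain Python. It assumes the rows have already been read into dictionaries and that `customer_id` is the primary key; both the column names and the `upsert_rows` helper are hypothetical, and the actual Parquet read and write (for example with Spark inside the Glue job) is left out.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def upsert_rows(existing: Iterable[Row], changes: Iterable[Row], key: str) -> List[Row]:
    """Build a new snapshot: rows in `changes` replace rows in `existing`
    that share the same `key` value; unmatched changes are appended."""
    # Index the old snapshot by primary key (dicts keep insertion order).
    merged = {row[key]: row for row in existing}
    for row in changes:
        # Overwrite the matching row, or add the row if the key is new.
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical data: what was first loaded to S3, and what changed at the source.
old_snapshot = [
    {"customer_id": 1, "name": "Ana", "status": 1},
    {"customer_id": 2, "name": "Bruno", "status": 1},
]
source_changes = [{"customer_id": 1, "name": "Ana", "status": 2}]

# The result is what would be written out as the replacement Parquet file.
new_snapshot = upsert_rows(old_snapshot, source_changes, key="customer_id")
```

The old file is never edited; a complete new snapshot is produced and written over it. Iceberg and Hudi in Copy on Write mode do this kind of rewrite for you, limited to the data files that contain changed rows.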

AWS
EXPERT
answered 3 months ago
EXPERT
reviewed 2 months ago

Yes, it is possible to achieve the behavior you described using AWS Glue along with some additional AWS services. One common approach to accomplish this is by using AWS Glue with AWS Lambda and Amazon S3 event notifications.

Here's a high-level overview of how you could set this up:

  1. Configure your AWS Glue job to extract data from the JDBC connection and write it to S3 in Parquet format, as you've already done.
  2. Create a Lambda function that triggers on S3 events, specifically on the ObjectCreated event. This Lambda function will be responsible for updating the S3 object whenever changes are detected in the source data.
  3. Configure S3 event notifications to trigger the Lambda function whenever new objects are created in the S3 bucket where your Glue job outputs the data.
  4. Within the Lambda function, implement logic to compare the updated data in the source (JDBC) with the data in the corresponding S3 object. If any changes are detected, update the S3 object accordingly.
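Steps 2 to 4 could be sketched roughly as follows. This only shows how a Lambda handler might pull the bucket and object key out of an S3 event notification; `reconcile_with_source` is a hypothetical placeholder for the comparison and rewrite logic, which depends on your JDBC source and is not implemented here.

```python
from urllib.parse import unquote_plus


def changed_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def reconcile_with_source(bucket, key):
    """Hypothetical placeholder: compare the object with the JDBC source
    and rewrite it if rows differ. Here it only reports what it would do."""
    return f"would reconcile s3://{bucket}/{key}"


def lambda_handler(event, context):
    # Only Parquet objects are of interest; other keys are skipped.
    results = [
        reconcile_with_source(bucket, key)
        for bucket, key in changed_objects(event)
        if key.endswith(".parquet")
    ]
    return {"processed": results}


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "customers/load_date%3D2024-01-01/part-0000.parquet"},
            },
        }
    ]
}
response = lambda_handler(sample_event, None)
```

If the function writes its result back under the same bucket and prefix that triggers it, the new object fires another ObjectCreated event, so a sketch like this assumes a separate output prefix or an equivalent guard.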
EXPERT
answered 3 months ago
