Hello everyone.
A Lambda function loads JSON data from a REST API into s3-bucket-1 every day.
This data then needs to be stored in s3-bucket-2 as a flat Parquet table.
I built this as a Glue job, but I have two questions:
1 - Lambda updates only some partitions each day (id=parameter). How can I make the Glue job process only the updated data as well?
2 - The Glue job always writes a new file as its result, so the data ends up duplicated.
How can I avoid this? (One option would be to delete the existing files before writing new ones.)
The Glue job was built in the visual editor, and I did not find the necessary settings there.
Am I right that this can only be solved with code?
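
To make the two questions concrete, here is a minimal plain-Python sketch of the logic involved, not Glue code. It assumes the source keys look like "id=<value>/<file>.json" and that the output table lives under a "table/" prefix; both layouts, and the function names, are hypothetical. The input mimics the "Key" and "LastModified" fields of an S3 object listing.

```python
from datetime import datetime, timezone


def updated_partitions(objects, last_run):
    """Return the id values whose source files changed after last_run.

    `objects` is a list of dicts with "Key" and "LastModified", shaped like
    entries of an S3 listing. The "id=<value>/..." key layout is an assumption.
    """
    ids = set()
    for obj in objects:
        if obj["LastModified"] <= last_run:
            continue  # unchanged since the previous run
        for part in obj["Key"].split("/"):
            if part.startswith("id="):
                ids.add(part[len("id="):])
                break
    return ids


def prefixes_to_clear(ids, table_prefix="table/"):
    """Output prefixes whose old files would be deleted before rewriting.

    The "table/" prefix is a hypothetical location in s3-bucket-2.
    """
    return sorted(f"{table_prefix}id={i}/" for i in ids)


# Example listing: only partition "b" changed after the last run.
listing = [
    {"Key": "id=a/data.json",
     "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Key": "id=b/data.json",
     "LastModified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]
last_run = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)

changed = updated_partitions(listing, last_run)
to_clear = prefixes_to_clear(changed)
```

In this sketch only the changed partitions are reprocessed, and only their output prefixes are cleared before the new files are written, so untouched partitions are neither reread nor duplicated.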
In general, what are the best practices for such a process?
Should I overwrite the files, or write a new version each time and filter for the latest one when reading?
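
The "new version every time" alternative could look roughly like the sketch below. It assumes a hypothetical key layout of "id=<value>/version=<timestamp>/<file>", where the version string sorts lexicographically (for example YYYYMMDD). Again this is plain Python that only illustrates the read-side filter, not Glue code.

```python
def parse_key(key):
    """Extract (id, version) from a key such as
    "id=a/version=20240102/part-0.parquet" (assumed layout)."""
    parts = dict(p.split("=", 1) for p in key.split("/") if "=" in p)
    return parts["id"], parts["version"]


def latest_version_keys(keys):
    """Keep only the keys that belong to the newest version of each id."""
    newest = {}
    for key in keys:
        pid, ver = parse_key(key)
        if ver > newest.get(pid, ""):
            newest[pid] = ver
    return [k for k in keys if parse_key(k)[1] == newest[parse_key(k)[0]]]


keys = [
    "id=a/version=20240101/part-0.parquet",
    "id=a/version=20240102/part-0.parquet",
    "id=b/version=20240101/part-0.parquet",
]
latest = latest_version_keys(keys)
```

The trade-off in this sketch is that writes never delete anything, but every reader has to apply the filter, and old versions keep accumulating until something cleans them up.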
Was a Glue job the right choice for this?