AWS Glue Job - Job Bookmark - Read from a JDBC RDS Postgres instance and write parquet files into S3 on a daily basis


Hi. I'm building a data pipeline in AWS using AWS Glue.

Context: I'm extracting data from an AWS RDS Postgres instance. This instance is the production database of a mobile app (OLTP).

Goal: Extract historical and incremental data. I'm setting up Glue PySpark jobs to extract not only the historical data but also the "daily deltas", and write parquet files into daily folders in an S3 bucket.

Up to now: I've already scheduled a Crawler; it maps and updates the tables and schemas weekly.

Most of the RDS tables have these 3 fields:

"id" "created_at" "updated_at"

The app's backend inserts new rows into RDS ("created_at" == "updated_at"), but it also updates previously inserted rows without changing the ID (so "created_at" < "updated_at").

On the first job run with bookmarks enabled, it will grab the historical data. On the second run (let's say T+1), will the bookmark catch the updated rows?

Besides this, are there any other considerations or advice worth sharing?

Thank you in advance.

1 Answer

By default the bookmark uses the primary key, so it won't detect updates. However, if you always set the updated_at column to the current timestamp whenever you change a row, and you specify updated_at in jobBookmarkKeys (see the documentation), then the next run will retrieve the updated rows as well as the new ones.
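For example, here is a minimal sketch of such a job, assuming a catalog database named my_database, a crawled table my_table, and an S3 bucket my-bucket (all hypothetical names you'd replace with your own):

```python
import sys
from datetime import date

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled RDS table; bookmark on updated_at instead of the PK.
# transformation_ctx is required for the bookmark to track state.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # hypothetical catalog database
    table_name="my_table",    # hypothetical crawled table
    transformation_ctx="read_my_table",
    additional_options={
        "jobBookmarkKeys": ["updated_at"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# Write parquet into a dated S3 prefix (one folder per daily run).
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": f"s3://my-bucket/my_table/dt={date.today().isoformat()}/"
    },
    format="parquet",
    transformation_ctx="write_my_table",
)

# Commit so the bookmark state is persisted for the next run.
job.commit()
```

One consequence of this approach: a row that is updated after the last committed updated_at value is re-read in full, so the same id can appear in more than one daily folder. Downstream consumers would typically deduplicate by id, keeping the row with the latest updated_at.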

AWS
Expert
answered 1 year ago
