Most accurate way to measure the replica lag in RDS Postgres

0

I have an RDS instance with Postgres 13 as the DB engine. I also have a read replica created with the same instance class as that of the source database and configured my application to route the read traffic to the replica. Now when I see 'ReplicaLag' parameter in CloudWatch, it shows a delay of anywhere between 5 to 10 seconds, which is too high and is impractical for our solution. But if I insert a record in primary and do a select on that record within a second from the replica, I could successfully fetch the record. This means that the lag is not really 5 to 10 seconds? Then I came across this article from AWS where they state the following:

If no user transactions are running on the source DB instance, the associated PostgreSQL read replica reports a replication lag of up to five minutes. The replica lag is calculated as currentTime - lastCommitedTransactionTimestamp, which means that when no transactions are being processed, the value of replica lag increases for a period of time until the write-ahead log (WAL) segment switches. By default RDS for PostgreSQL switches the WAL segment every 5 minutes, which results in a transaction record and a decrease in the reported lag.

So this means 'ReplicaLag' metric is not trustworthy all the time. And yes, my use case does not involve too many writes, so it is a possibility that the replica lag I'm seeing with the above metric is not really a 'lag'. And the same article suggests me to rather monitor the lag from the DB level with the following SQL: SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS "ReplicaLag";. But I always get NULL as the output and I am confused if it means that there is no lag or there's something wrong with the query.

I also came across OldestReplicationSlotLag in CloudWatch and that is always shown as zero.

I am in search of a straightforward and reliable metric that can give me a realistic picture of lag between my source and replica database.

2 Answers
0

I'm not sure if it perfectly applies to this case, but I found the following knowledge, so could you check it?
Normally, The most common reasons for increase in the replica lag are the following:

  • Configuration differences between the primary and replica instances
  • Heavy write workload on the primary instance
  • Transactions that are running for a long time
  • Exclusive lock on primary instance tables
  • Corrupted or missing WAL file
  • Network issues
  • Incorrect parameter setting
  • No transactions

Why do I have replication lags and errors in my RDS for PostgreSQL DB instance?
https://repost.aws/knowledge-center/rds-postgresql-replication-lag

AWS
EXPERT
hyp
answered 24 days ago
profile picture
EXPERT
reviewed 22 days ago
  • Thanks, but I have gone through this and before I dig further in, I want to make sure that I am using the correct metric to measure the lag and that's what my question is about.

0

Hi,

You may try to get the delay in bytes (not in ms, sorry) from the origin side by using pg_xlog_location_diff to compare the origin's pg_current_xlog_insert_location with the replay_location for that backend's pg_stat_replication entry.

Look at this article for more details: https://medium.com/@sajithchandran/monitoring-postgresql-replication-35d70fd86cef

Best,

Didier

profile pictureAWS
EXPERT
answered 24 days ago
profile picture
EXPERT
reviewed 22 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions