Skip to content

How do I resolve replication lag in my Amazon RDS for PostgreSQL DB instance?

6 minute read
1

I want to resolve replication lag in my Amazon Relational Database Service (Amazon RDS) for PostgreSQL DB instance.

Short description

When RDS for PostgreSQL read replicas fall behind the primary instance, replication lag can occur.

The read replica uses PostgreSQL's native streaming replication to maintain synchronization. The Write Ahead Log (WAL) receiver on the read replica requests the WAL data from the primary instance. When the WAL sender on the primary instance can't find the requested WAL data, it sends an error to the secondary instance. Then, RDS for PostgreSQL tries to recover archived WAL data from Amazon Simple Storage Service (Amazon S3). If you remove the WAL from the primary instance, then replication can't resume because Amazon S3 doesn't support cross-Region replica recovery.

For more information, see Working with read replicas for Amazon RDS for PostgreSQL.

Note: The following resolution is for streaming replication issues. For information about logical replication, see How do I use logical replication to replicate tables between my Amazon RDS for PostgreSQL DB instances?

Resolution

Identify replication issues

To identify your specific replication issue, review the following metrics:

  • ReplicaLag in Amazon CloudWatch: This metric measures the amount of time in seconds that a read replica lags behind the source DB instance.
  • Replication state in the Amazon Aurora and RDS console: If replication stops, then this field changes to Error.
  • oldestreplicationslotlag in PostgreSQL version 14.1 and later when you use replication slots: This metric shows the amount of WAL data in bytes that the most delayed replica hasn't received.

You can also review the the parameters that control PostgreSQL replication.

When replication lag increases, you might receive event messages similar to the following:

"Streaming replication has stopped." This error means that the streaming replication between primary and replica instances failed. Replication switches to replay from the archive in Amazon S3.

"Streaming replication has been terminated." This error occurs after 30 consecutive days of stopped replication. Amazon RDS terminates replication to prevent storage overuse.

Note: After replication stops, the read replica instance is available but you can't resume replication. To recover from this state, recreate the read replica.

Check for configuration mismatches

It's a best practice to set your read replicas to match or exceed the primary instance's specifications. A smaller instance class or different storage type can cause lag. The replica must process the same write workload as the primary and also process read queries. Modify your read replica instance, if needed.

Review primary instance write load and replica read load

Write operations on the primary instance create many WAL files. To identify write pressure, monitor the following CloudWatch metrics and Enhanced Monitoring values:

  • TransactionLogsDiskUsage
  • TransactionLogsGeneration
  • WriteIOPS
  • WriteThroughput
  • WriteLatency

Check for throughput bottlenecks for your DB instance class type. To avoid issues, distribute write activities across multiple transactions. You can configure CloudWatch alarms for WriteLatency and WriteIOPS to identify heavy writes on the source instance.

High read activity on replicas can slow WAL file replay. To analyze high workload and check for resource contention, use CloudWatch metrics or Enhanced Monitoring on the replica instance. If needed, distribute read traffic across multiple read replicas.

Monitor table locks

RDS for PostgreSQL processes an Access Exclusive lock when you run these commands on the primary instance: DROP TABLE, TRUNCATE, REINDEX, VACUUM FULL, and REFRESH MATERIALIZED VIEW without CONCURRENTLY. For more information, see Explicit locking on the PostgreSQL website.

The Access Exclusive lock prevents access to the table from other transactions for the lock's hold duration. The table remains locked until the transaction ends. WAL records the lock activity, and the read replica replays and holds the activity. The longer the table remains under an Access Exclusive lock, the longer the replication lag.

To prevent this issue, it's a best practice to periodically query the pg_locks and pg_stat_activity catalog tables.

To monitor locks, run the following command:

SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, QUERY AS blocked_query FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

The output shows information about blocked queries and their blocked processes.

Check RDS for PostgreSQL replication parameters

To prevent replication issues, review the parameters that control RDS for PostgreSQL replication. Update the parameter values, if needed.

To accommodate your total number of replication connections, you can configure max_replication_slots. The default value varies by instance class. This value must equal or exceed your total number of replicas.

The max_standby_streaming_delay and max_standby_archive_delay parameters on the replica instance can help complete long-running read queries. If read queries that run on the replica modify the source data, then these parameters pause WAL replay. If you set the value to -1, then the WAL replay waits until the read query completes. However, this pause can increase replication lag indefinitely and cause high storage consumption at the source because of WAL accumulation.

For replication stability, RDS for PostgreSQL automatically manages several parameters: max_connections, max_worker_processes, max_wal_senders, max_prepared_transactions, and max_locks_per_transaction. On your replica, RDS for PostgreSQL sets these parameters to match or exceed the primary instance values.

The hot_standby_feedback parameter allows replicas to report query conflicts to the primary instance. RDS for PostgreSQL turns off this parameter by default. If you activate the parameter, then tables on the primary instance might experience bloat. If you receive the following error message, then use the hot_standby_feedback parameter:

"ERROR: canceling statement due to conflict with recover. Detail: User query might have needed to see row versions that must be removed"

Review best practices for RDS for PostgreSQL replication and cross-Region read replicas

You might also experience replication lag for the following reasons:

For information about these scenarios, see Best practices for Amazon RDS PostgreSQL replication and Best practices for Amazon RDS for PostgreSQL cross-Region read replicas.

Monitor transactions

To check active transactions on the primary instance that might affect replication, run the following command:

SELECT datname, pid, usename, client_addr, backend_start, xact_start, current_timestamp - xact_start AS xact_runtime, state, backend_xmin FROM pg_stat_activity WHERE state='active' OR state='idle in transaction';

The output shows currently running transactions and their duration. Long-running transactions can affect replication performance.

To stop a problematic query, run the following command:

SELECT pg_terminate_backend(PID);

Note: Replace PID with the process ID number of the query that you want to stop.

Related information

Replication on the PostgreSQL website

Using logical replication to replicate managed Amazon RDS for PostgreSQL and Amazon Aurora to self-managed PostgreSQL