How do I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?

I want to identify and resolve what causes high CPU usage in my Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL-Compatible Edition instance.

Short description

To determine what's causing high CPU usage, use the following tools:

  • Amazon CloudWatch metrics
  • Enhanced Monitoring
  • Amazon RDS Performance Insights
  • Native PostgreSQL views and catalogs, such as pg_stat_activity and pg_stat_statements

Resolution

Review CloudWatch metrics

Use CloudWatch metrics to identify CPU patterns over extended periods. Compare the WriteIOPS, ReadIOPS, ReadThroughput, and WriteThroughput graphs with CPU utilization to find the times when the workload causes high CPU.

After you identify the time frame, review the Enhanced Monitoring data that's associated with your DB instance. You can set Enhanced Monitoring to collection intervals of 1, 5, 10, 15, 30, or 60 seconds to collect data at a more granular level.

Use Enhanced Monitoring

Enhanced Monitoring provides a view at the operating system (OS) level. For example, you can review the workload average, CPU distribution (System% or Nice%), and OS process list. For more information, see OS monitoring.

You can check the loadAverageMinute data in intervals of 1, 5, and 15 minutes. When the load average is greater than the number of vCPUs, the instance is experiencing a heavy load. If the load average is less than the number of vCPUs for the DB instance class, then CPU throttling might not cause the application latency. When you're troubleshooting the cause of high CPU usage, check the load average to avoid false positives.

For example, suppose that you have a DB instance that uses a db.m5.2xlarge instance class with 3,000 Provisioned IOPS and reaches the CPU quota. Eight vCPUs are associated with this instance class, so any load average above 8 indicates that the instance is at capacity. In the following example, the load average exceeds 170, so the machine was under heavy load during the measured time frame.

Load Average Minute:

  • Fifteen: 170.25
  • Five: 391.31
  • One: 596.74

CPU Utilization:

  • User (%): 0.71
  • System (%): 4.9
  • Nice (%): 93.92
  • Total (%): 99.97
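The rule of thumb above can be sketched as a small check. This is a hypothetical helper, not an AWS API; the load averages and vCPU count come from Enhanced Monitoring and the instance class:

```python
# Hypothetical helper: flag CPU saturation by comparing the
# Enhanced Monitoring load averages against the vCPU count.
def is_heavy_load(load_averages, vcpus):
    """Return True if any load average exceeds the vCPU count."""
    return any(avg > vcpus for avg in load_averages)

# Values from the example above: db.m5.2xlarge has 8 vCPUs.
print(is_heavy_load([170.25, 391.31, 596.74], vcpus=8))  # True
print(is_heavy_load([0.5, 1.2, 2.0], vcpus=8))           # False
```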

Note: In Enhanced Monitoring, Nice% represents the amount of CPU that your database workload uses.

After you turn on Enhanced Monitoring, you can also check the OS process list that's associated with the DB instance. Enhanced Monitoring shows a maximum of 100 processes, which helps you identify the processes that have the largest impact on performance. You can combine Enhanced Monitoring results with pg_stat_activity results to help identify the resource usage of queries.
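For example, to correlate an OS process ID from the Enhanced Monitoring process list with a database session, you can look the pid up in pg_stat_activity (replace 12345 with a pid from your process list):

```sql
-- Look up the session behind an OS process ID taken from the
-- Enhanced Monitoring process list (12345 is a placeholder).
SELECT pid, usename, datname, state, query
FROM pg_stat_activity
WHERE pid = 12345;
```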

Use Performance Insights

Use Amazon RDS Performance Insights to identify the query that's responsible for the database load. Check the SQL tab that corresponds to a specific time frame.

Check Native PostgreSQL views and catalogs

At the database-engine level, you can use pg_stat_activity and pg_stat_statements. If the issue occurs in real time, then use pg_stat_activity or pg_stat_statements to group the machines, clients, and IP addresses that send the most traffic.

Use the data to check increases over time or increases in application servers. You can also check whether an application server has stuck sessions or locking issues. For more information, see pg_stat_activity and pg_stat_statements on the PostgreSQL website.
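As a sketch, the following query groups sessions by client host, user, and application so that you can see which sources send the most traffic:

```sql
-- Count sessions per client host, user, and application
SELECT client_addr, usename, application_name, count(*) AS connections
FROM pg_stat_activity
GROUP BY client_addr, usename, application_name
ORDER BY connections DESC;
```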

To turn on pg_stat_statements, complete the following steps:

  1. Modify the existing custom DB parameter group.

  2. Add pg_stat_statements to shared_preload_libraries.

  3. Set track_activity_query_size to 4096.

  4. Set pg_stat_statements.track to ALL.

  5. Set pg_stat_statements.max to 10000.

  6. Choose Apply Immediately, and then reboot the DB instance.

  7. On the database that you want to monitor, run the following command:

    demo=> SELECT current_database();
     current_database
    ------------------
     demo
    (1 row)

    demo=> CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    Note: The preceding command installs the extension in the demo database.
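After the reboot, you can confirm the setup with a quick check, for example:

```sql
-- Confirm that the library is loaded (requires the reboot from step 6)
SHOW shared_preload_libraries;

-- Confirm that the extension is installed and collecting data
SELECT count(*) FROM pg_stat_statements;
```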

After you set up pg_stat_statements, use one of the following methods to monitor the output. You can view the queries that spend the most time in the database, the queries with the lowest buffer cache hit ratio, or query statistics on a per-execution basis.

To view which queries spend the most time in the database, run the following query for your PostgreSQL version.

PostgreSQL versions 12 and earlier:

SELECT total_time, query
FROM pg_stat_statements
ORDER BY total_time DESC LIMIT 10;

PostgreSQL versions 13 and later:

SELECT total_plan_time + total_exec_time AS total_time, query
FROM pg_stat_statements
ORDER BY 1 DESC LIMIT 10;

To list the queries with the lowest buffer cache hit ratio, run the following query for your PostgreSQL version.

PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time, rows,
       100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC LIMIT 10;

PostgreSQL versions 13 and later:

SELECT query, calls, total_plan_time + total_exec_time AS total_time, rows,
       100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY 3 DESC LIMIT 10;

To view per-execution statistics so that you can sample queries over time, run the following query for your PostgreSQL version.

PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time/calls AS avg_time_ms,
       rows/calls AS avg_rows,
       temp_blks_read/calls AS avg_tmp_read,
       temp_blks_written/calls AS avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 10;

PostgreSQL versions 13 and later:

SELECT query, calls,
       (total_plan_time + total_exec_time)/calls AS avg_time_ms,
       rows/calls AS avg_rows,
       temp_blks_read/calls AS avg_tmp_read,
       temp_blks_written/calls AS avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY 3 DESC LIMIT 10;

Check for idle connections in the database

Idle connections in the database might use compute resources, such as memory and CPU. When your instance has high CPU utilization, check for idle connections on the database. For more information, see Performance impact of idle PostgreSQL connections.

You can use Enhanced Monitoring to review the OS process list for idle connections. However, the list shows only a maximum of 100 processes. To check for idle connections at the database level, run the following queries.

View current sessions that are idle and active:

SELECT pid, datname, state, current_timestamp - least(query_start, xact_start) AS age,
       application_name, usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>'
AND query NOT ILIKE '%pg_stat_activity%'
AND usename != 'rdsadmin'
ORDER BY query_start DESC;

SELECT application_name, pid, wait_event_type, wait_event,
       current_timestamp - least(query_start, xact_start) AS runtime, query AS current_query
FROM pg_stat_activity
WHERE NOT pid = pg_backend_pid()
AND query NOT ILIKE '%pg_stat_activity%'
AND usename != 'rdsadmin';

Get the connection counts for each user and application name:

postgres=> SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;
    application_name    | count
------------------------+-------
 psql                   |     1
 PostgreSQL JDBC Driver |     1
                        |     5
(3 rows)
postgres=> SELECT usename,count(*) FROM pg_stat_activity GROUP BY usename;  
 usename  | count  
----------+-------  
 master   |     4   
 user1    |     1  
 rdsadmin |     2  
(3 rows)

After you identify the idle connections, run one of the following queries to end the connections:

psql=> SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'example-username'
AND pid <> pg_backend_pid()
AND state IN ('idle');

-or-

SELECT pg_terminate_backend(example-pid);

If your application causes too many connections, then modify the application so that memory and CPU resources aren't spent managing these connections. You can either limit the number of connections, or use a connection pooler, such as PgBouncer. You can also use Amazon RDS Proxy to set up connection pools.
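As a sketch, a minimal PgBouncer configuration might look like the following. The endpoint, database name, and pool sizes are hypothetical examples, not recommendations:

```ini
; pgbouncer.ini - minimal sketch; endpoint and sizes are examples
[databases]
demo = host=example-instance.abc123.us-east-1.rds.amazonaws.com port=5432 dbname=demo

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

With transaction pooling, many application connections share a small number of server connections, which reduces the memory and CPU that the database spends on idle sessions.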

Run the ANALYZE command

The ANALYZE command collects statistics about the contents of tables in the database and stores the results in the pg_statistic system catalog. Then, the query planner uses the statistics to help determine the most efficient execution plans for queries. When you don't frequently run ANALYZE on tables in your database, the queries might use more compute resources, because of stale statistics in the system.

Stale statistics occur for the following reasons:

  • Autovacuum isn't running frequently.
  • You didn't run the ANALYZE operation after a major version upgrade.

Autovacuum checks for bloated tables in the database and reclaims the space for reuse. To make sure that table statistics are regularly updated, the autovacuum daemon also runs the ANALYZE command when the number of inserted, updated, or deleted tuples in a table exceeds the analyze threshold.
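To see how close each table is to that threshold, you can compare the modified-tuple counter against an approximation of the threshold. This sketch assumes the default values of autovacuum_analyze_threshold (50) and autovacuum_analyze_scale_factor (0.1):

```sql
-- Approximate per-table autoanalyze threshold, assuming the
-- default autovacuum_analyze_threshold = 50 and
-- autovacuum_analyze_scale_factor = 0.1
SELECT s.relname,
       s.n_mod_since_analyze,
       50 + 0.1 * c.reltuples AS approx_analyze_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_mod_since_analyze DESC;
```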

For more information, see Routine Vacuuming on the PostgreSQL website.

To know when autovacuum and autoanalyze last ran on the tables, run the following query:

SELECT relname, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables;

To prevent performance issues after a major engine version upgrade, run the ANALYZE command to refresh the pg_statistic table. Run the ANALYZE command for every database in your RDS for PostgreSQL DB instance.

To avoid performance issues because of higher resource utilization, regenerate all statistics. To generate statistics for all regular tables in the current database after a major version upgrade, run the following command without parameters:

ANALYZE VERBOSE;

Check PostgreSQL error logs

Use Amazon RDS to turn on query logging for PostgreSQL. Then, check the PostgreSQL error logs to confirm that you set your log_min_duration_statement and log_statement parameters to appropriate values. For more information, see Error reporting and logging on the PostgreSQL website.
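You can inspect the current values directly from a session. Note that you change these parameters in the DB parameter group, not with SET; the value 1000 below is only an example:

```sql
-- Inspect the current logging thresholds. For example,
-- log_min_duration_statement = 1000 logs every statement
-- that runs longer than one second.
SHOW log_min_duration_statement;
SHOW log_statement;
```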

Reduce CPU usage

After you identify the queries that cause the high CPU, use the following methods to further reduce CPU usage:

  • Use EXPLAIN and EXPLAIN ANALYZE to identify ways to tune query plans. For more information, see Using EXPLAIN on the PostgreSQL website.
  • If a query runs repeatedly, then use prepared statements to lower the pressure on your CPU. Prepared statements cache the query plan, so repeated runs spend less time planning the query.
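Both techniques can be sketched as follows; example_table, its id column, and the literal values are hypothetical:

```sql
-- Inspect the plan and actual runtime of a suspect query
-- (example_table and its filter are placeholders)
EXPLAIN ANALYZE SELECT * FROM example_table WHERE id = 42;

-- Prepare a repeated query once per session, then execute it
-- many times; the parse and plan work is cached
PREPARE get_row (int) AS
  SELECT * FROM example_table WHERE id = $1;
EXECUTE get_row(42);
```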

Related information

Best practices for working with PostgreSQL
