Monitoring and troubleshooting query performance using system tables on Amazon Redshift provisioned clusters

9 minute read
Content level: Advanced

This article provides guidance on monitoring queries, tracking query progress and status, and troubleshooting performance issues related to the cluster or specific queries, utilizing the system tables available in Amazon Redshift provisioned clusters.

Introduction:

This article introduces key system tables that enable you to monitor running queries on your Amazon Redshift cluster, check query status, gather execution metrics, perform troubleshooting, and review recommendations to improve cluster and query performance. By leveraging these system tables, you can gain insight into query execution behavior, cluster workload, and resource utilization, analyze execution plans, and identify potential bottlenecks.

Queries to use for troubleshooting when the issue is live:

  1. To check the active running queries on the cluster:
SELECT * FROM stv_inflight ORDER BY starttime;

Check the starttime of the query to understand how long the query has been running on the Redshift cluster.
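
You can also compute the elapsed time directly. Here is a minimal sketch that adds an elapsed_seconds column from STV_INFLIGHT (the text column holds only the beginning of the query text):
SELECT pid, query, starttime,
       DATEDIFF(s, starttime, GETDATE()) AS elapsed_seconds,
       TRIM(text) AS query_snippet
FROM stv_inflight
ORDER BY starttime;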

  2. To check the queries that have been submitted to the cluster but are waiting for a table lock or a WLM slot:
SELECT * FROM stv_recents WHERE status <> 'Done' AND pid NOT IN (SELECT pid FROM stv_inflight) ORDER BY duration DESC;

If any queries are reported by the above query in stv_recents but do not appear in stv_inflight, those queries are either waiting to acquire a lock on the target table(s) involved in the query or waiting for a WLM slot. Check the starttime or duration of the query to understand how long it has been waiting for the table lock or WLM slot. If you suspect queries are running long or are hung and not progressing, run the locking query below:

  3. Run the locking query to identify the sessions that are holding locks and blocking other sessions that are waiting for a lock:
select a.txn_owner, a.txn_db, a.xid, a.pid, a.txn_start, a.lock_mode, a.relation as table_id,
       nvl(trim(c."name"), d.relname) as tablename, a.granted, b.pid as blocking_pid,
       datediff(s, a.txn_start, getdate())/86400 || ' days ' || datediff(s, a.txn_start, getdate())%86400/3600 || ' hrs ' || datediff(s, a.txn_start, getdate())%3600/60 || ' mins ' || datediff(s, a.txn_start, getdate())%60 || ' secs' as txn_duration
from svv_transactions a
left join (select pid, relation, granted from pg_locks group by 1, 2, 3) b
  on a.relation = b.relation and a.granted = 'f' and b.granted = 't'
left join (select * from stv_tbl_perm where slice = 0) c
  on a.relation = c.id
left join pg_class d on a.relation = d.oid
where a.relation is not null;

If you see granted=false and a blocking_pid that has been holding a lock for a long time, check whether that transaction/query is still active using STV_INFLIGHT, or use SVL_STATEMENTTEXT to determine whether the transaction was left open because it was not closed properly (a missing COMMIT or END at the end of the transaction to close it and release the lock). To check, pass the xid of the blocking pid into the queries below:

SELECT * FROM stv_inflight WHERE xid = <blocking_xid>;
SELECT * FROM svl_statementtext WHERE xid = <blocking_xid> AND sequence = 0 ORDER BY xid, sequence, starttime;

If a transaction in STV_INFLIGHT has no active statement running and a long time has elapsed since the last statement submission according to svl_statementtext, it may be a leftover open transaction. Check why the transaction was not committed or closed properly from the client tool (missing COMMIT or END). If needed, abort the PID using the query below to kill the transaction and release the lock so that other waiting transactions can acquire the table lock.

SELECT pg_terminate_backend(<blocking_pid>);
  4. Check the status of the query in the WLM phase:
SELECT * FROM stv_wlm_query_state WHERE query = <query_id>;
  5. Count the number of queries in each state for each queue to understand how many queries are running and queued on the cluster:
SELECT service_class, state, COUNT(query) FROM stv_wlm_query_state WHERE service_class >= 6 GROUP BY 1, 2 ORDER BY 1, 2;
  6. To check whether the query is progressing, monitor the variation in the row count by running this query at 1-minute or 5-minute intervals. If the row count is not changing, the query might be hung:
SELECT * FROM STV_EXEC_STATE WHERE query = <query_id> ORDER BY segment, step, slice;
--(or)
SELECT SUM(rows) FROM STV_EXEC_STATE WHERE query = <query_id>;

For long-running queries in STV_INFLIGHT, assess their criticality and potential impact. If they are non-critical, consider aborting them to free WLM slots for high-priority workloads, as one long-running query can cause resource contention and degrade system performance. Implement WLM query monitoring rules to control such queries and prevent impact on other cluster workloads. Review the factors affecting query performance and fine-tune queries before rerunning them, as there can be several reasons for prolonged execution times on Redshift clusters.
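
If query monitoring rules are already configured, you can check whether any rule has fired (log, hop, or abort) for a given query. Here is a minimal sketch using STL_WLM_RULE_ACTION; replace <query_id> with your query ID:
SELECT query, service_class, TRIM(rule) AS rule_name, TRIM(action) AS rule_action, recordtime
FROM stl_wlm_rule_action
WHERE query = <query_id>
ORDER BY recordtime;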

Queries to use for troubleshooting a query that has either completed or been aborted:

  1. To check the status of the query completion and its duration:
SELECT * FROM STL_QUERY WHERE query=<query_id>;
  2. To check the time taken and the rows processed at each step and slice level of the query execution:
SELECT * FROM SVL_QUERY_REPORT WHERE query = <query_id> ORDER BY segment, step, slice;
  3. To check the time taken and the rows processed at each step of the query execution:
SELECT * FROM SVL_QUERY_SUMMARY WHERE query = <query_id> ORDER BY stm, seg, step;
  4. To check the overall query metrics summary of the query execution:
SELECT * FROM svl_query_metrics_summary WHERE query = <query_id>;
  5. To check the complete SQL text of the query (the text is stored in 200-character chunks; see the reconstruction sketch after this list):
SELECT * FROM SVL_STATEMENTTEXT WHERE xid = (SELECT xid FROM stl_query WHERE query = <query_id>) ORDER BY xid, sequence, starttime;
--(or)
SELECT * FROM STL_QUERYTEXT WHERE query=<query_id> order by query, sequence;
  6. To check the query execution details of a particular query in the WLM phase (queue start, execution start and end, and so on):
SELECT * FROM stl_wlm_query WHERE query = <query_id>;
  7. To check how much time the query spent in the compile phase. This typically increases for the first runs after a patch is applied to the cluster:
SELECT * FROM svl_compile where query=<query_id> order by query, segment;
  8. To check the commit time of the transaction query:
SELECT node, DATEDIFF(s, startqueue, startwork) AS queue_time_s, DATEDIFF(s, startwork, endtime) AS commit_time_s, queuelen FROM stl_commit_stats WHERE xid = <xid> ORDER BY node;
  9. To check the details of the tables involved in a query:
SELECT * FROM SVV_TABLE_INFO WHERE "table" IN (SELECT DISTINCT tablename FROM (SELECT TRIM(perm_table_name) AS tablename FROM stl_scan WHERE query = <query_id>));
  10. To identify opportunities to improve query performance by reviewing the alerts logged:
SELECT * FROM stl_alert_event_log WHERE query = <query_id> ORDER BY query, segment, step, slice;
  11. To check whether Amazon Redshift took any automated actions on tables defined for automatic optimization:
SELECT * FROM SVL_AUTO_WORKER_ACTION where table_id in (select distinct tbl from stl_scan where query=<query_id>);
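
Because SVL_STATEMENTTEXT and STL_QUERYTEXT store the SQL in 200-character chunks, you may want to reconstruct the full statement as a single value. Here is a minimal sketch that stitches the chunks together with LISTAGG; replace <query_id> with your query ID:
SELECT query,
       LISTAGG(CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END)
         WITHIN GROUP (ORDER BY sequence) AS full_query_text
FROM stl_querytext
WHERE query = <query_id>
GROUP BY query;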

The above queries help you check query status and execution metrics and understand whether a query is progressing or hung on your Amazon Redshift cluster. Always compare performance between warm runs, as the first run will be slower after a cluster reboot or patch.

Compare the performance of a slow run with a fast run of the same query:

To debug at the query level, obtain the query ID of the faster run and compare it with the query ID of the slower run of the same query using the Redshift system tables mentioned in the previous section (for completed or aborted queries). This helps troubleshoot differences and deviations in query text, execution plan, data volume processed, and so on between the faster and slower runs.

For example:

  1. SVL_QUERY_METRICS_SUMMARY provides a complete summary of the query execution metrics for each query:
SELECT * FROM svl_query_metrics_summary WHERE query in (<slow_run_query_id>,<fast_run_query_id>) ORDER BY query;
  2. SVL_QUERY_SUMMARY provides execution details at the step level for each query:
select * from SVL_QUERY_SUMMARY where query in (<slow_run_query_id>,<fast_run_query_id>) order by query, stm, seg, step;
  3. STL_EXPLAIN provides the plan generated for both queries:
select query,nodeid,parentid,plannode from stl_explain where query in (<slow_run_query_id>,<fast_run_query_id>) order by query,nodeid;
  4. To get the list of tables with missing statistics on the Redshift cluster, along with the number of times each was reported:
SELECT REGEXP_REPLACE(TRIM(plannode), '.*Tables missing statistics: \(([^)]+)\) -----.*', '\\1') AS table_name, COUNT(*) AS count
FROM stl_explain WHERE plannode LIKE '%missing statistics%'
GROUP BY table_name ORDER BY count DESC;

To maintain table hygiene and improve performance, follow these best practices:

  • Periodically run VACUUM on tables with a high unsorted percentage (> 20) and a high vacuum_sort_benefit to enhance query performance (see the example commands after this list).
    • select * from svv_table_info where unsorted > 20 order by vacuum_sort_benefit desc limit 20;
  • Periodically run ANALYZE on tables with stale statistics (stats_off > 20) to update statistics and generate optimal plans.
    • select * from svv_table_info where stats_off > 20 order by size,stats_off desc limit 20;
  • Review tables with large VARCHAR columns, inefficient sort keys, or compression encoding on the first sort key column using SVV_TABLE_INFO, as these can impact query performance.
  • Regularly review alerts and recommendations from the Redshift Advisor and implement them to optimize cluster performance.
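
As a reference for the VACUUM and ANALYZE recommendations above, the commands themselves are straightforward. Here is a minimal sketch using a hypothetical table name my_schema.my_table; adjust the table name and the VACUUM option (FULL, SORT ONLY, or DELETE ONLY) to your workload:
-- Re-sort rows and reclaim space freed by deleted rows (hypothetical table name)
VACUUM FULL my_schema.my_table;
-- Refresh table statistics so the planner can generate optimal plans
ANALYZE my_schema.my_table;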

Here are some data points to review for any changes in the cluster, query, or workload that could be causing long-running queries on Redshift:

  • Check for unusual spikes in DatabaseConnections, CPUUtilization, PercentageDiskSpaceUsed, CommitQueueLength, QueryPlanning and QueryWaiting using Cluster Metrics or CloudWatch Metrics for your Redshift cluster.
  • Compare query text, explain plans, and execution metrics between faster and slower instances of the same query.
  • Determine if it's a new query needing optimization or a previously well-performing query that degraded.
  • Check if recent patches, updates, or version changes on the cluster could have impacted performance, and identify which environments (prod, QA, dev) and Redshift versions are affected.
  • Note the approximate date, time, and timezone when performance issues started, correlating with other changes in the cluster, such as recent WLM settings or Concurrency Scaling configurations.
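
To help correlate the onset of performance issues with workload trends, you can aggregate queue and execution time by hour from STL_WLM_QUERY. A minimal sketch (total_queue_time and total_exec_time are in microseconds):
SELECT DATE_TRUNC('hour', service_class_start_time) AS hour,
       COUNT(*) AS queries,
       SUM(total_queue_time) / 1000000 AS total_queue_secs,
       SUM(total_exec_time) / 1000000 AS total_exec_secs
FROM stl_wlm_query
WHERE service_class >= 6
GROUP BY 1
ORDER BY 1;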

If performance issues on your Redshift cluster or specific queries persist after reviewing the scenarios mentioned above, contact AWS Support and provide the following details to assist in further investigation:

  • Redshift cluster ID or ARN
  • Query IDs for both slow and fast runs of the same query
  • Query text (if required)
  • Explain plan (if applicable for both queries)

You can also use SYS monitoring views to check query status and troubleshoot query performance issues on both Redshift provisioned clusters and Redshift Serverless. Refer to Redshift Serverless Monitoring or the STL to SYS Mapping to find the corresponding SYS views. Note that while the query ID will differ between the SYS monitoring views and the STV/SVL tables, the xid and pid/session_id remain consistent.
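
For example, here is a minimal sketch of checking a query's status and runtime through SYS_QUERY_HISTORY; replace <query_id> with the query ID shown by the SYS views (elapsed_time, queue_time, and execution_time are reported in microseconds):
SELECT query_id, transaction_id, session_id, status, start_time, end_time,
       elapsed_time, queue_time, execution_time, returned_rows, query_text
FROM sys_query_history
WHERE query_id = <query_id>;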

Additionally, review cluster performance, best practices, diagnostic queries, and troubleshooting common issues to identify and resolve performance bottlenecks in your Redshift cluster.
