Why is my RDS for SQL Server instance stuck in the rebooting state?

4 minute read
0

I want to know why it's taking a long time to reboot my Amazon Relational Database Service (Amazon RDS) for Microsoft SQL Server instance.

Short description

If there's a large number of ongoing transactions on the instance, then your DB instance might take a long time to reboot. To resolve this issue, reduce the number of ongoing queries on the DB instance.

The reboot might also appear to be stuck when there's a high number of virtual log files (VLFs) on the instance that slow down the reboot. To resolve this issue, reduce the number of VLFs on the DB instance.

Resolution

Reduce the number of ongoing queries on the DB instance

When you're rebooting your DB instance, all ongoing transactions stop and SQL Server runs the recovery. SQL Server rolls the transactions forward and back so that the database is in a consistent state.

To find the amount of time that the recovery process takes, review the SQL Server error log. The log entry includes the amount of time that's used in each recovery phase.

In the following log entry example, X is the amount of time that SQL Server takes in each phase and completed recovery:

Recovery completed for database DB_NAME (database ID id of database) in X second(s) (analysis X ms, redo X ms, undo X ms [system undo X ms, regular undo X ms].) This is an informational message only. No user action is required.

To resolve this issue, reduce the number of ongoing queries.

To check for active data modification transactions on the database, run the following command:

SELECT r.session_id,       r.start_time,
       r.status,
       r.cpu_time,
       r.total_elapsed_time,
       st.TEXT AS batch_text
FROM sys.dm_exec_requests AS r
     CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
     CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
     where session_id <> @@SPID
ORDER BY cpu_time DESC;

The command output provides information about the sessions that ran, the start time, and SQL text. If queries are still running, then allow the queries to complete before you reboot the instance.

Reduce the number of VLFs on the DB instance

An excessive number of VLFs might accumulate from a small transaction log that manually or automatically grew in small increments.

Before the recovery phases start during reboot, SQL Server serially scans all the VLFs. If a large number of VLFs are found, then the following notification appears in the error log:

"Database ls has more than % virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files."

To reduce the number of VLFs, complete the following steps:

Note: It's a best practice to perform these steps in slow or non-business hours.

  1. Log in to RDS for SQL Server instance from Microsoft SQL Server Management Studio as the primary user.
  2. Run DBCC SQLPERF(LOGSPACE) query to check log file usage.
  3. Shrink the log file of the database that received the high VLF notification in the error log. For more information, see DBCC SHRINKFILE (Transact-SQL) on the Microsoft website.
  4. Expand the transaction log file to an appropriate usage size all at once to avoid automatic growth.

Note: When you shrink the log file, the number of VLFs reduces. A one-time expansion creates a limited number of VLFs that are required for the VLF creation criteria. For more information, see Virtual Log Files (VLFs) on the Microsoft website.

Related information

How can I troubleshoot storage consumption in my RDS for SQL Server DB instance?