How do I troubleshoot Xid errors on my NVIDIA GPU-accelerated EC2 Linux instance?


When running my application on an NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) Linux instance, my application crashed and I found GPU-specific Xid errors in the system log. I want to retrieve diagnostic information from the GPU and troubleshoot GPU-related Xid errors.

Short description

AWS offers multiple EC2 instance families with GPU acceleration. GPUs are passed through to guest instances for all GPU-accelerated EC2 instance families. This lets you use the full capabilities of the GPU hardware.


Reading and interpreting nvidia-smi diagnostics

Use the nvidia-smi tool to retrieve statistics and diagnostics about the health and performance of the NVIDIA GPUs that are attached to your instance. The tool is included with the NVIDIA GPU driver, which is preinstalled on every variant of the Deep Learning Amazon Machine Image (AMI). For details on how to install the NVIDIA GPU driver on any GPU instance family, see Install NVIDIA drivers on Linux instances.

Run the sudo nvidia-smi -q command to query statistics.

Memory statistics example

ECC Errors
        Volatile                              # Errors counted since last GPU driver reload 
            SRAM Correctable            : 0
            SRAM Uncorrectable          : 0
            DRAM Correctable            : 0
            DRAM Uncorrectable          : 0
        Aggregate                             # Errors counted for the life of the GPU
            SRAM Correctable            : 0
            SRAM Uncorrectable          : 0
            DRAM Correctable            : 0
            DRAM Uncorrectable          : 0

All generations of NVIDIA GPUs record both aggregate and volatile memory statistics. Aggregate ECC error counters persist for the life of the GPU, so a positive aggregate value doesn't indicate that the instance is encountering a hardware issue or a faulty GPU. The errors might have occurred in the past, so it's important to review the volatile counters instead.

Volatile ECC error counters start from zero each time that the GPU driver reloads, so they reflect errors that occurred on the current instance. If volatile uncorrectable ECC errors are incrementing, then reboot the instance or reset the GPU. Depending on the instance type and GPU generation, rebooting initiates either page retirement or row remapping for bad memory pages.
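The volatile counters can also be checked programmatically. The following is a minimal Python sketch that sums the volatile uncorrectable ECC counters from `sudo nvidia-smi -q` output. It assumes the section layout shown above; the sample text and its nonzero values are illustrative.

```python
import re

def volatile_uncorrectable(report: str) -> int:
    """Sum the volatile uncorrectable ECC counters from `nvidia-smi -q` output."""
    # The Volatile block ends where the Aggregate block begins.
    block = re.search(r"Volatile(.*?)Aggregate", report, re.S)
    if not block:
        return 0
    total = 0
    for line in block.group(1).splitlines():
        if "Uncorrectable" in line:
            # Lines look like "DRAM Uncorrectable          : 2"
            total += int(line.split(":")[1])
    return total

# Illustrative sample with a nonzero volatile DRAM counter
sample = """\
ECC Errors
    Volatile
        SRAM Correctable            : 0
        SRAM Uncorrectable          : 0
        DRAM Correctable            : 0
        DRAM Uncorrectable          : 2
    Aggregate
        SRAM Correctable            : 0
        SRAM Uncorrectable          : 1
        DRAM Correctable            : 0
        DRAM Uncorrectable          : 3
"""

if volatile_uncorrectable(sample) > 0:
    print("Volatile uncorrectable ECC errors: reboot the instance or reset the GPU.")
```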

P3, P3dn, G4dn instances

    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending Page Blacklist      : No

Early generations of NVIDIA GPUs use dynamic page retirement. You can safely ignore single-bit ECC errors because they are generally benign.

If the GPU firmware identifies double-bit errors, then the GPU stops processing and the application abruptly exits. When double-bit errors occur, an Xid error is recorded in the operating system (OS) log and the Pending Page Blacklist status is Yes. To resolve these errors, reboot the instance to retire the bad memory location. After the reboot, the Pending Page Blacklist status resets to No.

Note: The error counters persist for the life of the GPU. So, a non-zero counter at instance launch isn't indicative of an active hardware issue or faulty GPU.
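This check can be scripted. The following is a minimal Python sketch that flags a pending page blacklist, assuming the Retired Pages layout shown above; the sample values are illustrative.

```python
import re

def pending_page_blacklist(report: str) -> bool:
    """Return True if the Pending Page Blacklist status in `nvidia-smi -q` output is Yes."""
    match = re.search(r"Pending Page Blacklist\s*:\s*(\w+)", report)
    return match is not None and match.group(1) == "Yes"

# Illustrative sample showing a pending blacklist after a double-bit error
sample = """\
Retired Pages
    Single Bit ECC              : 0
    Double Bit ECC              : 1
    Pending Page Blacklist      : Yes
"""

if pending_page_blacklist(sample):
    print("Reboot the instance to retire the bad memory location.")
```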

P4d, P4de, G5, and G5g instances

    Remapped Rows
        Correctable Error                 : 0 # Can safely ignore.
        Uncorrectable Error               : 0 # If > 0, review system logs for Xid errors
        Pending                           : No # If Yes, an instance reboot or GPU reset is required.
        Remapping Failure Occurred        : No # Should always be No. If Yes, please stop/start the instance.

Later instance families with A100 and A10G GPUs isolate and contain memory errors by row remapping. Similar to dynamic page retirement, row remapping prevents reuse of known degraded memory locations. Row remapping replaces the page retirement scheme in earlier generation GPUs.

You can ignore correctable memory errors. Uncorrectable errors might cause errors or abrupt exits of the application. Uncorrectable errors are logged to the OS system log as Xid errors.

Pending remapped rows that are activated after an uncorrectable error require a GPU reset to retire the bad memory location. Reboot the instance to reset the GPU. Or, run the following command to manually reset the GPU:

sudo nvidia-smi -i <GPU UUID> -r

If a remapping failure occurs, then stop and start the instance. Stopping and starting the instance migrates the instance to a new underlying host with a healthy GPU.
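The triage logic above can be sketched in a few lines of Python. This is a minimal example that reads the Remapped Rows section and suggests the next step, assuming the field layout shown above; the sample values are illustrative.

```python
import re

def remapped_rows_action(report: str) -> str:
    """Suggest a next step from the Remapped Rows section of `nvidia-smi -q` output."""
    def field(name: str) -> str:
        match = re.search(rf"{name}\s*:\s*(\w+)", report)
        return match.group(1) if match else ""

    if field("Remapping Failure Occurred") == "Yes":
        # Migrate to a new underlying host with a healthy GPU.
        return "stop and start the instance"
    if field("Pending") == "Yes":
        # Activate the remapped row by retiring the bad memory location.
        return "reboot the instance or reset the GPU"
    return "no action required"

# Illustrative sample with a pending row remap after an uncorrectable error
sample = """\
Remapped Rows
    Correctable Error                 : 0
    Uncorrectable Error               : 1
    Pending                           : Yes
    Remapping Failure Occurred        : No
"""

print(remapped_rows_action(sample))
```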

Detection of unhealthy GPUs

AWS uses automation to regularly perform diagnostics and detect unhealthy GPUs. Any GPUs that are in an unhealthy state because of hardware errors are eventually identified and automatically replaced.

Failure modes

The GPU driver for all generations of NVIDIA GPUs writes errors to the OS system logs as Xid errors. For categorization and descriptions of these errors, see Xid Errors on the NVIDIA website.

The following list of common Xid errors includes best practices to resolve the issues:

Incorrect number of GPUs, or GPUs are missing

Run the following command:

nvidia-smi --list-gpus | wc -l

In the command output, make sure that the number of attached GPUs matches the expected number of GPUs for your instance type. If a GPU is missing, then stop and start the instance.
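This comparison can be automated. The following is a minimal Python sketch that counts the lines of `nvidia-smi --list-gpus` output; the sample output and UUIDs are placeholders, and the expected count of 8 is an example (a p4d.24xlarge instance has 8 GPUs).

```python
def count_gpus(list_gpus_output: str) -> int:
    """Count GPUs from `nvidia-smi --list-gpus` output (one non-empty line per GPU)."""
    return sum(1 for line in list_gpus_output.splitlines() if line.strip())

# Illustrative output; the UUIDs are placeholders, not real identifiers.
sample = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-11111111-1111-1111-1111-111111111111)
"""

expected = 8  # for example, a p4d.24xlarge instance has 8 GPUs
if count_gpus(sample) != expected:
    print("GPU count mismatch: stop and start the instance.")
```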

Xid 48: A DBE has occurred.
Xid 63: A page has successfully been retired.
Xid 64: A page has failed retirement due to an error.

The preceding errors indicate that an ECC error occurred. To resolve this issue, complete the steps in the Incorrect number of GPUs, or GPUs are missing section.

NVRM: Xid 79 (PCI:0000:00:00): GPU has fallen off the bus.

The Xid 79 error indicates that the instance lost communication with the underlying GPU. To resolve this issue, reboot the instance. If the issue persists after reboot, then stop and start your instance.

WARNING: infoROM is corrupted at gpu 0000:00:00.0

The infoROM is corrupted error indicates that a portion of the GPU firmware is corrupted. To resolve this issue, reboot the instance or reset the GPU. If the issue persists after reboot, then stop and start your instance.

NVRM: Xid 119 (PCI:0000:00:00): Timeout waiting for RPC from GSP!
NVRM: Xid 120 (PCI:0000:00:00): GSP Error: Task 1 raised error code ...

The preceding errors occur when the GPU System Processor (GSP) is turned on. Verify that GSP is turned off in the GPU driver or kernel module.
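As a quick triage aid, the Xid codes above can be pulled from kernel log lines and mapped to the first actions this article suggests. The following is a minimal Python sketch; the log lines, PCI addresses, and action strings are illustrative, and the regex handles both common line formats.

```python
import re

# First actions for the Xid codes covered in this article
XID_ACTIONS = {
    48: "ECC double-bit error: verify the GPU count, then stop/start if a GPU is missing",
    63: "page retired: verify the GPU count, then stop/start if a GPU is missing",
    64: "page retirement failed: verify the GPU count, then stop/start if a GPU is missing",
    79: "GPU fell off the bus: reboot; stop/start if the issue persists",
    119: "GSP RPC timeout: turn off GSP",
    120: "GSP error: turn off GSP",
}

def xids_in_log(log_text: str) -> list:
    """Extract Xid codes from lines in either common format:
    'NVRM: Xid 79 (PCI:...)' or 'NVRM: Xid (PCI:...): 79, ...'."""
    return [int(code) for code in
            re.findall(r"Xid\s*(?:\([^)]*\)\s*:?\s*)?(\d+)", log_text)]

# Illustrative log lines
log = """\
NVRM: Xid (PCI:0000:00:1e): 79, pid=1234, GPU has fallen off the bus.
NVRM: Xid 119 (PCI:0000:00:1e): Timeout waiting for RPC from GSP!
"""

for code in xids_in_log(log):
    print(code, "->", XID_ACTIONS.get(code, "see the NVIDIA Xid documentation"))
```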

Best practices

  • When possible, use the latest driver and CUDA runtime. Fixes, improvements, and optimizations are frequently introduced with newer releases of the GPU driver. However, these updates might contain functional changes. Stage and test driver updates on non-production GPU instances first.
  • Similar to common x86 CPUs with turbo boost, GPUs have a core and memory clock speed that dynamically changes depending on load. For the best performance, persistently set the GPU core and memory clock speeds to their maximum speeds. For more information, see Optimize GPU settings.
  • Turn off GSP. On recent instance generations, NVIDIA GPUs include the GSP firmware feature. GSP is designed to offload GPU initialization and other management tasks. For more information, see Turning off GSP firmware on the NVIDIA website.
  • Use the Amazon CloudWatch agent to monitor your GPUs. The CloudWatch agent natively supports NVIDIA GPU metrics that you can collect and monitor from CloudWatch. For more information, see Collect NVIDIA GPU metrics.

Contact AWS Support

Provide your instance ID and the output of the nvidia-smi -q command in your support case.

Also, run the sudo nvidia-bug-report.sh command that's included with the NVIDIA GPU driver. The script captures key logs and other diagnostic information, and creates a compressed log file named nvidia-bug-report.log.gz in your current working directory that you can retrieve and provide to AWS Support.
