How do I troubleshoot Xid errors on my NVIDIA GPU-accelerated EC2 Linux instance?

When I run my application on an NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) Linux instance, my application crashes. Also, I receive GPU-specific Xid errors in the system log. I want to retrieve diagnostic information from the GPU and troubleshoot GPU-related Xid errors.

Resolution

Note: The following resolution applies to G4, G5, and G6 instance types. On all GPU-accelerated EC2 instance families, GPUs are passed through to the guest instance.

Retrieve your nvidia-smi diagnostics

Use the nvidia-smi tool to retrieve statistics and diagnostics about the health and performance of the NVIDIA GPUs that are attached to your instance. The NVIDIA GPU driver provides this tool, and the tool is included with all AWS Deep Learning Amazon Machine Image (DLAMI) options. For information about how to install the NVIDIA GPU driver for any GPU instance family, see Installation options.

To query statistics, run the sudo nvidia-smi -q command.

Memory statistics example:

ECC Errors
        Volatile                              # Errors counted since last GPU driver reload 
            SRAM Correctable            : 0
            SRAM Uncorrectable          : 0
            DRAM Correctable            : 0
            DRAM Uncorrectable          : 0
        Aggregate                             # Errors counted for the life of the GPU
            SRAM Correctable            : 0
            SRAM Uncorrectable          : 0
            DRAM Correctable            : 0
            DRAM Uncorrectable          : 0
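
To pull only these counters instead of the full query output, you can filter the nvidia-smi query. The following is a minimal sketch: the -d ECC display filter and the ecc.errors.* query fields are part of the standard nvidia-smi interface, but the exact field names that are available can vary by driver version.

sudo nvidia-smi -q -d ECC
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv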

All generations of NVIDIA GPUs record aggregate and volatile memory statistics for the GPU hardware. Aggregate ECC error counters persist for the life of the GPU. Because a positive aggregate value might come from a past issue, you must also check the volatile metrics. Volatile ECC error counters start at zero and increment from the last time that the GPU driver was reloaded.

Uncorrected ECC errors increment during the life of the instance. However, you can correct these errors. To reset the counters, reboot the instance or reset the GPU. Depending on the instance type and GPU generation, a reboot initiates page retirement or row remapping for bad memory pages.

P3, P3dn, G4dn instances:

    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending Page Blacklist      : No

Early generations of NVIDIA GPUs use dynamic page retirement. You can ignore single-bit errors because they typically don't cause issues.

If the GPU firmware identifies double-bit errors, then the GPU stops processing and causes the application to abruptly exit. If double-bit errors occur, then the operating system (OS) logs an Xid error and the Pending Page Blacklist status is Yes. To resolve these errors, reboot the instance to retire the bad memory location. After the reboot, the Pending Page Blacklist status resets to No.

Note: The error counters persist for the life of the GPU. A non-zero counter at instance launch doesn't mean that there's an active hardware issue or faulty GPU.
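
To check the page retirement status shown above without running the full query, you can filter the output. The following is a minimal sketch; the PAGE_RETIREMENT display filter and the retired_pages.pending query field are part of the standard nvidia-smi interface, but availability can vary by driver version.

sudo nvidia-smi -q -d PAGE_RETIREMENT
nvidia-smi --query-gpu=index,retired_pages.pending --format=csv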

P4d, P4de, G5, G5g, and G6 instances:

    Remapped Rows
        Correctable Error                 : 0  # Can safely ignore.
        Uncorrectable Error               : 0  # If > 0, review the system logs for Xid errors.
        Pending                           : No # If Yes, an instance reboot or GPU reset is required.
        Remapping Failure Occurred        : No # Should always be No. If Yes, stop and start the instance.

Later instance families with A100 and A10G GPUs isolate and contain memory errors through row remapping, which prevents reuse of known degraded memory locations. Row remapping replaces the page retirement scheme that earlier generation GPUs use.

You can ignore correctable memory errors. Uncorrectable errors might cause errors or abrupt application exits and are logged to the OS system log as Xid errors.
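
To check the row remapping state shown above, you can filter the query on driver versions that support it. The following is a minimal sketch; the ROW_REMAPPER display filter and the remapped_rows.* query fields exist in newer nvidia-smi releases, but the exact names and availability can vary by driver version.

sudo nvidia-smi -q -d ROW_REMAPPER
nvidia-smi --query-gpu=index,remapped_rows.pending,remapped_rows.failure --format=csv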

When an uncorrectable error activates pending remapped rows, you must reset the GPU to retire the bad memory location. Reboot the instance to reset the GPU. Or, run the following command to manually reset the GPU:

sudo nvidia-smi -i GPU_UUID -r

Note: Replace GPU_UUID with your GPU ID.
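
To find the ID to use, you can list the index and UUID of each attached GPU (a minimal sketch that uses standard nvidia-smi query fields):

nvidia-smi --query-gpu=index,uuid --format=csv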

If a remapping failure occurs, then stop and start the instance to migrate the instance to a new underlying host with a healthy GPU.
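
If you manage the instance from the AWS Command Line Interface (AWS CLI), a stop followed by a start moves the instance to a new underlying host. The following is a minimal sketch; replace i-1234567890abcdef0 with your instance ID. A reboot isn't sufficient because the instance must fully stop before it can move to different hardware.

aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0
aws ec2 start-instances --instance-ids i-1234567890abcdef0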

Note: AWS performs regular diagnostics to detect and automatically replace unhealthy GPUs.

Resolve failure modes

The GPU driver for all generations of NVIDIA GPUs writes errors to the OS system logs as Xid errors. For more information about these errors, see Xid errors on the NVIDIA website.
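
To search the system logs for Xid errors, you can check the kernel ring buffer or the systemd journal (a minimal sketch; log locations and tools vary by Linux distribution):

sudo dmesg -T | grep -i xid
sudo journalctl -k | grep -i xid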

Incorrect number of GPUs or GPUs are missing

To view all attached GPUs, run the following command:

nvidia-smi --list-gpus | wc -l

In the command's output, check that the number of attached GPUs matches the expected number of GPUs for your instance type. If a GPU is missing, then stop and start the instance.
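
The following is a minimal sketch of that check, assuming an expected count of 8 GPUs (for example, a p4d.24xlarge). Set EXPECTED to the GPU count for your instance type.

EXPECTED=8   # Hypothetical value: the GPU count for your instance type
FOUND=$(nvidia-smi --list-gpus | wc -l)
if [ "$FOUND" -ne "$EXPECTED" ]; then
    echo "Only $FOUND of $EXPECTED GPUs are visible. Stop and start the instance."
fi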

You can also use the preceding troubleshooting steps to resolve the following example ECC errors:

  • "Xid 48: A DBE has occurred"
  • "Xid 63: A page has successfully been retired"
  • "Xid 64: A page has failed retirement due to an error"

NVRM: Xid 79 (PCI:0000:00:00): GPU has fallen off the bus

The Xid 79 error occurs when the instance loses communication with the underlying GPU. To resolve this issue, reboot the instance. If the issue persists after reboot, then stop and start your instance.

WARNING: infoROM is corrupted at gpu 0000:00:00.0

The infoROM is corrupted error occurs when a part of the GPU firmware is corrupted. To resolve this issue, reboot the instance or reset the GPU. If the issue persists after reboot, then stop and start your instance.

NVRM: Xid 119 (PCI:0000:00:00): Timeout waiting for RPC from GSP

-or-

NVRM: Xid 120 (PCI:0000:00:00): GSP Error: Task 1 raised error code

The preceding errors occur when the GPU System Processor (GSP) is activated. To resolve this issue, deactivate GSP in the GPU driver or kernel module. For instructions on how to deactivate GSP, see 4.2.6. Disabling GSP firmware on the NVIDIA website.
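
The following is a minimal sketch of the approach that the NVIDIA documentation describes: set the NVreg_EnableGpuFirmware=0 parameter for the nvidia kernel module, rebuild the initramfs, and reboot. The file name nvidia-gsp.conf is an arbitrary choice, and the exact steps vary by driver version and distribution, so follow the linked NVIDIA instructions for your environment.

echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
sudo update-initramfs -u    # Debian or Ubuntu; on Amazon Linux or RHEL-based systems, use: sudo dracut --force
sudo reboot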

Avoid future Xid errors

When possible, use the latest driver and CUDA runtime. GPU driver releases frequently introduce fixes, improvements, and optimizations. However, the updates might also contain functional changes. Stage and test driver updates on non-production GPU instances first.

GPUs have core and memory clock speeds that change dynamically depending on load. To improve performance, persistently set the GPU core and memory clock speeds to their maximum values.
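
The following is a minimal sketch of how to do that with nvidia-smi. The clock values 1215 and 1410 are example numbers only; query the supported clocks first and substitute values that your GPU reports.

sudo nvidia-smi -pm 1                 # Enable persistence mode so the clock settings persist
nvidia-smi -q -d SUPPORTED_CLOCKS     # List the memory and graphics clock pairs that your GPU supports
sudo nvidia-smi -ac 1215,1410         # Example values only: application clocks as <memory,graphics> in MHz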

Deactivate GSP. On recent instance generations, NVIDIA GPUs include the GSP firmware feature. For instructions on how to deactivate GSP, see 4.2.6. Disabling GSP firmware on the NVIDIA website.

Also, use the Amazon CloudWatch agent to monitor your GPU metrics.

If you complete the preceding troubleshooting steps and still encounter Xid errors, then open an AWS Support case. Provide your instance ID and the output of the nvidia-smi -q command. Also, run the sudo nvidia-bug-report.sh command that's included with the NVIDIA GPU driver. The nvidia-bug-report.sh script captures key logs and other diagnostic information in your current working directory. Attach the nvidia-bug-report.log.gz compressed log file to your support case.
