Implementing health checks for large-scale AI/ML training
This article provides a systematic approach to diagnosing and resolving performance issues in distributed large language model (LLM) training operations. It focuses on pre-flight checks and comprehensive monitoring solutions for multi-node GPU clusters.
Introduction
LLM training operations have become increasingly complex as enterprises scale their AI and machine learning (AI/ML) initiatives to meet growing business demands. Organizations face significant challenges in managing distributed training environments, including the following:
- Hardware inconsistencies
- Network bottlenecks
- Resource optimization difficulties
- Substantial costs associated with GPU cluster downtime
These challenges increase when you must train large-scale models that require coordinated multi-node operations across hundreds or thousands of GPUs. During these operations, even minor configuration issues can cascade into training failures, wasted compute resources, and delayed model deployment timelines.
Against this backdrop, a global customer with extensive AI/ML operations approached AWS Enterprise Support with performance challenges in their distributed LLM training environment. Their AI/ML team experienced inconsistent results and occasional failures across multi-node GPU clusters. These issues caused significant delays in model development and inefficient resource utilization. The unpredictable nature of these issues affected their ability to maximize their substantial infrastructure investments. To create reliable operations at scale, the organization needed a systematic approach to diagnose pre-training issues and implement effective monitoring for consistent performance across distributed training workloads.
Identifying opportunities for improvement
The AWS Enterprise Support team worked directly with the customer through their Technical Account Manager (TAM). AWS Support assessed the customer's training infrastructure and workflows, and identified several opportunities for improvement:
- Hardware validation procedures
- Elastic Fabric Adapter (EFA) configuration
- Monitoring practices
To address these areas, AWS Support collaborated with the customer's AI/ML engineers to develop a solution. The solution implements three key components to support reliability in large-scale LLM training:
- Pre-flight checks: The team created a systematic validation framework to verify hardware, network, and system readiness before the start of a training job (see the sketch after this list).
- Improved monitoring: The team created a real-time monitoring solution that tracks GPU health and performance during a training job.
- Testing and troubleshooting: The team reviewed the solution to test and troubleshoot common issues.
Implementing the solution
Prerequisites:
- Multi-node GPU cluster environment
- NVIDIA GPU hardware with appropriate drivers
- EFA networking configuration
- Access to NVIDIA tools, such as nvidia-smi and DCGM
- Amazon CloudWatch for monitoring integration
Configuring pre-flight checks
AWS Support helped the customer analyze their training environment to identify critical validation points that would prevent costly failures. The team developed a comprehensive checklist based on the customer's specific infrastructure and training workloads. This checklist helped the customer validate all hardware and network components before initiating resource-intensive training jobs.
Pre-flight tests for LLM workloads represent a critical phase in the model deployment pipeline. These tests serve as a comprehensive validation system that safeguards both performance and reliability, and they are essential for multiple reasons:
- Tests can provide quality assurance and verify that the model performs according to expectations before you deploy a solution.
- Tests implement crucial safety measures to prevent harmful or inappropriate outputs.
- Tests confirm system stability and consistency.
Streamlining the pre-flight check implementation
AWS Support worked with the customer to develop a streamlined deployment process for the pre-flight checks. The team created a reusable script that they could easily integrate into the customer's existing instance launch workflow. This script helped the customer consistently validate their entire training fleet. Then, the team completed the following steps:
- The team saved the Amazon Elastic Compute Cloud (Amazon EC2) initialization script as preflight-checks.sh. Scripts placed in the user data field run automatically during the instance's first boot.
- The team used the AWS Management Console to launch Amazon EC2 instances with the script in the User Data field (see the example user data sketch after these steps).
- The team used log files and notifications to monitor the script execution. After the launch completed, the team used the log file located in the /var/log/ directory to verify pre-flight check results. The team also configured Slack webhooks and email notifications to receive automated check reports.
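As an illustration, a user data entry along the following lines runs the pre-flight script at first boot, writes its output to a log file in /var/log/, and posts a summary to a Slack incoming webhook. The install path, log file name, and webhook URL are placeholders, not the customer's actual configuration.
#!/bin/bash
# EC2 user data (sketch): run the pre-flight checks at first boot and report the result
LOG_FILE=/var/log/preflight-checks.log
/usr/local/bin/preflight-checks.sh > "$LOG_FILE" 2>&1    # assumed install path
RESULT=$?
# Post a one-line summary to a Slack incoming webhook (placeholder URL)
curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"Pre-flight checks on $(hostname) exited with status $RESULT\"}" \
    https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK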
Reviewing the customer requirements
To implement the pre-flight checks, the team reviewed the customer's resources and infrastructure requirements. They identified the following critical validation components:
Hardware validation
Purpose: Verify GPU availability and performance.
- Confirms GPU detection and specifications
- Monitors temperature and power consumption
- Validates GPU count and memory allocation
# Basic GPU status
nvidia-smi
# Check GPU temperature and power
nvidia-smi -q -d TEMPERATURE,POWER
For NVIDIA GPU monitoring and management capabilities, see System Management Interface (SMI) and Data Center GPU Manager on the NVIDIA website.
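As an illustration, the same nvidia-smi output can be parsed to turn these checks into pass/fail results. The 85 °C threshold below is an assumed value and should be set to match the GPU model in use.
# Flag any GPU that reports a temperature above an assumed 85 C threshold
MAX_TEMP=85
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | \
while IFS=', ' read -r GPU_INDEX GPU_TEMP; do
    if [ "$GPU_TEMP" -gt "$MAX_TEMP" ]; then
        echo "FAIL: GPU $GPU_INDEX is at ${GPU_TEMP}C"
    fi
done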
Network fabric testing
Purpose: Ensure high-speed interconnect functionality.
- Tests EFA connectivity
- Validates libfabric providers for distributed computing
- Confirms network device availability
# List the available libfabric providers
fi_info
# Confirm that the EFA provider is present
fi_info -p efa
For technical resources and implementation examples, see EFA for AI/ML and HPC workloads on Amazon EC2 and Get started with EFA and MPI for HPC workloads on Amazon EC2.
Multi-GPU communication
Purpose: Verify GPU-to-GPU communication pathways.
- Tests NCCL (NVIDIA Collective Communications Library) performance
- Measures bandwidth and latency between GPUs
- Validates collective operations for distributed training
# Measure all-reduce bandwidth and latency across 4 GPUs
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# Measure all-gather bandwidth and latency across 8 GPUs
nccl-tests/build/all_gather_perf -b 8 -e 1G -f 2 -g 8
For comprehensive NCCL implementation and performance validation, see nccl-tests on the GitHub website.
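The test results can also be checked automatically. The following sketch parses the average bus bandwidth summary line printed by all_reduce_perf and compares it against an assumed minimum; the 100 GB/s value is illustrative and depends on the instance type and GPU count.
# Run all_reduce_perf and flag the node if average bus bandwidth falls below an assumed minimum
MIN_BUSBW=100
AVG_BUSBW=$(nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 | \
    awk '/Avg bus bandwidth/ {print int($NF)}')
if [ "${AVG_BUSBW:-0}" -lt "$MIN_BUSBW" ]; then
    echo "FAIL: average bus bandwidth ${AVG_BUSBW} GB/s is below ${MIN_BUSBW} GB/s"
fi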
Monitoring system resources
To confirm adequate system resources, the team enabled the following system resource checks (a combined scripted version of these checks appears after the commands below):
- Checks available memory and disk space
- Monitors CPU utilization and I/O performance
# Check system memory
free -h
# Check disk space
df -h
# Check CPU usage (or use htop for an interactive view)
top
# Check I/O status
iostat -x
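These spot checks can be combined into a single scripted gate, as in the following sketch; the 100 GB disk and 32 GB memory minimums are assumptions that should be tuned to the workload.
# Fail if free disk space or available memory falls below assumed minimums
MIN_DISK_GB=100
MIN_MEM_GB=32
FREE_DISK_GB=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
AVAIL_MEM_GB=$(free -g | awk '/^Mem:/ {print $7}')
[ "$FREE_DISK_GB" -lt "$MIN_DISK_GB" ] && echo "FAIL: only ${FREE_DISK_GB}G free on /"
[ "$AVAIL_MEM_GB" -lt "$MIN_MEM_GB" ] && echo "FAIL: only ${AVAIL_MEM_GB}G of memory available"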
Implementing the monitoring solution
To implement a monitoring solution that would improve the LLM training operations for the customer, the team had to determine which metrics to monitor. Based on the customer's use cases, the team decided to monitor the following metrics (a sample collection loop follows the list):
- GPU utilization rates
- Memory consumption patterns
- Temperature thresholds
- Power usage metrics
- PCIe bandwidth utilization
- Inter-GPU communication efficiency
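As a simple example of capturing several of these metrics on a schedule, a loop such as the following appends one CSV row per GPU every 30 seconds; the interval and output path are illustrative choices.
# Append per-GPU utilization, memory, temperature, and power readings to a CSV every 30 seconds
OUT=/var/log/gpu-metrics.csv
while true; do
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu,power.draw \
        --format=csv,noheader >> "$OUT"
    sleep 30
done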
Implementing monitoring tools
To implement the monitoring tools, the team ran the following commands:
# Continuous monitoring
watch -n 1 nvidia-smi
dcgmi dmon
# Display specific metrics
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv
# Run diagnostic tests
dcgmi diag -r 3
dcgmi diag -r 3 -p diagnostic.test_duration=300.0
dcgmi diag -r 3 -p diagnostic.temperature_max=60.0
Integrating CloudWatch
To monitor the solution, the team used the following AWS resources to integrate with CloudWatch:
- CloudWatch Agent Configuration: Collect custom GPU metrics from Linux instances. See Collect metrics, logs, and traces using the CloudWatch Agent for more information.
- SageMaker CloudWatch Integration: Built-in metrics for distributed training workloads. See Amazon SageMaker AI metrics in CloudWatch for available metrics.
- Additional Resources: For information on configuration and containerized workloads, see AWS Systems Manager Parameter Store and Container Insights.
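As a lightweight complement to the agent, a custom metric can also be published directly with the AWS CLI, as in the following sketch; the namespace and dimension names are illustrative assumptions.
# Publish the current utilization of GPU 0 as a custom CloudWatch metric (namespace is a placeholder)
GPU_UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n 1)
aws cloudwatch put-metric-data \
    --namespace "LLMTraining/GPU" \
    --metric-name GPUUtilization \
    --dimensions Host=$(hostname) \
    --value "$GPU_UTIL" \
    --unit Percent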
Testing and troubleshooting the solution
After the team launched the instance, AWS Support worked with the customer to verify pre-flight check results. To review the results, the team accessed the log file located in the /var/log/ directory. To receive automated comprehensive check reports, AWS Support helped the customer configure Slack webhooks and email notifications.
The team helped the customer monitor instance performance through the CloudWatch agent. This agent tracks important metrics such as CPU usage, disk utilization, NVIDIA GPU utilization through nvidia-smi, and memory consumption.
Common errors and diagnosis
AWS Support worked with the customer to review common errors and provide instructions on how to resolve them. For this customer, the errors fell into three categories:
NCCL errors
NCCL errors typically occur when there are communication issues between GPUs in a distributed training environment. Causes can include network configuration problems, EFA setup issues, or incompatible NCCL versions across nodes.
# Debugging commands
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
To resolve NCCL errors, see Troubleshooting on the NVIDIA website. This guide provides comprehensive debugging procedures, including EFA-specific configurations.
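For example, the variables can be scoped to a single training run and the output captured for review; train.py and the process count below are hypothetical placeholders.
# Capture verbose NCCL logs for one training run (train.py is a placeholder entry point)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
    torchrun --nproc_per_node=8 train.py 2>&1 | tee nccl-debug.log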
XID errors
XID errors are NVIDIA GPU driver-related errors that indicate hardware or driver issues. Common causes include GPU overheating, driver incompatibilities, or hardware failures.
Common XID error codes:
- XID 13: Graphics Engine Exception
- XID 31: GPU memory page fault
- XID 32: Invalid or corrupted push buffer stream
- XID 43: GPU stopped processing
Error messages are stored in /var/log/messages.
The following is an example of an Xid error message:
[...] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c56781461
[...] NVRM: Xid (0000:03:00): 14, Channel 00000001
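To scan a node for these entries as part of routine health checks, commands such as the following can be used; the log location varies by distribution, so both the kernel ring buffer and /var/log/messages are checked.
# Search the kernel ring buffer and the system log for recent Xid events
sudo dmesg -T | grep -i "xid"
sudo grep -i "NVRM: Xid" /var/log/messages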
For comprehensive diagnostics, use the nvidia-bug-report.sh tool.
For resolution strategies, review the following resources:
- XID errors on the NVIDIA website
- How to submit a bug report on the NVIDIA website
ECC errors
ECC (Error Correction Code) errors indicate GPU memory issues that can affect training stability and results:
- Single-bit errors: Correctable, but might indicate early memory degradation.
- Double-bit errors: Uncorrectable and require immediate intervention.
For information on how to manage ECC memory issues, see Dynamic page retirement on the NVIDIA website.
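The current ECC counters can be inspected directly with nvidia-smi before deciding how to respond; the query field below is one of the names listed by nvidia-smi --help-query-gpu.
# Show detailed ECC status and error counters
nvidia-smi -q -d ECC
# Show aggregate uncorrectable ECC error counts per GPU
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.aggregate.total --format=csv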
Cleanup
Proper cleanup after you implement and test these solutions is crucial for both cost management and system performance. AWS Support emphasized to the customer that terminating unnecessary instances, diagnostic tools, and monitoring processes prevents unexpected charges and resource contention.
Keeping diagnostic tools running can consume significant GPU memory and processing power, potentially affecting future training jobs. Additionally, accumulated logs and monitoring data can grow to substantial sizes if not properly managed, and affect storage costs and system performance. The team recommended implementing automated cleanup procedures as part of the customer's operational workflow to maintain an efficient training environment.
If you've deployed CloudWatch agents or custom monitoring solutions, then review your configurations to optimize cost and performance. Consider setting up log rotation policies and data retention rules to manage storage costs while maintaining access to important historical data.
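For example, the following commands stop the CloudWatch agent on an instance that is being wound down and apply a retention policy to a training log group; the log group name and retention period are placeholders.
# Stop the CloudWatch agent on an instance that no longer needs monitoring
sudo systemctl stop amazon-cloudwatch-agent
# Apply a 30-day retention policy to a training log group (placeholder name)
aws logs put-retention-policy \
    --log-group-name "/llm-training/preflight-checks" \
    --retention-in-days 30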
Conclusion
Implementing rigorous node monitoring and testing protocols stands as a cornerstone of maintaining high-performance training infrastructure. This proactive approach to system health and performance management allows organizations to detect and resolve potential issues before they impact production workloads.
To get the most out of your AWS environment, contact AWS Cloud Support Engineers and TAMs. They can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support. To learn more about the suggested solution, contact your TAM or AWS account team.
About the authors
Alma Mohapatra
Alma Mohapatra is an Enterprise Support Manager supporting strategic AI/ML customers running on high performance computing (HPC) environments. She leads specialized technical teams that help organizations optimize their large-scale ML workloads across distributed GPU clusters. With extensive experience in enterprise ML operations, Alma guides customers through complex performance challenges and infrastructure optimization for training LLMs. She excels at translating technical requirements into practical solutions that enhance reliability and efficiency in production AI environments. Alma is known for her collaborative approach to customer success, and works closely with TAMs to make sure that AI/ML initiatives meet critical business objectives.
Sid Sharma
Sid Sharma is a Senior TAM at AWS who specializes in HPC for the HPC training platform team. He helps enterprise customers optimize ML training pipelines and maximize GPU cluster performance. With expertise in ML and distributed systems, Sid serves as a trusted advisor for organizations that are scaling their AI workloads and implementing best practices for large-scale ML infrastructure.