Implementing health checks for large-scale AI/ML training

10 minute read
Content level: Advanced

This article provides a systematic approach to diagnosing and resolving performance issues in distributed large language model (LLM) training operations, with a focus on pre-flight checks and comprehensive monitoring solutions for multi-node GPU clusters.

Introduction

LLM training operations have become increasingly complex as enterprises scale their artificial intelligence and machine learning (AI/ML) initiatives to meet growing business demands. Organizations face significant challenges in managing distributed training environments, including the following:

  • Hardware inconsistencies

  • Network bottlenecks

  • Resource optimization difficulties

  • Substantial costs associated with GPU cluster downtime

These challenges increase when you must train large-scale models that require coordinated multi-node operations across hundreds or thousands of GPUs. During these operations, even minor configuration issues can cascade into training failures, wasted compute resources, and delayed model deployment timelines.

Against this backdrop, a global customer with extensive AI/ML operations approached AWS Enterprise Support with performance challenges in their distributed LLM training environment. Their AI/ML team experienced inconsistent results and occasional failures across multi-node GPU clusters. These issues caused significant delays in model development and inefficient resource utilization. The unpredictable nature of these issues affected their ability to maximize their substantial infrastructure investments. To create reliable operations at scale, the organization needed a systematic approach to diagnose pre-training issues and implement effective monitoring for consistent performance across distributed training workloads.

Identifying opportunities for improvement

The AWS Enterprise Support team worked directly with the customer through their Technical Account Manager (TAM). AWS Support assessed the customer's training infrastructure and workflows, and identified several opportunities for improvement:

  • Hardware validation procedures

  • Elastic Fabric Adapter (EFA) configuration

  • Monitoring practices

To address these opportunities, AWS Support collaborated with the customer's AI/ML engineers to develop a solution. The solution implements three key components to support reliability in large-scale LLM training:

  • Pre-flight checks: The team created a systematic validation framework to verify hardware, network, and system readiness before the start of a training job.

  • Improved monitoring: The team created a real-time monitoring solution that tracks GPU health and performance during a training job.

  • Testing and troubleshooting: The team reviewed the solution to test and troubleshoot common issues.

Implementing the solution

Prerequisites:

  • Multi-node GPU cluster environment

  • NVIDIA GPU hardware with appropriate drivers

  • EFA networking configuration

  • Access to NVIDIA tools, such as nvidia-smi and DCGM

  • Amazon CloudWatch for monitoring integration

Configuring pre-flight checks

AWS Support helped the customer analyze their training environment to identify critical validation points that would prevent costly failures. The team developed a comprehensive checklist based on the customer's specific infrastructure and training workloads. This checklist made sure that the customer properly validates all hardware and network components before they initiate resource-intensive training jobs.

Pre-flight tests for LLMs represent a critical phase in the model training and deployment pipeline. These tests serve as a comprehensive validation system that safeguards both performance and reliability, and they are essential for multiple reasons:

  • Tests can provide quality assurance and verify that the model performs according to expectations before you deploy a solution.

  • Tests implement crucial safety measures to prevent harmful or inappropriate outputs.

  • Tests confirm system stability and consistency.

Streamlining the pre-flight check implementation

AWS Support worked with the customer to develop a streamlined deployment process for the pre-flight checks. The team created a reusable script that they could easily integrate into the customer's existing instance launch workflow. This script helped the customer consistently validate their entire training fleet. Then, the team completed the following steps:

  1. The team saved the Amazon Elastic Compute Cloud (Amazon EC2) initialization script as preflight-checks.sh. Scripts placed in the user data field run automatically during the instance's first boot. (A minimal sketch of such a script follows this procedure.)

  2. The team used the AWS Management Console to launch Amazon EC2 instances with the script in the User Data field.

  3. The team used log files and notifications to monitor the script execution. After the launch completed, the team used the log file located in the /var/log/ directory to verify pre-flight check results.

The team also configured Slack webhooks and email notifications to receive automated check reports.
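
The following is a minimal sketch of what a user data pre-flight script of this kind can look like, assembled from the checks described in this article. The log file name, failure messages, and Slack webhook step are illustrative assumptions rather than the customer's actual implementation.

#!/bin/bash
# Illustrative pre-flight check sketch; not the customer's production script.
# Writes results to /var/log/ so that they can be reviewed after first boot.
LOG=/var/log/preflight-checks.log

{
  echo "=== GPU status ==="
  nvidia-smi || echo "FAIL: nvidia-smi not available"

  echo "=== GPU temperature and power ==="
  nvidia-smi -q -d TEMPERATURE,POWER

  echo "=== EFA / libfabric providers ==="
  fi_info -p efa || echo "FAIL: EFA provider not found"

  echo "=== System resources ==="
  free -h
  df -h
} >> "$LOG" 2>&1

# Optional notification step; SLACK_WEBHOOK_URL is a placeholder
# curl -X POST -H 'Content-type: application/json' \
#   --data "{\"text\":\"Pre-flight checks completed on $(hostname)\"}" "$SLACK_WEBHOOK_URL"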

Reviewing the customer requirements

To implement the pre-flight checks, the team reviewed the customer's resources and infrastructure requirements. They identified the following critical validation components:

Hardware validation

Purpose: Verify GPU availability and performance

  • Confirms GPU detection and specifications

  • Monitors temperature and power consumption

  • Validates GPU count and memory allocation

# Basic GPU status
nvidia-smi
# Check GPU temperature and power
nvidia-smi -q -d TEMPERATURE,POWER
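
To cover the GPU count and memory allocation checks listed above, queries such as the following can be added; the expected count of eight GPUs is only an example for common 8-GPU training instances.

# Count detected GPUs and compare against the expected value (8 is an example)
nvidia-smi -L | wc -l
# Query per-GPU total and used memory
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv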

For NVIDIA GPU monitoring and management capabilities, see System Management Interface (SMI) and Data Center GPU Manager (DCGM) on the NVIDIA website.

Network fabric testing

Purpose: Ensure high-speed interconnect functionality.

  • Tests EFA connectivity

  • Validates libfabric providers for distributed computing

  • Confirms network device availability

# Check available providers
fi_info
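
To check specifically for the EFA provider, standard libfabric queries such as the following can be added to this step:

# Confirm that the EFA provider is available
fi_info -p efa
# List EFA endpoints of the reliable datagram (RDM) type
fi_info -p efa -t FI_EP_RDM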

For technical resources and implementation examples, see EFA for AI/ML and HPC workloads on Amazon EC2 and Get started with EFA and MPI for HPC workloads on Amazon EC2.

Multi-GPU communication

Purpose: Verify GPU-to-GPU communication pathways.

  • Tests NCCL (NVIDIA Collective Communications Library) performance

  • Measures bandwidth and latency between GPUs

  • Validates collective operations for distributed training

# Prebuilt NCCL test binaries (location varies by environment)
ALL_REDUCE=/usr/local/cuda/efa/test-/all_reduce_perf
ALL_GATHER=/usr/local/cuda/efa/test-/all_gather_perf
# Check bandwidth and latency with all_reduce_perf (message sizes 8 B to 128 MB, 4 GPUs)
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# Check bandwidth and latency with all_gather_perf (message sizes 8 B to 1 GB, 8 GPUs)
nccl-tests/build/all_gather_perf -b 8 -e 1G -f 2 -g 8

For comprehensive NCCL implementation and performance validation, see nccl-tests on the GitHub website.

Monitoring system resources

To confirm adequate system resources, the team enabled the following system resource checks in the customer's environment:

  • Checks available memory and disk space

  • Monitors CPU utilization and I/O performance

# Check system memory
free -h
# Check disk space
df -h
# Check CPU usage
top    # or htop
# Check I/O status
iostat -x

Implementing the monitoring solution

To implement a monitoring solution that would improve the customer's LLM training operations, the team first had to determine which metrics to monitor. Based on the customer's use cases, the team decided to track the following metrics (an example query that covers several of them follows the list):

  • GPU utilization rates

  • Memory consumption patterns

  • Temperature thresholds

  • Power usage metrics

  • PCIe bandwidth utilization

  • Inter-GPU communication efficiency
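
As a simple way to capture most of these metrics from the command line, a looping nvidia-smi query along the following lines can be used; the field list and interval are examples, and PCIe bandwidth and inter-GPU communication efficiency need dedicated tools such as DCGM and nccl-tests.

# Poll core GPU metrics every 5 seconds (field list and interval are examples)
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current \
  --format=csv -l 5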

Implementing monitoring tools

To implement the monitoring tools, the team ran the following commands:

# Continuous monitoring
watch -n 1 nvidia-smi
dcgmi dmon
# Display specific metrics
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv
# Running diagnostic tests
dcgmi diag -r 3
dcgmi diag -r 3 -p diagnostic.test_duration=300.0
dcgmi diag -r 3 -p diagnostic.temperature_max=60.0

Integrating CloudWatch

To monitor the solution, the team integrated the training instances with CloudWatch through the Amazon CloudWatch agent, which publishes GPU and system metrics from each node so that the whole fleet can be observed in one place.
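
The following is a minimal sketch of a CloudWatch agent configuration with NVIDIA GPU metrics enabled, assuming the unified CloudWatch agent is already installed on each node. The namespace, measurement list, and collection interval are example values, not the customer's actual settings.

# Write an example agent configuration (namespace and measurements are illustrative)
cat <<'EOF' | sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
  "metrics": {
    "namespace": "LLMTraining/GPUHealth",
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": ["utilization_gpu", "memory_used", "temperature_gpu", "power_draw"],
        "metrics_collection_interval": 60
      },
      "mem": { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}
EOF
# Load the configuration and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json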

Testing and troubleshooting the solution

After the team launched the instance, AWS Support worked with the customer to verify pre-flight check results. To review the results, the team accessed the log file located in the /var/log/ directory. To receive automated comprehensive check reports, AWS Support helped the customer configure Slack webhooks and email notifications.

The team helped the customer monitor instance performance through the CloudWatch agent. This agent tracks important metrics such as CPU usage, disk utilization, NVIDIA GPU utilization through nvidia-smi, and memory consumption.

Common errors and diagnosis

AWS Support worked with the customer to review common errors and provide guidance on how to resolve them. For this customer, the errors fell into three categories:

NCCL errors

NCCL errors typically occur when there are communication issues between GPUs in a distributed training environment. Causes can include network configuration problems, EFA setup issues, or incompatible NCCL versions across nodes.

# Print NCCL initialization, topology, and transport selection details
export NCCL_DEBUG=INFO
# Include verbose output from all NCCL subsystems
export NCCL_DEBUG_SUBSYS=ALL

To resolve NCCL errors, see Troubleshooting on the NVIDIA website. This guide provides comprehensive debugging procedures, including EFA-specific configurations.

XID errors

XID errors are NVIDIA GPU driver-related errors that indicate hardware or driver issues. Common causes include GPU overheating, driver incompatibilities, or hardware failures.

Common XID error codes:

  • XID 13: Graphics engine exception

  • XID 31: GPU memory page fault

  • XID 32: Invalid or corrupted push buffer stream

  • XID 43: GPU stopped processing

Error messages are stored in /var/log/messages.

The following is an example of an Xid error message:

[...] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c56781461
[...] NVRM: Xid (0000:03:00): 14, Channel 00000001

For comprehensive diagnostics, use the nvidia-bug-report.sh tool.
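
To locate Xid events quickly, log searches such as the following can be used; the log path varies by distribution, so treat it as an assumption.

# Search the system log for Xid events (path varies by distribution)
sudo grep -i "xid" /var/log/messages
# The kernel ring buffer also records Xid events
sudo dmesg -T | grep -i "xid"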

For resolution strategies, see Xid Errors on the NVIDIA website.

ECC errors

ECC (Error Correction Code) errors indicate GPU memory issues that can affect training stability and results:

  • Single-bit errors: Correctable, but might indicate early memory degradation.

  • Double-bit errors: Uncorrectable, require immediate intervention.
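
To inspect a GPU's ECC counters before deciding on an intervention, standard nvidia-smi queries such as the following can be used:

# Show volatile and aggregate ECC error counts
nvidia-smi -q -d ECC
# Show memory pages retired because of ECC errors
nvidia-smi -q -d PAGE_RETIREMENT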

For information on how to manage ECC memory issues, see Dynamic page retirement on the NVIDIA website.

Cleanup

Proper cleanup after you implement and test these solutions is crucial for both cost management and system performance. AWS Support emphasized to the customer that terminating unnecessary instances, diagnostic tools, and monitoring processes prevents unexpected charges and resource contention.

Keeping diagnostic tools running can consume significant GPU memory and processing power, potentially affecting future training jobs. Additionally, accumulated logs and monitoring data can grow to substantial sizes if not properly managed, and affect storage costs and system performance. The team recommended implementing automated cleanup procedures as part of the customer's operational workflow to maintain an efficient training environment.

If you've deployed CloudWatch agents or custom monitoring solutions, then review your configurations to optimize cost and performance. Consider setting up log rotation policies and data retention rules to manage storage costs while maintaining access to important historical data.
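
A minimal sketch of such a cleanup routine is shown below; the log file pattern, retention window, and instance ID are placeholders, and the actual steps depend on how the monitoring and training fleet were deployed.

# Stop the CloudWatch agent on nodes that are being wound down
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a stop
# Compress pre-flight and diagnostic logs older than 7 days (file pattern is an example)
sudo find /var/log -name "preflight-checks*.log" -mtime +7 -exec gzip {} \;
# Terminate idle training instances (instance ID is a placeholder)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0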

Conclusion

Implementing rigorous node monitoring and testing protocols stands as a cornerstone of maintaining high-performance training infrastructure. This proactive approach to system health and performance management allows organizations to detect and resolve potential issues before they impact production workloads.

To get the most out of your AWS environment, contact AWS Cloud Support Engineers and TAMs. They can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support. To learn more about the suggested solution, contact your TAM or AWS account team.

About the authors

Enter image description here

Alma Mohapatra

Alma Mohapatra is an Enterprise Support Manager supporting strategic AI/ML customers running on high performance computing (HPC) environments. She leads specialized technical teams that help organizations optimize their large-scale ML workloads across distributed GPU clusters. With extensive experience in enterprise ML operations, Alma guides customers through complex performance challenges and infrastructure optimization for training LLMs. She excels at translating technical requirements into practical solutions that enhance reliability and efficiency in production AI environments. Alma is known for her collaborative approach to customer success, and works closely with TAMs to make sure that AI/ML initiatives meet critical business objectives.

Enter image description here

Sid Sharma

Sid Sharma is a Senior TAM at AWS who specializes in HPC for the HPC training platform team. He helps enterprise customers optimize ML training pipelines and maximize GPU cluster performance. With expertise in ML and distributed systems, Sid serves as a trusted advisor for organizations that are scaling their AI workloads and implementing best practices for large-scale ML infrastructure.