This article helps AWS customers understand when to use standard Amazon SageMaker training jobs versus Amazon SageMaker HyperPod for their machine learning workloads. As organizations scale their ML operations, making the right choice between these options can significantly impact cost, efficiency, and team productivity.
Introduction
As machine learning workloads grow in complexity and scale, choosing the right training infrastructure becomes crucial. This article explores the key differences between Amazon SageMaker training jobs and Amazon SageMaker HyperPod.
Amazon SageMaker Training Jobs
Amazon SageMaker training jobs are managed, on-demand training tasks that give you a serverless experience: SageMaker provisions the compute when a job starts and cleans it up when the job finishes, so there is no cluster for you to maintain.
Common Use Cases for Standard Amazon SageMaker Training Jobs:
- Training supervised learning models (classification/regression) for tasks like customer churn prediction, fraud detection, and price forecasting.
- Training deep learning models for computer vision, NLP, and recommendation systems that fit in single-instance memory and don't require persistent infrastructure.
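Here's a typical high-level implementation:
import sagemaker
from sagemaker.pytorch import PyTorch

# IAM role that SageMaker assumes to run the job (resolves inside SageMaker notebooks)
role = sagemaker.get_execution_role()

estimator = PyTorch(
    entry_point='train.py',          # training script
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.8.0',
    py_version='py3',                # Python version for the framework container
    hyperparameters={
        'epochs': 10,
        'batch-size': 64
    }
)

# Launch the training job; compute is provisioned for the job and released afterwards
estimator.fit()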
Key Benefits:
- Simple setup and execution
- Pay-per-use pricing model
- Ideal for periodic training needs
- Lower operational overhead
Amazon SageMaker HyperPod
Amazon SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates FM development by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators, such as those in the AWS Trainium instance family.
HyperPod takes a persistent-cluster approach to ML training. A cluster is defined as one or more instance groups; the following example defines a single group (values prefixed with $ are placeholders):
{
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.g5.12xlarge",
    "InstanceCount": 2,
    "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ],
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://$Lifecycle_Bucket/src",
        "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "$Sagemaker_Execution_Role_ARN",
    "ThreadsPerCore": 1
}
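An instance group definition like the one above can also be assembled and submitted from Python. The following is a minimal sketch, not a complete provisioning script: it assumes the boto3 SageMaker client's `create_cluster` call with only `ClusterName` and `InstanceGroups`, and the helper names, role ARN, bucket name, and cluster name are all hypothetical placeholders. A real cluster typically also needs networking settings and lifecycle scripts already uploaded to the S3 location.

```python
import json


def build_instance_group(execution_role_arn: str, lifecycle_bucket: str) -> dict:
    """Build an instance group with the same fields as the JSON example."""
    return {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.g5.12xlarge",
        "InstanceCount": 2,
        "InstanceStorageConfigs": [
            {"EbsVolumeConfig": {"VolumeSizeInGB": 500}}
        ],
        "LifeCycleConfig": {
            # Lifecycle scripts are read from this S3 prefix when nodes are created.
            "SourceS3Uri": f"s3://{lifecycle_bucket}/src",
            "OnCreate": "on_create.sh",
        },
        "ExecutionRole": execution_role_arn,
        "ThreadsPerCore": 1,
    }


def create_hyperpod_cluster(cluster_name: str, instance_groups: list) -> str:
    """Submit the cluster definition and return the new cluster's ARN."""
    # Imported lazily so the builder above works without boto3 or AWS credentials.
    import boto3

    client = boto3.client("sagemaker")
    response = client.create_cluster(
        ClusterName=cluster_name,
        InstanceGroups=instance_groups,
    )
    return response["ClusterArn"]


if __name__ == "__main__":
    # Placeholder values (assumptions); replace with your own role and bucket.
    group = build_instance_group(
        "arn:aws:iam::111122223333:role/ExampleHyperPodRole",
        "example-lifecycle-bucket",
    )
    print(json.dumps(group, indent=2))
    # Uncomment to actually create the cluster (requires AWS credentials):
    # print(create_hyperpod_cluster("example-cluster", [group]))
```

Keeping the request builder separate from the API call makes it possible to review or unit test the definition before any billable resources are created.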
Common Use Cases for Amazon SageMaker HyperPod:
- Training and fine-tuning Large Language Models (LLMs) and foundation models that require significant computational resources
- Production-scale distributed training for enterprise-level deep learning workloads requiring persistent infrastructure
- Long-running research and experimentation projects with complex hyperparameter optimization needs and continuous model improvements
Key Benefits:
- Persistent cluster infrastructure
- Optimized for continuous workloads
- Workload orchestration using SLURM or Amazon EKS
- Advanced resource management
- Better cost efficiency at scale
Comparison Table
| Feature | SageMaker Training Jobs | SageMaker HyperPod |
|---|---|---|
| Infrastructure Type | Ephemeral (Serverless) | Persistent Clusters |
| Best For | Periodic training, smaller models | Large models, continuous training |
| Cost Model | Pay-per-use | Reserved capacity or on-demand pricing |
| Setup Time | Minutes | Hours (but persists) |
| Checkpointing | Basic | Advanced with auto-recovery |
| Scale | Single to few instances | Up to hundreds of instances |
| Use Cases | Traditional ML, small-medium DL | LLMs, Foundation Models |
| Resource Management | Automatic provisioning/cleanup | Managed persistent clusters |
Making the Right Choice
Choose Standard Training Jobs when you:
- Run periodic training workloads
- Need pay-per-use pricing
- Operate with a smaller team
- Require simple setup
- Work with a limited budget
- Perform development and testing
Choose HyperPod when you:
- Train large language models
- Need persistent infrastructure
- Run continuous training workloads
- Require distributed training
- Work with foundation models
- Need advanced checkpointing
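The two checklists above can be expressed as a small decision helper. This is an illustrative sketch only: the `Workload` class and `recommend` function are hypothetical names, and the rule that any single HyperPod criterion is decisive is a simplifying assumption rather than official guidance. Real decisions usually weigh several factors together.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Each flag corresponds to one "Choose HyperPod" criterion from the list above.
    trains_llm_or_foundation_model: bool = False
    needs_persistent_infrastructure: bool = False
    runs_continuous_training: bool = False
    requires_distributed_training: bool = False
    needs_advanced_checkpointing: bool = False


def recommend(workload: Workload) -> str:
    """Suggest an option, assuming any single HyperPod criterion is decisive."""
    hyperpod_signals = [
        workload.trains_llm_or_foundation_model,
        workload.needs_persistent_infrastructure,
        workload.runs_continuous_training,
        workload.requires_distributed_training,
        workload.needs_advanced_checkpointing,
    ]
    return "SageMaker HyperPod" if any(hyperpod_signals) else "SageMaker training jobs"


# A periodic, small-scale workload with no HyperPod criteria set.
print(recommend(Workload()))
# A workload that trains continuously.
print(recommend(Workload(runs_continuous_training=True)))
```

The first call prints "SageMaker training jobs" and the second prints "SageMaker HyperPod".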
Cost Considerations
Standard Training Jobs
- Pay only for actual training time
- No minimum commitment
- Higher per-hour rates
- Includes infrastructure management
HyperPod
- Reserved capacity or on-demand pricing
- Additional storage costs for persistence
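The trade-off between paying only for training time and paying for a persistent cluster can be made concrete with simple arithmetic. The sketch below uses made-up hourly rates and storage costs purely for illustration; they are not AWS prices, and the function names are hypothetical. It estimates how many training hours per billing period a workload needs before a persistent cluster costs the same as running individual jobs.

```python
def training_job_cost(job_rate: float, instances: int, training_hours: float) -> float:
    """Pay-per-use: billed only for the hours a job actually runs."""
    return job_rate * instances * training_hours


def persistent_cluster_cost(cluster_rate: float, instances: int,
                            cluster_hours: float, storage_cost: float = 0.0) -> float:
    """Persistent cluster: billed for every hour it exists, plus storage."""
    return cluster_rate * instances * cluster_hours + storage_cost


def break_even_training_hours(job_rate: float, cluster_rate: float, instances: int,
                              cluster_hours: float, storage_cost: float = 0.0) -> float:
    """Training hours at which both options cost the same for the period."""
    cluster_total = persistent_cluster_cost(cluster_rate, instances,
                                            cluster_hours, storage_cost)
    return cluster_total / (job_rate * instances)


# Hypothetical inputs (assumptions, not real pricing):
# 2 instances, a 720-hour month, 10.0 per instance-hour for jobs,
# 8.0 per instance-hour for the cluster, 80.0 for persistent storage.
hours = break_even_training_hours(job_rate=10.0, cluster_rate=8.0, instances=2,
                                  cluster_hours=720, storage_cost=80.0)
print(f"Break-even at {hours:.0f} training hours per month")
print(f"Cost of 100 job hours: {training_job_cost(10.0, 2, 100):.2f}")
```

With these made-up numbers the break-even point is 580 training hours in a 720-hour month, which illustrates why persistent clusters tend to suit continuous workloads while periodic training favors pay-per-use jobs.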
Conclusion
Choosing between Amazon SageMaker Training Jobs and HyperPod depends primarily on your specific ML training and operational needs. Standard Training Jobs are ideal for smaller models and periodic training with their serverless, pay-as-you-go approach. HyperPod, on the other hand, excels at training large-scale models like LLMs with its persistent infrastructure and advanced features.
Consider these key factors when deciding:
- Model size and complexity
- Training frequency and duration
- Budget constraints
- Team requirements
- Infrastructure persistence needs
By carefully evaluating these aspects, you can select the most cost-effective and efficient training infrastructure for your machine learning workflows.
Article co-authors:
Sashank Bulusu