I'll address the memory errors you're seeing while training a generative model on a p4d.24xlarge instance.

First, consuming all 320 GB of GPU memory with a 4 GB batch is unusual and suggests a problem in the model structure or the data-processing pipeline rather than the batch size itself.

Regarding the pre-training setup, make sure the NVIDIA drivers and CUDA are installed correctly. The output you shared shows that the nvidia-smi command isn't working, which points to a driver installation problem.

Training can normally be stopped with Ctrl+C. If GPU memory isn't released afterwards, fully exit the Python interpreter or restart the instance; see the sketch below for a quick way to check GPU visibility and free cached memory.
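A minimal sketch, assuming a PyTorch-based training script, for verifying that the GPUs are visible and for releasing cached GPU memory after interrupting a run (the `model` and `optimizer` names are placeholders for your own objects):

```python
import gc
import torch

# If this prints False or 0, the NVIDIA driver / CUDA toolkit is the likely
# culprit, consistent with the failing nvidia-smi output.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

# After stopping training with Ctrl+C, drop references to the large objects
# and clear PyTorch's caching allocator. If memory is still held by a stale
# process, exit the interpreter or restart the instance.
# del model, optimizer   # placeholders for your own training objects
gc.collect()
torch.cuda.empty_cache()
```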
To address the memory issues, consider these approaches:
- Simplify the model structure or reduce the parameter count.
- Decrease the batch size further.
- Implement gradient accumulation.
- Use memory-efficient options such as mixed-precision training (a combined sketch of gradient accumulation with mixed precision follows this list).
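Here is a minimal sketch of gradient accumulation combined with mixed precision in PyTorch. The model, data, and names like `accum_steps` are illustrative placeholders, not taken from your setup:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

# Toy data standing in for the real dataset; small micro-batches of 4.
data = TensorDataset(torch.randn(256, 512), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=4, shuffle=True)

accum_steps = 8  # effective batch size = 4 * 8 = 32

model.train()
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    # Mixed precision: the forward pass runs largely in float16, roughly
    # halving activation memory on A100 GPUs.
    with torch.cuda.amp.autocast():
        loss = criterion(model(x), y) / accum_steps  # scale loss for accumulation
    scaler.scale(loss).backward()
    # Gradient accumulation: step the optimizer only every accum_steps
    # micro-batches, simulating a larger batch without the memory cost.
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```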
Lastly, the NUMA-related warning usually doesn't significantly impact performance and can be ignored.
If problems persist after trying these methods, closely review your model structure and data processing logic.
Thank you. If my dataset is 500 GB to 1 TB in total, what would be my best option for training quickly on it? Combining multiple EC2 P5 instances for distributed training, or is the best option just to lower the batch size?