Out of Memory Error SageMaker ml.p3.16xlarge


Link to the notebook.

I am trying to run the example notebook above on SageMaker using the gpt-j-xl model with the suggested instance of an ml.p3.16xlarge. However, I keep running into an out-of-memory error. I have tried other suggested instances (e.g., ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and val batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

[Screenshot of the out-of-memory error]

1 Answer

The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.

However, the actual command being run is

mpirun --host algo-1 -np 1 ... ...

This launches only one process on a single GPU device, and 16 GB of GPU memory is not enough to host most GPT models.
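
For reference, an ml.p3.16xlarge provides 8 V100 GPUs with 16 GB of memory each, so a launch that actually uses every GPU would look roughly like the following (hypothetical value for -np; the trailing arguments are elided as above):

mpirun --host algo-1 -np 8 ... ...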

Have you modified any parameters related to -np? What is the value of processes_per_host set before the smp_estimator.fit cell?
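
If it helps, here is a minimal sketch (not the notebook's exact code) of how processes_per_host is typically passed through the SageMaker Python SDK when the model parallel library is enabled. The entry point, IAM role, framework versions, and model-parallel parameters below are placeholders:

    # Minimal sketch: request one MPI process per GPU on an ml.p3.16xlarge (8 x V100).
    from sagemaker.pytorch import PyTorch

    processes_per_host = 8  # drives the -np value that mpirun is launched with

    smp_estimator = PyTorch(
        entry_point="train_gptj.py",              # placeholder training script
        role="<your-sagemaker-execution-role>",   # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.12",                 # placeholder versions
        py_version="py38",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "partitions": 8,          # split the model across all 8 GPUs
                        "microbatches": 4,
                        "ddp": True,
                    },
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": processes_per_host,
            },
        },
    )

    # smp_estimator.fit({"train": train_s3_uri, "test": val_s3_uri})

With processes_per_host left at 1, the library only sees a single rank, which is consistent with the -np 1 in the command above.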

AWS
answered a year ago
  • Great catch, I missed that. I'll run it through again to see if it fixes the issue. I ended up getting it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because it threw an error as well. I'll report back on what I find.
