Out of Memory Error SageMaker ml.p3.16xlarge

0

Link to the notebook.

I am trying to run the example notebook above on SageMaker with the gpt-j-xl model on the suggested instance, an ml.p3.16xlarge. However, I keep running into an out-of-memory (OOM) error. I have tried other suggested instances (e.g., ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and val batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

(Screenshot of the error attached.)

1 Answer
0

The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.

However, the actual running command is

mpirun --host algo-1 -np 1 ... ...

This launches only one process and uses a single GPU device, which cannot host most GPT models within only 16 GB of GPU memory.

Have you modified any parameters related to -np? What is the value of processes_per_host in the cell before smp_estimator.fit?
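For reference, here is a minimal sketch (not the notebook's exact code) of how such an estimator is typically configured with the SageMaker model parallel (SMP) library. The entry point, framework versions, and parallelism degrees below are illustrative assumptions:

    # Minimal sketch with assumed values, not the notebook's exact configuration.
    # An ml.p3.16xlarge has 8 V100 GPUs with 16 GB each, so processes_per_host
    # should launch one MPI rank per GPU so the model can be partitioned across
    # all devices rather than loaded onto a single 16 GB card.
    from sagemaker.pytorch import PyTorch

    smp_estimator = PyTorch(
        entry_point="train_gptj_smp.py",       # assumed script name
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.12",              # illustrative version
        py_version="py38",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "pipeline_parallel_degree": 1,  # illustrative
                        "tensor_parallel_degree": 8,    # shard layers across all 8 GPUs
                        "ddp": True,
                    },
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": 8,  # 8 GPUs -> mpirun is generated with -np 8
            },
        },
    )

With processes_per_host set to 8, the generated command in the training logs should read mpirun --host algo-1 -np 8 rather than -np 1.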

AWS
answered 1 year ago
  • Great catch, I missed that. I'll run it through again to see if it fixes the issue. I did get it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because it threw an error as well. I'll report back on what I find.
