Out of Memory Error SageMaker ml.p3.16xlarge


Link to the notebook.

I am trying to run the example notebook above on SageMaker with the gpt-j-xl model, on the suggested ml.p3.16xlarge instance. However, I keep running into an out-of-memory error. I have tried other suggested instances (e.g. ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and val batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

[Image: screenshot of the out-of-memory error]

Asked 1 year ago · Viewed 605 times
1 Answer

The runtime command doesn't look right. In the notebook, the model is trained using model parallelism, which means the whole model is partitioned and spread across all available GPU devices.

However, the actual command being run is

mpirun --host algo-1 -np 1 ... ...

This will launch only one process and use only one GPU device, and it is not possible to host most GPT models in only 16 GB of GPU memory.
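To see why a single 16 GB V100 cannot hold the model, here is a rough back-of-the-envelope estimate. It assumes a 6B-parameter model (GPT-J class) fine-tuned in fp16 with an Adam optimizer keeping fp32 state; the per-element byte counts are illustrative assumptions, and activations are not even counted.

```python
# Rough memory estimate for fine-tuning a ~6B-parameter model.
# Assumptions (for illustration only): fp16 weights and gradients,
# Adam optimizer keeping fp32 master weights plus two fp32 moments.
params = 6e9

weights_fp16 = params * 2      # 2 bytes per fp16 weight
grads_fp16 = params * 2        # 2 bytes per fp16 gradient
adam_fp32 = params * 4 * 3     # fp32 master copy + 2 Adam moment buffers

total_gb = (weights_fp16 + grads_fp16 + adam_fp32) / 1024**3
print(f"~{total_gb:.0f} GB needed before activations")
```

Even under these optimistic assumptions the total is several times the 16 GB available on one V100, which is why the model must be partitioned across all eight GPUs of the ml.p3.16xlarge.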

Have you modified any parameters relating to -np? What is the value of processes_per_host before the smp_estimator.fit cell?
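For reference, the distribution config passed to the SageMaker estimator is what drives the mpirun -np value. The sketch below follows the key names used by the SageMaker Python SDK's smdistributed model-parallel integration, but the specific parameter values are illustrative assumptions, not the notebook's exact settings:

```python
# Hypothetical sketch of a SageMaker model-parallel distribution config.
# Key names follow the SageMaker Python SDK; values are illustrative.
processes_per_host = 8  # one MPI process per GPU on an ml.p3.16xlarge (8x V100)

distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 8,  # spread the model across all 8 GPUs
            },
        }
    },
    "mpi": {
        "enabled": True,
        # This value becomes mpirun's -np; if it is 1, only one
        # process is launched and only one GPU is used.
        "processes_per_host": processes_per_host,
    },
}
```

This dict would be passed as the distribution= argument of the PyTorch estimator. If processes_per_host had been changed to 1, you would see exactly the mpirun --host algo-1 -np 1 command above.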

AWS
Answered 1 year ago
  • Great catch, I missed that. I'll run it through again to see if it fixes the issue. I ended up getting it to run but had to decrease the batch-size significantly along with a few other tweaks. I also had to adjust the processes per host value because that threw an error as well. I'll report back on what I find.
