Out of Memory Error SageMaker ml.p3.16xlarge


Link to the notebook.

I am trying to run the example notebook above on SageMaker using the gpt-j-xl model with the suggested instance, an ml.p3.16xlarge. However, I keep running into an out-of-memory error. I have tried other suggested instances (e.g., ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and validation batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

[Error screenshot]

Asked a year ago · 605 views

1 Answer

The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model should be partitioned and spread across all available GPU devices.

However, the command actually being run is

mpirun --host algo-1 -np 1 ... ...

This launches only one process and uses a single GPU device, which cannot host most GPT models with only 16 GB of GPU memory.

Have you modified any parameters related to -np? What is the value of processes_per_host before the smp_estimator.fit cell?
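
For reference, here is a minimal sketch of how the estimator's distribution block could be configured so that mpirun launches one process per GPU on an ml.p3.16xlarge (8 x V100 with 16 GB each). The entry point, framework version, and the partitions/microbatches values are illustrative placeholders, not the notebook's actual settings:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

smp_estimator = PyTorch(
    entry_point="train_gptj.py",      # placeholder script name
    role=role,
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.12",         # assumed; use the version from the notebook
    py_version="py38",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                # partitions > 1 splits the model across GPUs; 4 is illustrative
                "parameters": {"partitions": 4, "microbatches": 8, "ddp": True},
            }
        },
        # processes_per_host becomes the -np value passed to mpirun;
        # 8 uses all eight GPUs on an ml.p3.16xlarge instead of just one
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

# smp_estimator.fit(...)  # launch training with the appropriate data channels

With processes_per_host set to 8, the mpirun line in the training log should show -np 8 rather than -np 1, and the model partitions can then be spread across all eight GPUs.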

AWS
Answered a year ago
  • Great catch, I missed that. I'll run it through again to see if it fixes the issue. I ended up getting it to run, but had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because that threw an error as well. I'll report back on what I find.
