Out of Memory Error SageMaker ml.p3.16xlarge


Link to the notebook.

I am trying to run the example notebook above on SageMaker using the gpt-j-xl model with the suggested instance of an ml.p3.16xlarge. However, I keep running into an out-of-memory error. I have tried other suggested instances (e.g., ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and val batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

[Screenshot of the out-of-memory error]

1 Answer

The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.

However, the actual command being run is

mpirun --host algo-1 -np 1 ... ...

This launches only one process on a single GPU device, and 16 GB of GPU memory is not enough to host most GPT models.
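
For reference, an ml.p3.16xlarge provides 8 V100 GPUs with 16 GB of memory each, so a launch that actually uses every GPU would look roughly like the following (hypothetical value for -np; the trailing arguments are elided as above):

mpirun --host algo-1 -np 8 ... ...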

Have you modified any parameters related to -np? What is the value of processes_per_host set before the smp_estimator.fit cell?
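
If it helps, here is a minimal sketch (not the notebook's exact code) of how processes_per_host is typically passed through the SageMaker Python SDK when the model parallel library is enabled. The entry point, IAM role, framework versions, and model-parallel parameters below are placeholders:

    # Minimal sketch: request one MPI process per GPU on an ml.p3.16xlarge (8 x V100).
    from sagemaker.pytorch import PyTorch

    processes_per_host = 8  # drives the -np value that mpirun is launched with

    smp_estimator = PyTorch(
        entry_point="train_gptj.py",              # placeholder training script
        role="<your-sagemaker-execution-role>",   # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.12",                 # placeholder versions
        py_version="py38",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "partitions": 8,          # split the model across all 8 GPUs
                        "microbatches": 4,
                        "ddp": True,
                    },
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": processes_per_host,
            },
        },
    )

    # smp_estimator.fit({"train": train_s3_uri, "test": val_s3_uri})

With processes_per_host left at 1, the library only sees a single rank, which is consistent with the -np 1 in the command above.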

AWS
answered a year ago
  • Great catch, I missed that. I'll run it through again to see if it fixes the issue. I ended up getting it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because it threw an error as well. I'll report back on what I find.
