1 Answer
The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.
However, the command actually being run is
mpirun --host algo-1 -np 1 ... ...
This launches only one process and uses a single GPU device, which cannot host most GPT models with only 16 GB of GPU memory.
Have you modified any parameters related to -np? What is the value of processes_per_host before the smp_estimator.fit cell?
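For reference, a minimal sketch of the kind of distribution config involved (the parameter values here are illustrative, not taken from your notebook): the MPI world size that mpirun launches is instance_count * processes_per_host, so with processes_per_host set to 1 the job degenerates to a single process on a single GPU.

```python
# Illustrative SageMaker model-parallel distribution config (values are
# assumptions, not from the original notebook). The "mpi" block controls
# how many processes mpirun starts per instance; the model-parallel
# "partitions" must be covered by the total number of launched processes.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 8,  # number of model partitions
            },
        }
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # e.g. one process per GPU on an 8-GPU instance
    },
}

instance_count = 1
# mpirun is launched with -np == instance_count * processes_per_host
world_size = instance_count * distribution["mpi"]["processes_per_host"]
partitions = distribution["smdistributed"]["modelparallel"]["parameters"]["partitions"]

# With processes_per_host=1 this check would fail: one process cannot
# hold all 8 partitions of the model.
assert world_size >= partitions
print(world_size)  # → 8
```

If processes_per_host were 1 here, mpirun would be invoked with -np 1, matching the single-process command shown above.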
answered a year ago
Great catch, I missed that. I'll run it through again to see if it fixes the issue. I ended up getting it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because that threw an error as well. I'll report back on what I find.