1 Answer
The runtime command doesn't look right. In the notebook, the model is trained with model parallelism, which means the whole model is partitioned and spread across all available GPU devices.
However, the command that actually runs is
mpirun --host algo-1 -np 1 ... ...
This launches only one process on a single GPU device, and most GPT models cannot fit in only 16 GB of GPU memory.
Have you modified any parameters related to -np? What is the value of processes_per_host before the smp_estimator.fit cell?
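For reference, a minimal sketch of where processes_per_host lives in a SageMaker distribution config (the key names follow the SageMaker model-parallel examples; the specific values here are illustrative assumptions, not taken from the original notebook):

```python
# Sketch of the "distribution" argument typically passed to a SageMaker
# PyTorch Estimator for model-parallel training. The values below are
# hypothetical; they only show the relationship between partitions and
# processes_per_host.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 8,   # split the model across 8 GPUs
                "microbatches": 4,
            },
        }
    },
    "mpi": {
        "enabled": True,
        # processes_per_host controls the "-np" value mpirun is launched
        # with. If this is 1, only one process (one GPU) is used, so the
        # model is never actually partitioned across devices.
        "processes_per_host": 8,
    },
}

print(distribution["mpi"]["processes_per_host"])
```

On a single-instance job, processes_per_host should generally match the number of GPUs you intend the partitions to occupy; a value of 1 reproduces the mpirun -np 1 behavior described above.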
answered a year ago
Great catch, I missed that. I'll run it through again to see if it fixes the issue. I did eventually get it to run, but I had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes_per_host value because it threw an error as well. I'll report back on what I find.