What to do after my training job fails with "InternalServerError"?

0

I have a training job with resnet-50 on a 50GB/ml.p2.xlarge (1 instance), Pipe mode, object detection model with 1073 images in training and 268 images in validation. According to CloudWatch, it ran up to epoch 77 (for about 4 hours) but then failed with no specific message recorded in CloudWatch. I only get the dreaded "InternalServerError: We encountered an internal error. Please try again." which is not okay because it costs money and I need to know what is failing.

The CPU utilization is stable at around 270% (a number that needs to be divided by the number of vCPUs which is 4 so really this is about 68% per vCPU), GPU Utilization is constant at under 60%, GPU Memory Utilization is constant at around 18%, Memory Utilization is constant at 3.2%, Disk Utilization is stable at 0.22%.

Is there an obvious mistake I am doing? Thanks for the help!

fascani
posta un anno fa99 visualizzazioni
Nessuna risposta

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande