What should I do after my training job fails with "InternalServerError"?

I have a ResNet-50 object detection training job running in Pipe mode on one ml.p2.xlarge instance with 50 GB of storage, with 1073 training images and 268 validation images. According to CloudWatch, it ran up to epoch 77 (about 4 hours) and then failed with no specific message recorded in the logs. All I get is the dreaded "InternalServerError: We encountered an internal error. Please try again." That is not good enough: every run costs money, and I need to know what is failing.
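One place to look for more detail than the console message is the job description returned by the SageMaker API. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names `summarize_training_job` and `fetch_description` and the sample dictionary are hypothetical and only for illustration. For an internal error the fields may contain nothing beyond the same generic message.

```python
from typing import Any, Dict, List


def summarize_training_job(description: Dict[str, Any]) -> List[str]:
    """Collect the status fields of a DescribeTrainingJob response as text lines."""
    lines = [f"Status: {description.get('TrainingJobStatus', 'unknown')}"]
    secondary = description.get("SecondaryStatus")
    if secondary:
        lines.append(f"SecondaryStatus: {secondary}")
    reason = description.get("FailureReason")
    if reason:
        lines.append(f"FailureReason: {reason}")
    # Each transition records a phase of the job and an optional message.
    for transition in description.get("SecondaryStatusTransitions", []):
        status = transition.get("Status", "unknown")
        message = transition.get("StatusMessage", "")
        lines.append(f"{status}: {message}")
    return lines


def fetch_description(job_name: str) -> Dict[str, Any]:
    """Call DescribeTrainingJob (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the summary helper works without boto3

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


if __name__ == "__main__":
    # Made-up sample response for illustration, not real job output.
    sample = {
        "TrainingJobStatus": "Failed",
        "SecondaryStatus": "Failed",
        "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
        "SecondaryStatusTransitions": [
            {"Status": "Training", "StatusMessage": "Training in progress"},
            {"Status": "Failed", "StatusMessage": "Training job failed"},
        ],
    }
    for line in summarize_training_job(sample):
        print(line)
```

To inspect a real job, pass the result of `fetch_description("<your-job-name>")` to `summarize_training_job` instead of the sample dictionary.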

CPU utilization is stable at around 270%. That number has to be divided by the number of vCPUs, which is 4, so it is really about 68% per vCPU. GPU utilization is constant at under 60%, GPU memory utilization at around 18%, memory utilization at 3.2%, and disk utilization at 0.22%.
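The per-vCPU figure above is just the aggregate percentage divided by the vCPU count. A small sketch of that arithmetic, where the function name `per_vcpu_utilization` is made up for illustration:

```python
def per_vcpu_utilization(aggregate_percent: float, vcpu_count: int) -> float:
    """Convert an aggregate CPU percentage (summed over vCPUs) to a per-vCPU average."""
    if vcpu_count <= 0:
        raise ValueError("vcpu_count must be positive")
    return aggregate_percent / vcpu_count


# 270% across the 4 vCPUs of an ml.p2.xlarge
print(per_vcpu_utilization(270, 4))  # 67.5
```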

Is there an obvious mistake I am making? Thanks for the help!

fascani
asked a year ago · 99 views
No answers
