What to do after my training job fails with "InternalServerError"?


I have an object detection training job (ResNet-50 base network) running in Pipe mode on a single ml.p2.xlarge instance with a 50 GB volume, using 1073 training images and 268 validation images. According to CloudWatch, it ran up to epoch 77 (about 4 hours) and then failed with no specific message recorded in the logs. All I get is the dreaded "InternalServerError: We encountered an internal error. Please try again.", which is not acceptable: the run costs money, and I need to know what is failing.
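For anyone wanting to check whether the job record holds more detail than the console message, here is a minimal sketch that reads the `FailureReason` and secondary status messages from the SageMaker `DescribeTrainingJob` response. It assumes boto3 and valid AWS credentials; the helper name `summarize_failure` and the job name in the usage comment are hypothetical, and the response may contain nothing beyond the same generic error.

```python
def summarize_failure(sm_client, job_name):
    """Collect the status fields recorded for a SageMaker training job.

    sm_client is expected to be a boto3 SageMaker client (assumption);
    job_name is the training job's name.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": desc.get("TrainingJobStatus"),
        # FailureReason is only present when the job has failed.
        "failure_reason": desc.get("FailureReason"),
        # Each transition may carry a short StatusMessage.
        "transitions": [
            (t.get("Status"), t.get("StatusMessage"))
            for t in desc.get("SecondaryStatusTransitions", [])
        ],
    }

# Usage (needs AWS credentials; the job name is a placeholder):
# import boto3
# print(summarize_failure(boto3.client("sagemaker"), "my-training-job"))
```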

The resource metrics all look stable. CPU utilization is around 270%; that number has to be divided by the number of vCPUs, which is 4, so it is really about 68% per vCPU. GPU utilization is constant at under 60%, GPU memory utilization is constant at around 18%, memory utilization is constant at 3.2%, and disk utilization is stable at 0.22%.
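The per-vCPU figure can be reproduced with a small sketch. The function name is only illustrative, and the 4 vCPU count is the one stated in this post.

```python
def per_vcpu_utilization(total_percent, vcpus):
    """Convert an aggregate CPU utilization percentage into a per-vCPU average."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return total_percent / vcpus

# 270% aggregate across 4 vCPUs
print(per_vcpu_utilization(270, 4))  # 67.5
```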

Is there an obvious mistake I am making? Thanks for the help!

fascani
Asked 1 year ago · 99 views
No answers
