In: sklearn_estimator.latest_training_job.describe()['FailureReason']
Out:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError:
ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python train.py"
ExecuteUserScriptErro
Hello, is the reported problem similar to this issue reported on the SageMaker Python SDK project?
Hello, thank you for your response! :)
No, it's a different problem.
In that issue the author uses a ValueError when reading input data in input_fn, but the failure there is unlikely to be caused by that error (reading the data took too long, 10 hours). Even if it was, my question is different.
How can information about a failed training job be passed through FailureReason, ErrorMessage, or other parameters? I trigger the failure by raising my own error, and I want to understand how that information can be passed on and collected.
Hi, did you try writing to /opt/ml/output/failure as per the doc here?
It's worth mentioning that in the past there was a bug in the base training toolkit that powers "script mode" containers which overwrote this file. This got resolved at source per the linked issue, but I guess there's a chance that older containers, or frameworks which customize this tool, could still be affected. So it may be worth upgrading your framework version if you're using an older one.
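As a minimal sketch of what writing the failure file could look like inside a training script: the helper names `write_failure` and `run_training` are hypothetical, and the 1024-character cut reflects the FailureReason limit mentioned in this thread. Whether the toolkit then leaves the file untouched depends on the container version, per the bug described above.

```python
import os
import traceback

# Path where SageMaker looks for a failure description.
FAILURE_PATH = "/opt/ml/output/failure"
# FailureReason only carries the first 1024 characters.
MAX_REASON_CHARS = 1024


def write_failure(message, path=FAILURE_PATH):
    """Write a short failure description to the failure file (hypothetical helper)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(message[:MAX_REASON_CHARS])


def run_training(train_fn, path=FAILURE_PATH):
    """Run train_fn; on error, record the exception and return a non-zero exit code."""
    try:
        train_fn()
    except Exception as exc:
        # Exception type and message go first so they survive truncation;
        # the traceback follows and is cut off if it does not fit.
        summary = "{}: {}\n{}".format(type(exc).__name__, exc, traceback.format_exc())
        write_failure(summary, path)
        return 1
    return 0
```

In train.py this could be wired up as `sys.exit(run_training(main))`, where `main` stands for your own training function, so the job still exits with a non-zero code after the file is written.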
Hi, Alex! Your comment is really helpful! I will add my answer with attachments below.
This issue looks related to my problem.
There are 2 sagemaker-scikit-learn versions: 0.23-1, 0.20-0.
So, I use:
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3
sagemaker version: 2.86.0 (I tried to upgrade to 2.90.0 with pip in the terminal, but the notebooks still show the previous version, 2.86)
Python - 3.7
How can I tell which versions I should use to get rid of the ErrorMessage problem?
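One possible reason for the 2.86 vs 2.90 mismatch is that pip in the terminal installs into a different Python environment than the one the notebook kernel runs; that is an assumption, not something confirmed in this thread. A small sketch to check it from a notebook cell (the helper `installed_version` is hypothetical):

```python
import importlib
import sys


def installed_version(module_name):
    """Return module.__version__ as seen by this interpreter, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# The interpreter behind the current kernel, and the SDK version it sees.
print("interpreter:", sys.executable)
print("sagemaker:", installed_version("sagemaker"))
```

If the interpreter path differs from the one pip used in the terminal, upgrading with that interpreter's own pip (`python -m pip install --upgrade sagemaker`, run with the path printed above) and restarting the kernel may help. Note that this only affects the SDK in the notebook; the training toolkit that handles the failure file ships inside the container image.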
Look, ErrorMessage "" is empty. The FailureReason output is limited to 1024 characters. I looked at the full text, and it is useless too. So I have no way to get any failure information, even though the failure was caused by my own failure scenario and error.
This is the trace from the failed training output in the SageMaker notebook: