Hi,
I am using SageMaker for a computer vision project. The goal is to train an Object Detection model on SageMaker and create an Endpoint. We followed the AWS instructions to prepare a dataset consisting of image files and a *.manifest file, stored in a new S3 bucket in the same region as the SageMaker notebook.
We use the notebook (http://aws-tc-largeobjects.s3-us-west-2.amazonaws.com/DIG-TF-200-MLBEES-10-EN/demo.ipynb), which we downloaded from a link provided in an AWS YouTube video (https://www.youtube.com/watch?v=OFlu6Gd7CrQ).
We followed the notebook's instructions to load the images and the *.manifest file, ran the code, and then created a Training job, but it failed many times with the following error:
"Failure reason
ClientError: Cannot resume training. Checkpoint hyperparameters are missing. Please check the checkpoint hyperparameters file exists on S3., exit code: 2"
The instance type used is p2.xlarge.
I have no idea what this error means or what a checkpoint hyperparameters file is. I checked my S3 bucket, and no hyperparameters file exists there.
I also checked that all hyperparameters were set correctly during job creation. Here is the list from the job report:
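In case it helps anyone reproduce the S3 check, here is a minimal sketch of how it could be scripted, assuming boto3 is installed and AWS credentials are configured. The helper names `split_s3_uri` and `list_checkpoint_keys` and the job name in the usage comment are made up for illustration. The sketch reads the checkpoint location (if any) from the training job description and lists what is stored under it.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_checkpoint_keys(sm_client, s3_client, job_name):
    """Return the object keys under the job's checkpoint location.

    Returns None when the job description has no checkpoint location.
    """
    job = sm_client.describe_training_job(TrainingJobName=job_name)
    checkpoint_uri = job.get("CheckpointConfig", {}).get("S3Uri")
    if checkpoint_uri is None:
        return None
    bucket, prefix = split_s3_uri(checkpoint_uri)
    # A single call returns at most 1000 keys, which is enough for a quick look.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage (hypothetical job name; requires boto3 and AWS credentials):
#   import boto3
#   keys = list_checkpoint_keys(
#       boto3.client("sagemaker"), boto3.client("s3"), "my-training-job"
#   )
#   print(keys)
```

A result of `None` would mean no checkpoint location is set on the job, while an empty list would mean a location is set but nothing is stored under it.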
Hyperparameters
Key Value
base_network resnet-50
early_stopping false
early_stopping_min_epochs 10
early_stopping_patience 5
early_stopping_tolerance 0.0
epochs 30
freeze_layer_pattern false
image_shape 300
label_width 350
learning_rate 0.001
lr_scheduler_factor 0.1
mini_batch_size 1
momentum 0.9
nms_threshold 0.45
num_classes 1
num_training_samples 400
optimizer adam
overlap_threshold 0.5
use_pretrained_model 1
weight_decay 0.0005
Thanks for your help!