For the last few days, my training job times have blown out, and the logs show over an hour spent downloading the training image. I'm using spot instances for training - is this a symptom of that? It seems unlikely, because I'd assumed that if a spot instance wasn't available I'd get a capacity error, or at least the job wouldn't have started preparing the instances. I'm using the HuggingFace estimator with the following versions:
transformers_version="4.28", # Transformers version
pytorch_version="2.0", # PyTorch version
py_version="py310", # Python version
16:10:41 2023-10-10 05:10:10 Starting - Starting the training job...
16:11:41 2023-10-10 05:10:29 Starting - Preparing the instances for training......
16:12:11 2023-10-10 05:11:26 Downloading - Downloading input data...
16:19:14 2023-10-10 05:11:47 Training - Downloading the training image..........................................
17:27:40 2023-10-10 05:18:44 Training - Training image download completed. Training in progress.........................................................................................................................................................................................................................................................................................................................................................................................................................
17:28:42 2023-10-10 06:27:30 Uploading - Uploading generated training model......
17:28:42 2023-10-10 06:28:16 Completed - Training job completed
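In case it helps with diagnosis, the same phase timings can be pulled from the SageMaker API instead of the console output; a minimal sketch, assuming boto3 credentials are configured and using a placeholder job name:

import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_training_job(TrainingJobName="my-training-job-name")  # placeholder job name

# SecondaryStatusTransitions lists when each phase (Starting, Downloading,
# Training, Uploading) began, so the slow phase should stand out here.
for transition in desc["SecondaryStatusTransitions"]:
    print(transition["Status"], transition["StartTime"], transition.get("StatusMessage", ""))

# With managed spot, BillableTimeInSeconds can be well below TrainingTimeInSeconds
# if the job spent time waiting on capacity.
print("TrainingTimeInSeconds:", desc.get("TrainingTimeInSeconds"))
print("BillableTimeInSeconds:", desc.get("BillableTimeInSeconds"))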