How to verify that checkpoints work for SageMaker Spot Training?



How can we confirm that checkpointing works before launching a SageMaker Spot Training job? Is there a way to force regular checkpoints to S3 instead of waiting for the SIGTERM?


asked 4 years ago · 369 views
1 Answer
Accepted Answer

Hi olivier, if you enable SageMaker checkpointing, it periodically saves a copy of the artifacts to S3. I have used this with PyTorch, and it works by checkpointing periodically. The blog post "Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs" says the same:

"To avoid restarting a training job from scratch should it be interrupted, we strongly recommend that you implement checkpointing, a technique that saves the model in training at periodic intervals."
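One way to check the save-and-resume logic before paying for a Spot job is to run it locally against a temporary directory and simulate an interruption. The sketch below is only an illustration: it uses a toy loop and JSON files in place of a real model, and the helper names (`save_checkpoint`, `load_checkpoint`, `train`, `verify_resume`) are made up for this example. It assumes your script writes checkpoints to a local directory such as `/opt/ml/checkpoints`, the default local path that SageMaker syncs to the `checkpoint_s3_uri` you set on the estimator. In a PyTorch script you would swap the JSON calls for `torch.save` and `torch.load`.

```python
import json
import os
import tempfile

# Default local checkpoint path inside a SageMaker training container.
# Files written here are synced to the S3 location given by checkpoint_s3_uri.
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"


def save_checkpoint(checkpoint_dir, epoch, state):
    """Write the checkpoint to a temp file, then rename it into place."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, "checkpoint.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp_path, final_path)


def load_checkpoint(checkpoint_dir):
    """Return the saved checkpoint dict, or None if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(checkpoint_dir, total_epochs, stop_after=None):
    """Toy training loop (hypothetical). Returns the epochs run in this call."""
    ckpt = load_checkpoint(checkpoint_dir)
    start_epoch = ckpt["epoch"] + 1 if ckpt else 0
    state = ckpt["state"] if ckpt else {"steps": 0}
    executed = []
    for epoch in range(start_epoch, total_epochs):
        state["steps"] += 1  # stand-in for one real training epoch
        save_checkpoint(checkpoint_dir, epoch, state)  # checkpoint every epoch
        executed.append(epoch)
        if stop_after is not None and epoch == stop_after:
            break  # simulate a Spot interruption
    return executed


def verify_resume():
    """Run, interrupt, and rerun against the same directory."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        first = train(ckpt_dir, total_epochs=5, stop_after=1)
        second = train(ckpt_dir, total_epochs=5)
        final = load_checkpoint(ckpt_dir)
    return first, second, final


if __name__ == "__main__":
    print(verify_resume())
```

If the second run starts from the epoch after the last saved one instead of from zero, the resume path works. This only tests your script's logic; the sync to S3 itself is handled by SageMaker, so for an end-to-end check you could run a short job and look for new objects under the checkpoint S3 prefix while it trains.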

answered 4 years ago
reviewed a month ago
