SageMaker training job metrics are time series: your job can log multiple values of a metric such as train:mae over time as it trains, which lets long-running training jobs report metrics continuously for monitoring (and possibly to trigger early stopping).
This is why metrics are generally described by summary statistics. You can usually see these time series charts in Run > Charts within Studio, or on the training job details page in the AWS Console. If your training job is very short, though, the graphs might not be that interesting: I believe they aggregate data points to a 1-minute or 5-minute granularity by default.
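To make the "time series plus summary statistics" idea concrete, here is a minimal sketch. It assumes the script prints lines containing train:mae=<value> and that a regex picks the values out of the log; the pattern, the function names and the choice of population standard deviation are my own assumptions for illustration, not how SageMaker computes its summaries internally.

```python
import re
import statistics

# Hypothetical pattern, written in the style of a training job metric regex.
MAE_PATTERN = re.compile(r"train:mae=([0-9.]+)")


def extract_series(log_text):
    """Return every value of the metric found in the log, in log order."""
    return [float(value) for value in MAE_PATTERN.findall(log_text)]


def summarize(values):
    """Collapse a metric time series into a few summary statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": statistics.fmean(values),
        # Assumption: population standard deviation; the service may differ.
        "std_dev": statistics.pstdev(values),
        "final": values[-1],
    }


# Simulated log output from a script that reports the metric once per epoch.
log_text = "\n".join(
    f"epoch={epoch} train:mae={mae}"
    for epoch, mae in enumerate([200.0, 150.0, 140.0, 135.0])
)
summary = summarize(extract_series(log_text))
print(summary)
```

With an error metric that keeps improving like this one, "final" and "min" coincide; they only diverge when the last logged value is not the best one.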
So which is the important statistic to look at? Usually 'Final value', but it depends on what your script is doing.
For example, if you're training a model with checkpointing and automatic stopping, accuracy could get worse for a few iterations before the script detects the issue, stops training, and reloads the best-performing model from a checkpoint. In that case, you could either make sure your script re-logs the final accuracy score so that "Final" is consistent with the final model, or just refer to "Max".
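A rough sketch of the re-logging option, assuming a simple patience-based early stopping rule. The function name, the patience parameter and the validation:accuracy metric name are all hypothetical, and the per-epoch accuracies are passed in as a list instead of coming from a real model.

```python
def train_with_early_stopping(val_accuracies, patience=2):
    """Simulate early stopping over precomputed per-epoch accuracies.

    Returns the list of logged metric values and the best epoch index.
    """
    logged = []
    best_acc, best_epoch, bad_epochs = float("-inf"), -1, 0

    for epoch, acc in enumerate(val_accuracies):
        # Each printed line becomes one point in the metric time series.
        print(f"epoch={epoch} validation:accuracy={acc}")
        logged.append(acc)

        if acc > best_acc:
            # A real script would save a checkpoint here.
            best_acc, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs

    # A real script would reload the best checkpoint here, then log its
    # score again so the last data point describes the model that is kept.
    print(f"final validation:accuracy={best_acc}")
    logged.append(best_acc)
    return logged, best_epoch


logged, best_epoch = train_with_early_stopping([0.70, 0.80, 0.78, 0.75, 0.90])
```

Without that last print, the final logged value here would be 0.75 even though the model kept is the one that scored 0.80.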
Alternatively, if you have a script that does cross-validation and you want the summary statistics to accurately reflect this (e.g. standard deviation is the deviation of accuracy between folds, average is the average over folds, etc.), then you would want to make sure your script logs the metric exactly once for each validation fold, with no repetitions during the training process. That's nice for cross-validation statistics, but those metrics then wouldn't give you continuous insight into the model as it trained (if that's something you want).
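The cross-validation case might look something like the sketch below. The fold scores are made-up numbers and log_fold_metrics is a hypothetical helper; the point is only that one log line per fold makes the average and standard deviation over the series line up with the per-fold statistics.

```python
import statistics


def log_fold_metrics(fold_scores):
    """Emit exactly one metric line per validation fold and return the lines."""
    lines = [
        f"fold={fold} validation:accuracy={score}"
        for fold, score in enumerate(fold_scores)
    ]
    for line in lines:
        print(line)
    return lines


# Made-up accuracies, one per fold, for illustration only.
fold_scores = [0.80, 0.90, 0.85, 0.85]
lines = log_fold_metrics(fold_scores)

# With one data point per fold, summary statistics over the logged series
# are the cross-validation statistics.
mean_accuracy = statistics.fmean(fold_scores)
# Assumption: population standard deviation; the service may differ.
fold_spread = statistics.pstdev(fold_scores)
```

If the script also logged the same metric during training inside each fold, those extra points would be mixed into the average and standard deviation.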
I couldn't find either 135.2... or 186.2... in your Runs table's OptimizationMetric column, so I think your second screenshot may be omitting the record for the run shown in the first screenshot. But the Final value 135.2... is the one I'd usually expect to see listed in the summary value.