SageMaker Model Monitor Missing Columns Constraint Violation


I have an inference pipeline model deployed to an endpoint from an AutoPilot training job. Now that this works, I want to add Model Monitor. I have a script for online validation of the endpoint, and the F1 score is ~99%, which indicates the endpoint interprets the calls correctly.

Model Monitor does not recognize the data in my jsonl capture files as CSV formatted. When my Model Monitor processing job runs, I receive the following constraint violation: "There are missing columns in current dataset. Number of columns in current dataset: 1, Number of columns in baseline constraints: 225".

Given the results from the endpoint and this Model Monitor constraint violation, there appears to be a mismatch between how the endpoint stores the captured data and how the Model Monitor processing job expects to consume it.

Here is one sample prediction from the jsonl file. The data value is comma-separated.

{"captureData":{"endpointInput":{"observedContentType":"text/csv","mode":"INPUT","data":"JHB,44443000.0,-0.0334,,44264000.0,,,,-2014000.0,,-2014000.0,,,,,,,-0.04,-0.04,55872000.0,,,0.996,,,,,,,,-0.0453,,2845000.0,,2845000.0,11636000.0,,,,,,,,,,,,190000000.0,,,,,,,,-18718000.0,,,,,,,,29000000.0,,,,,,,,-33000000.0,,-4000000.0,,,,,,,,,,,,,,,0.0,,,0.995972369102,1.0,-0.045316472785366,0.0,,,,,,,0.0,,,,,,,,,95.5638,,,,,,1.0,1.0,,0.15263157894737,,,,,,0.65252120693923,0.0,0.15263157894737,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18606500.0,,,95.5638,,,2.3886,,,,,-0.0326,,-1.0449,,-1.05,-1.05,,0.0,,-0.1471,,,,,,,,,,,,,,,,,-0.5451,,,,,,,Financial Services,16.67890010036862","encoding":"CSV"},"endpointOutput":{"observedContentType":"text/csv; charset=utf-8","mode":"OUTPUT","data":"1\n","encoding":"CSV"}},"eventMetadata":{"eventId":"c97df615-0a2e-414d-9be3-bf3a14eb6363","inferenceTime":"2020-04-15T16:26:46Z"},"eventVersion":"0"}
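A quick way to see the mismatch (a local sketch, not part of Model Monitor) is to parse one captured line and count the CSV columns yourself. The sample below truncates the input values for brevity; the real record has 224 input columns plus the output:

```python
import json

# One line from the data-capture .jsonl file (input values truncated for brevity)
line = ('{"captureData":{"endpointInput":{"observedContentType":"text/csv",'
        '"mode":"INPUT","data":"JHB,44443000.0,-0.0334","encoding":"CSV"},'
        '"endpointOutput":{"observedContentType":"text/csv; charset=utf-8",'
        '"mode":"OUTPUT","data":"1\\n","encoding":"CSV"}}}')

record = json.loads(line)
input_csv = record["captureData"]["endpointInput"]["data"]
output_csv = record["captureData"]["endpointOutput"]["data"]

# Column count of the captured input row
n_input_cols = len(input_csv.split(","))
print(n_input_cols)      # 3 in this truncated sample
print(repr(output_csv))  # '1\n' -- note the trailing newline on the output
```

The trailing newline on the endpoint output turns out to be the relevant detail, as the accepted answer below explains.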

Here is the point in the log where the processing job detects the column mismatch. It pulls down the data to store locally, pulls down the statistics and constraints files, raises this violation, and then ends the processing job gracefully. If more logs are needed for analysis, I have the processing job logs in CloudWatch Logs.

2020-04-15 17:11:49 INFO  FileUtil:66 - Read file from path /opt/ml/processing/baseline/constraints/constraints.json.
2020-04-15 17:11:50 INFO  FileUtil:66 - Read file from path /opt/ml/processing/baseline/stats/statistics.json.
2020-04-15 17:11:50 ERROR DataAnalyzer:65 - There are missing columns in current dataset. Number of columns in current dataset: 1, Number of columns in baseline constraints: 225
Skipping further processing because of column count mismatch.

I could not find Model Monitor documentation on how to deal with column mismatch constraint violations.

AWS
blayze
asked 4 years ago
1 Answer
Accepted Answer

That violation fires when, for example, the input to your endpoint has fewer columns than the baseline input. It exists to flag data quality issues. https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-interpreting-violations.html

In this case, however, the violation is an artifact of how we perform the analysis. We concatenate the output and input CSVs into a single CSV so we can analyze everything in one pass. It looks like:

output_col,input_col_1,input_col_2,...,input_col_n

Your output, however, has a trailing newline, which means that after concatenation it looks like:

output_col # embedded newline in your output
,input_col_1,input_col_2,...,input_col_n

That leads the code to conclude there is only one column in the dataset, and the job fails.
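The effect is easy to reproduce outside of Model Monitor. This is a minimal sketch, not the monitor's actual code, assuming the output and input are joined with a comma as described above:

```python
import csv
import io

endpoint_output = "1\n"                     # captured output, with trailing newline
endpoint_input = "JHB,44443000.0,-0.0334"   # captured input row (truncated)

# Concatenate output and input into what should be one CSV row
combined = endpoint_output + "," + endpoint_input
rows = list(csv.reader(io.StringIO(combined)))

# The embedded newline splits the record into two rows;
# the first row contains a single column, hence the violation.
print(rows[0])  # ['1']
print(rows[1])  # ['', 'JHB', '44443000.0', '-0.0334']
```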

We have a fix flowing through the pipeline now. While that rolls out, you can add a preprocessing script to your monitoring schedule to strip the trailing newline from the output. We will create a sample notebook for this; in the meantime, the docs are at https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html#model-monitor-pre-processing-script
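A record preprocessor along these lines could do the stripping. The `preprocess_handler` entry point is what Model Monitor invokes per the pre/post-processing docs; the zero-padded-key return format and the mock record used for the local check are illustrative assumptions, not the monitor's exact internals:

```python
from types import SimpleNamespace


def preprocess_handler(inference_record):
    """Record preprocessor: strip the trailing newline from the captured
    output before it is concatenated with the input columns."""
    input_data = inference_record.endpoint_input.data
    output_data = inference_record.endpoint_output.data.rstrip("\n")
    flattened = output_data + "," + input_data
    # Key each column by a zero-padded index so column order is preserved
    return {str(i).zfill(20): col for i, col in enumerate(flattened.split(","))}


# Quick local check with a mock record (SimpleNamespace stands in for the
# object Model Monitor passes in at runtime)
mock = SimpleNamespace(
    endpoint_input=SimpleNamespace(data="JHB,44443000.0,-0.0334"),
    endpoint_output=SimpleNamespace(data="1\n"),
)
columns = preprocess_handler(mock)
print(list(columns.values()))  # ['1', 'JHB', '44443000.0', '-0.0334']
```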

AWS
answered 4 years ago
