Model error from deployed SageMaker endpoint

I fine-tuned a DistilBERT model, saved it as 'model.pth', wrote an 'inference.py' script, and packaged both into a '.tar.gz' file. The deployment succeeded and an endpoint was created. However, I get an error when I try to invoke the endpoint.

The error is: ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."

The inference.py code is below:

import json
import torch
import logging
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import os

# Set up logging
logging.basicConfig(level=logging.INFO)

# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Define model_fn function to load the model
def model_fn(model_dir):
    # Load the model architecture
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)  # Assuming you have 3 labels
    
    # Load the model state dictionary
    model_state_path = os.path.join(model_dir, 'model.pth')
    model.load_state_dict(torch.load(model_state_path, map_location=torch.device('cpu')))  # Load the model on CPU
    
    # Set the model in evaluation mode
    model.eval()
    
    return model

# Define the predict function
def predict(review_text, model):
    encoding = tokenizer.encode_plus(
        review_text,
        add_special_tokens=True,
        max_length=512,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt',
        truncation=True
    )

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    logits = outputs.logits  # Get the logits from the output
    prediction = torch.argmax(logits, dim=1).item()
    label_dict = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
    sentiment = label_dict[prediction]

    return sentiment

# Define input and output functions
def input_fn(input_data, content_type):
    logging.info("Input function invoked")
    if content_type == 'application/json':
        data = json.loads(input_data)
        return data['review_text']
    else:
        raise ValueError(f'Unsupported content type: {content_type}')

def output_fn(prediction_output, accept):
    logging.info("Output function invoked")
    return str(prediction_output)

def predict_fn(input_data, model):
    logging.info("Predict function invoked")
    return predict(input_data, model)

Anoop
asked 5 months ago, 294 views
1 Answer
It looks like you're using either the HuggingFace or PyTorch framework containers to deploy your model.

Tarball structure

In either case (as documented here for PyTorch, here for HF), your inference.py should be located in a code/ subfolder of your model tarball, but from the question it sounds like you might have it in the root.

If you're preparing your tarball by hand, also check that it extracts directly into the current directory (.) and doesn't, for example, create a wrapper subfolder so that the weights end up at model/model.pth when you unpack it. I've sometimes created tarballs with a command like tar -czf ../model.tar.gz . run from inside the folder where I've prepared my artifacts.
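As a minimal sketch of that layout, the following Python uses the standard tarfile module to pack a prepared folder so that members sit at the archive root (model.pth, code/inference.py) rather than under a wrapper directory, and then lists the members to check the result. The helper names and the folder passed in are illustrative assumptions, not part of the SageMaker SDK.

```python
import tarfile
from pathlib import Path


def build_model_tarball(artifact_dir, output_path):
    """Pack the contents of artifact_dir at the archive root (no wrapper folder).

    output_path should live outside artifact_dir so the archive does not
    end up containing itself.
    """
    artifact_dir = Path(artifact_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        for path in sorted(artifact_dir.rglob("*")):
            # arcname is relative to artifact_dir, e.g. "code/inference.py"
            arcname = path.relative_to(artifact_dir).as_posix()
            tar.add(path, arcname=arcname, recursive=False)
    return output_path


def list_members(tarball_path):
    """Return the sorted file paths stored in the archive."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def check_layout(tarball_path, script="code/inference.py"):
    """True if the inference script sits in the code/ subfolder of the archive."""
    return script in list_members(tarball_path)
```

Calling list_members("model.tar.gz") on an existing archive is a quick way to spot a stray model/ prefix or an inference.py sitting in the root before you spend time on a deployment.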

Model format

Since you're already providing a custom model_fn, you don't need to go to the effort of converting to a model.pth if you don't want to. For HuggingFace models I find it easier to save to the target folder with, for example, Trainer.save_model(), and then at inference time you can load directly:

model = DistilBertForSequenceClassification.from_pretrained(model_dir)

As shown in the linked example, I'd probably save the tokenizer in your tarball too, to avoid any hidden external dependencies (your current inference.py fetches 'distilbert-base-uncased' by name when the script is loaded).
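Putting those two suggestions together, a model_fn along the following lines would load both the model and the tokenizer from the extracted tarball. This is a sketch under assumptions: it presumes the model was written with save_pretrained() or Trainer.save_model() and the tokenizer with tokenizer.save_pretrained() into the same folder, and returning a dict is just one illustrative way to hand both objects to predict_fn.

```python
def model_fn(model_dir):
    """Load the fine-tuned model and its tokenizer from the extracted tarball."""
    # Imported inside the function only to keep this sketch importable without
    # the library installed; a real inference.py would usually import at the top.
    from transformers import (
        DistilBertForSequenceClassification,
        DistilBertTokenizerFast,
    )

    # Assumption: model_dir contains the config and weights written by
    # save_pretrained()/Trainer.save_model(), plus the tokenizer files
    # written by tokenizer.save_pretrained().
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_dir)

    model.eval()  # inference mode: disables dropout
    return {"model": model, "tokenizer": tokenizer}
```

With this shape, predict_fn would read bundle["model"] and bundle["tokenizer"] from its second argument instead of relying on a module-level tokenizer.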

Debugging

Your endpoint's CloudWatch logs are usually the best place to look for what's going wrong with deployments like this, but I know the default configuration can be a bit sparse...

I'd suggest setting env={"PYTHONUNBUFFERED": "1"} when you create your HuggingFaceModel. This disables Python's output buffering, so logs from a crashing thread or process actually get written to CloudWatch before it dies. If you're deploying directly through the estimator.deploy() shortcut, you'll need to change your code to create a Model first so that you can specify this parameter.
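A rough sketch of what creating the Model explicitly might look like with the SageMaker Python SDK is below. The framework version strings and the instance type are placeholder assumptions to replace with a combination that matches your training setup, and build_model_kwargs and deploy_with_unbuffered_logs are hypothetical helper names.

```python
def build_model_kwargs(model_data, role):
    """Collect constructor arguments for HuggingFaceModel, including env."""
    return {
        "model_data": model_data,            # S3 URI of your model.tar.gz
        "role": role,                        # IAM execution role ARN
        "transformers_version": "4.26",      # assumption: pick your own versions
        "pytorch_version": "1.13",           # assumption
        "py_version": "py39",                # assumption
        "env": {"PYTHONUNBUFFERED": "1"},    # flush logs immediately
    }


def deploy_with_unbuffered_logs(model_data, role, instance_type="ml.m5.xlarge"):
    """Create the Model object first, then deploy it to an endpoint."""
    # Imported here so the helper above can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(**build_model_kwargs(model_data, role))
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

No entry_point is passed here on the assumption that inference.py is already packaged under code/ inside the tarball.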

Deploying a SageMaker endpoint involves 3 API-side steps that the SDK makes a little non-obvious: creating a "Model", an "Endpoint Configuration", and an "Endpoint". To make matters more confusing, creating an SDK object such as HuggingFaceModel doesn't actually create a SageMaker Model yet, because the SDK doesn't have all the information it needs at that point (the container URI is inferred from the instance type, which the SDK only collects when you create a Transformer or Predictor). When retrying with different configurations, check (e.g. in the AWS Console for SageMaker) that your previous model and endpoint configuration are actually being deleted and not reused. Otherwise you might think you're trying different things while seeing the same result.
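To make the three resources concrete, here is a small sketch that removes all of them for one endpoint through the low-level SageMaker API, so a retry starts from a clean slate. The function name is hypothetical; sm_client is assumed to be a boto3 SageMaker client, e.g. boto3.client("sagemaker").

```python
def delete_endpoint_stack(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint configuration and models behind it."""
    # Look up which endpoint configuration the endpoint uses.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]

    # Look up which models the configuration's variants point at.
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Delete in the reverse order of creation.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }
```

The returned dict lists what was removed, which you can compare against what the console still shows.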

Finally, you'll want to avoid waiting several minutes for an endpoint to deploy every time you check whether a new configuration works. I'd recommend verifying that your inference.py functions behave as you expect against a local folder containing the extracted contents of your model.tar.gz. To test the whole stack in an environment where Docker is available, you can use instance_type='local' (SageMaker Local Mode). If you're working in notebooks in SageMaker Studio, note that Local Mode was not supported there in the past, but now is. You just have to enable it and install Docker on the Studio instance.

AWS
EXPERT
Alex_T
answered 4 months ago
