SageMaker inference endpoint with HuggingFaceModel ignores custom inference.py script


Hello, I'm trying to deploy a HuggingFaceModel using a SageMaker inference endpoint. I've been following some guides, e.g. this one and this. My model of choice is Llama-2 fine-tuned on my own data. I've packed it into a model.tar.gz with the following structure:

model.tar.gz/
├── config.json
├── generation_config.json
├── tokenizer.json
├── pytorch_model-00001-of-00003.bin
├── ... (other model files)
└── code/
  ├── inference.py
  └── requirements.txt

My inference.py script defines the functions model_fn and output_fn, with custom model loading and output parsing logic.
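
To give an idea, here is a simplified sketch of the script; the real loading and post-processing logic is more involved, and the bodies below are placeholders following the model_fn/output_fn contract of the SageMaker Hugging Face Inference Toolkit:

# inference.py (simplified sketch)
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    # Custom model loading: model_dir is the unpacked model.tar.gz on the endpoint
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    return model, tokenizer

def output_fn(prediction, accept):
    # Custom output parsing: return the prediction in my own JSON format
    return json.dumps({"answer": prediction})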

I've uploaded this model.tar.gz to S3 at model_s3_path.
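
For reference, the packing and upload step looks roughly like this (local paths and bucket name are simplified placeholders):

import tarfile
from sagemaker.s3 import S3Uploader

# Pack the model artifacts plus the code/ directory into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model_artifacts/", arcname=".")

# Upload the archive; the returned S3 URI is what I pass as model_s3_path
model_s3_path = S3Uploader.upload("model.tar.gz", "s3://<my-bucket>/llama2-finetuned")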

During SageMaker endpoint creation, I define my HuggingFaceModel as follows:

from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel

# Retrieve the Hugging Face LLM container image for the given backend and version
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)

huggingface_model = HuggingFaceModel(
    model_data=model_s3_path,   # S3 URI of the model.tar.gz described above
    role=aws_role,
    image_uri=llm_image,
    env={
      'HF_MODEL_ID': 'meta-llama/Llama-2-7b-hf',
      'SM_NUM_GPUS': '1',
      'MAX_INPUT_LENGTH': '2048',
      'MAX_TOTAL_TOKENS': '4096',
      'MAX_BATCH_TOTAL_TOKENS': '8192',
      'HUGGING_FACE_HUB_TOKEN': "<my-hf-token>"
    }
)

And then I deploy the model:

huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=300
)
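
For context, I invoke the endpoint roughly like this (the prompt and generation parameters are only illustrative):

from sagemaker.huggingface import HuggingFacePredictor

# Attach a predictor to the deployed endpoint and send a test request
predictor = HuggingFacePredictor(endpoint_name=endpoint_name)
response = predictor.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 256}
})
print(response)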

However, during inference, the resulting endpoint doesn't seem to use any of the functionality from inference.py and instead sticks to the default methods. For instance, it still returns the response as [{"generated_texts": model_response}], even though my post-processing function (output_fn) should have changed the return format.

  1. I've tried setting entry_point="inference.py" and source_dir="./code" during the HuggingFaceModel creation (roughly as sketched after this list) - the endpoint failed to deploy at all.
  2. I've set the env variable "SAGEMAKER_PROGRAM": "inference.py" - this did not change the model's responses; the functionality from inference.py was still ignored.
  3. I've tried various image_uri values - this did not change the endpoint's behaviour.
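
The sketch for attempt 1 (env shortened for brevity, values as above):

from sagemaker.huggingface import HuggingFaceModel

# Attempt 1: point the model at the custom handler script explicitly
huggingface_model = HuggingFaceModel(
    model_data=model_s3_path,
    role=aws_role,
    image_uri=llm_image,
    entry_point="inference.py",   # custom handler script inside source_dir
    source_dir="./code",          # local directory with inference.py and requirements.txt
    env={
      'HF_MODEL_ID': 'meta-llama/Llama-2-7b-hf',
      'SM_NUM_GPUS': '1'
    }
)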
Vlad
asked 8 months ago · 348 views
1 Answer
Accepted Answer

Hello Vlad,

Thank you for using AWS SageMaker.

I understand that you are trying to build a custom endpoint that serves requests using a model trained outside SageMaker. The blogs you used as references are third-party blogs, so I won't be able to check internally whether they require any code fixes. To better investigate the issue, we would need more details about the endpoint configuration and certain backend details, along with CloudWatch logs, which would help us understand what could be missing and how to fix it. As this medium is not secure enough to share all of those details, and without them it will be difficult to narrow down the issue, I request that you please create a case with AWS Support so that the available engineers can better assist you in achieving the desired result.

To open a support case with AWS use the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
SUPPORT ENGINEER
answered 8 months ago
