Hello,
I'm trying to deploy a HuggingFaceModel to a SageMaker inference endpoint. I've been following a few guides, e.g. this one and this one. My model of choice is Llama-2, fine-tuned on my own data.
I've packed it and created a `model.tar.gz` with the following structure:

```
model.tar.gz/
├── config.json
├── generation_config.json
├── tokenizer.json
├── pytorch_model-00001-of-00003.bin
├── ... (other model files)
└── code/
    ├── inference.py
    └── requirements.txt
```
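For reference, I built the archive roughly like this (a sketch with placeholder files and a hypothetical staging directory; the real artifacts come from my fine-tuning run):

```python
import tarfile
from pathlib import Path

# Hypothetical staging directory mirroring the structure above;
# placeholder files stand in for the real fine-tuned model artifacts.
model_dir = Path("model")
(model_dir / "code").mkdir(parents=True, exist_ok=True)
(model_dir / "config.json").write_text("{}")
(model_dir / "code" / "inference.py").write_text("# custom handlers\n")
(model_dir / "code" / "requirements.txt").write_text("")

# Add the *contents* of model_dir so entries sit at the archive root
# (e.g. "code/inference.py", not "model/code/inference.py").
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for path in sorted(model_dir.rglob("*")):
        if path.is_file():
            tar.add(path, arcname=str(path.relative_to(model_dir)))

with tarfile.open("model.tar.gz", "r:gz") as tar:
    names = tar.getnames()
print("code/inference.py" in names)  # → True
```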
My `inference.py` script defines the functions `model_fn` and `output_fn`, with custom model-loading and output-parsing logic.
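Roughly, the script looks like this (a simplified sketch: the actual loading targets my fine-tuned weights, `transformers` is imported lazily so the sketch is self-contained, and the output schema here is hypothetical):

```python
import json

def model_fn(model_dir):
    # Custom loading logic (sketch): load the fine-tuned weights that
    # SageMaker unpacks from model.tar.gz into model_dir.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    return {"model": model, "tokenizer": tokenizer}

def output_fn(prediction, accept="application/json"):
    # Custom post-processing (sketch): wrap the generated text in my own
    # schema instead of the container's default [{"generated_text": ...}].
    return json.dumps({"answer": prediction, "source": "custom-output_fn"})
```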
I've uploaded this `model.tar.gz` to S3 at `model_s3_path`.
When creating the SageMaker endpoint, I define my HuggingFaceModel as follows:
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel

llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)

huggingface_model = HuggingFaceModel(
    model_data=model_s3_path,
    role=aws_role,
    image_uri=llm_image,
    env={
        'HF_MODEL_ID': 'meta-llama/Llama-2-7b-hf',
        'SM_NUM_GPUS': '1',
        'MAX_INPUT_LENGTH': '2048',
        'MAX_TOTAL_TOKENS': '4096',
        'MAX_BATCH_TOTAL_TOKENS': '8192',
        'HUGGING_FACE_HUB_TOKEN': "<my-hf-token>"
    }
)
```
And then I deploy the model:
```python
huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=300
)
```
However, at inference time the endpoint doesn't use any of the functionality from `inference.py`; it sticks to the default handlers. For instance, it still returns the response as `[{"generated_text": model_response}]`, although my post-processing function (`output_fn`) should have changed the return format.
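To make the mismatch concrete, here is the kind of payload I send and the default response shape I get back, versus the (hypothetical) shape my `output_fn` should produce; the actual endpoint invocation is omitted:

```python
import json

# TGI-style request body sent to the endpoint.
payload = {
    "inputs": "What is SageMaker?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.2},
}
request_body = json.dumps(payload)

# Shape the endpoint actually returns (the container's default):
default_response = [{"generated_text": "Amazon SageMaker is ..."}]

# Shape I expected after my output_fn post-processing (hypothetical schema):
expected_response = {"answer": "Amazon SageMaker is ..."}

print(list(default_response[0].keys()))  # → ['generated_text']
```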
What I've tried so far:

- Setting `entry_point="inference.py"` and `source_dir="./code"` when creating the HF model; with these, the endpoint did not deploy at all.
- Setting the environment variable `"SAGEMAKER_PROGRAM": "inference.py"`; this did not change the model's responses, and the functionality from `inference.py` was still ignored.
- Trying various `image_uri` values; none changed the endpoint's behaviour.