SageMaker response streaming with custom inference code


I'm deploying an LLM that uses Hugging Face transformers on SageMaker with custom inference code (i.e. a custom model_fn, predict_fn, etc.). I know that SageMaker supports response streaming, and I've seen this AWS blog post on how to set it up using the ready-to-go containers, but I'm wondering how to set it up with custom inference code.
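For context, my model_fn loads the model and tokenizer and returns them together as a tuple, roughly like this (a simplified sketch; the actual model and loading options differ):

def model_fn(model_dir):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    # Load the model and tokenizer from the model artifact directory
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    return model, tokenizer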

The predict_fn does the actual inference like so:

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    # Tokenize the prompt and generate the full output in one shot (simplified;
    # the real generation config is omitted here)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output

I don't think streaming is possible with this return statement; we would probably want to yield the generated tokens or return a transformers.TextStreamer instead, right? I'm not sure what SageMaker expects from this function when streaming is enabled.
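For illustration, this is roughly what I imagine a streaming version would look like, using transformers.TextIteratorStreamer and yielding text chunks from predict_fn. I don't know whether the SageMaker inference toolkit will actually consume a generator returned from predict_fn this way, so treat it as a guess:

def predict_fn(data, model):
    from threading import Thread
    from transformers import TextIteratorStreamer

    model, tokenizer = model
    prompt = data.pop("input")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Run generation in a background thread and stream tokens as they arrive,
    # instead of waiting for the full output
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()

    # Assumption: SageMaker would iterate over this generator and stream each chunk
    for text_chunk in streamer:
        yield text_chunk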

Yuri
asked 6 months ago · 131 views
No Answers
