SageMaker response streaming with custom inference code


I'm deploying an LLM that uses Hugging Face transformers on SageMaker with custom inference code (e.g. a custom model_fn, predict_fn, etc.). I know that SageMaker supports response streaming, and I've seen this AWS blog post on how to set it up using the ready-to-go containers, but I'm wondering how to set it up with custom inference code.

The predict_fn does the actual inference like so:

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    # ... run inference, collect the full generated output ...
    return output

I don't think streaming is possible with this return statement; we would probably want to yield the generated tokens, or return a transformers.TextStreamer, right? I'm not sure what SageMaker expects from this function when streaming is enabled.
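For what it's worth, yielding is the usual shape: a streaming predict_fn would return a generator of tokens rather than a single string, with generation running in a background thread that pushes tokens into a queue (this is essentially what transformers.TextIteratorStreamer does when passed as `streamer=` to `model.generate`). Whether the serving container actually forwards each yielded chunk to the client depends on the toolkit version, so treat this as a sketch; the `_fake_generate` helper below is a hypothetical stand-in for the real `model.generate(..., streamer=...)` call so the pattern can run without a model:

```python
import threading
import queue

_SENTINEL = object()  # marks end of generation

def _fake_generate(prompt, token_queue):
    # Hypothetical stand-in for model.generate(..., streamer=TextIteratorStreamer(...)):
    # the real call would push decoded tokens into the streamer as they are produced.
    for token in prompt.split():
        token_queue.put(token)
    token_queue.put(_SENTINEL)

def predict_fn(data, model_and_tokenizer):
    # Yields generated tokens incrementally instead of returning the full output.
    prompt = data.pop("input")
    token_queue = queue.Queue()
    worker = threading.Thread(target=_fake_generate, args=(prompt, token_queue))
    worker.start()
    while True:
        token = token_queue.get()
        if token is _SENTINEL:
            break
        yield token
    worker.join()

tokens = list(predict_fn({"input": "hello streaming world"}, None))
print(tokens)  # ['hello', 'streaming', 'world']
```

With the real library you would replace the queue plumbing by iterating directly over a TextIteratorStreamer while `model.generate` runs in the thread; the generator-of-tokens interface of predict_fn stays the same.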

Yuri
asked 6 months ago · 138 views
No answers
