SageMaker response streaming with custom inference code


I'm deploying an LLM that uses Hugging Face Transformers on SageMaker with custom inference code (e.g. a custom model_fn, predict_fn, etc.). I know that SageMaker supports response streaming, and I've seen this AWS blog post on how to get it set up using ready-to-go containers, but I'm wondering how to set it up with custom inference code.
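For context, on the client side I'm planning to consume the stream with boto3's invoke_endpoint_with_response_stream; a rough sketch of what I mean (the endpoint name and payload shape are just placeholders matching my inference code):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="my-llm-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"input": "Tell me about SageMaker streaming"}),
)

# The body is an event stream; each event carries a chunk of the generated text
for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)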

The predict_fn does the actual inference like so:

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    ...
    # run inference on the model, get the full generated output
    ...
    return output
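The model_fn that pairs with it just loads the artifacts and returns a (model, tokenizer) tuple; roughly like this (the model class, dtype, and device placement are only illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    # Load the model and tokenizer from the model artifacts directory;
    # predict_fn unpacks this tuple as (model, tokenizer)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
    model.to("cuda")
    return model, tokenizer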

I don't think streaming is possible with predict_fn returning the full output like this; we would probably want to yield the generated tokens or return a transformers.TextStreamer, right? I'm not sure what SageMaker expects from this function when streaming is enabled.
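To make the idea concrete, here is a rough sketch of what I imagine a streaming predict_fn could look like if SageMaker accepts a generator, using transformers.TextIteratorStreamer with generation running in a background thread (max_new_tokens and the other generation settings are placeholders). I don't know whether the inference toolkit actually consumes the function this way, which is exactly what I'm asking:

import threading

from transformers import TextIteratorStreamer

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Streams decoded text chunks as generate() produces tokens
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)

    # generate() blocks until completion, so run it in a background thread
    # and yield chunks from the streamer as they become available
    thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for chunk in streamer:
        yield chunk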

Yuri
Asked 6 months ago · 138 views
No answers
