SageMaker response streaming with custom inference code


I'm deploying an LLM that uses Hugging Face Transformers on SageMaker with custom inference code (e.g. a custom model_fn, predict_fn, etc.). I know that SageMaker supports response streaming, and I've seen the AWS blog post on how to set it up with the ready-to-go containers, but I'm wondering how to set it up with custom inference code.
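
For context, my model_fn loads the model and tokenizer and returns them as a tuple, roughly like this sketch (the dtype/device settings here are placeholders, not necessarily what I actually use):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    # Load the model artifacts that SageMaker extracts into model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
    return model, tokenizer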

The predict_fn does the actual inference like so:

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    # (simplified: standard tokenize / generate / decode, then return the full text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output

I don't think streaming is possible with this return statement; we would probably want to yield the generated tokens or return a transformers.TextStreamer instead, right? I'm not sure what SageMaker expects from this function when streaming is enabled.
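
Something like the following is what I'm imagining. This is just a sketch, assuming the serving stack will actually iterate over a generator returned by predict_fn (which is exactly the part I'm unsure about); TextIteratorStreamer is a standard transformers class, but the generation kwargs here are placeholder values:

from threading import Thread
from transformers import TextIteratorStreamer

def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Stream decoded text out of generate() as it is produced
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()

    # Yield chunks of generated text as they arrive, but is this what SageMaker expects?
    for text_chunk in streamer:
        yield text_chunk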

Yuri
asked 6 months ago · 137 views
No answers
