I'm deploying an LLM that uses Hugging Face Transformers on SageMaker with custom inference code (e.g. a custom model_fn, predict_fn, etc.). I know that SageMaker has support for response streaming, and I've seen this AWS blog post on how to set it up using the ready-to-go containers, but I'm wondering how to set it up with custom inference code.
The `predict_fn` does the actual inference like so:
```python
def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")
    ...
    # inference the model, get the full generated output
    ...
    return output
```
I don't think streaming is possible with this return statement; we would probably want to yield the generated tokens, or return a `transformers.TextStreamer`, right? I'm not sure what SageMaker expects from this function when streaming is enabled.
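To make the question concrete, here is a rough sketch of the kind of generator-based `predict_fn` I have in mind, using `transformers.TextIteratorStreamer` with generation running in a background thread. The specifics (e.g. `max_new_tokens=256`, yielding decoded text chunks) are just illustrative assumptions on my part; I don't know whether the SageMaker serving stack will actually accept a generator return value here, or how it would turn one into a streamed response:

```python
from threading import Thread

from transformers import TextIteratorStreamer


def predict_fn(data, model):
    model, tokenizer = model
    prompt = data.pop("input")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # Run generation in a background thread so the streamer can be consumed
    # while tokens are still being produced.
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Yield decoded text chunks as they become available -- but does the
    # SageMaker handler know what to do with a generator return value?
    for chunk in streamer:
        yield chunk
```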