How can I serve ML models quickly and with low latency?


Assume a user connects via a WebSocket connection to a server, which serves a personalized TypeScript function based on a personalized JSON file.

So when a user connects,

  • the personalized JSON file is loaded from an S3 bucket (around 60-100 MB per user),
  • whenever the user types, a TypeScript/JavaScript/Python function is executed that returns a reply string and updates the JSON-like data structure, and
  • when the user disconnects, the JSON gets persisted back to the S3-like bucket.

In total, think of around 10,000 users, so roughly 600 GB of data.

It should

  • spin up fast for a user,
  • scale well with the number of users (so that we do not waste money), and
  • have a global latency of a few tens of milliseconds.

Is that possible? If so, what architecture seems to be the most fitting?

1 Answer

To achieve low latency, scalability, and cost efficiency for your use case, consider the following:

  • Use Amazon API Gateway with WebSocket support to handle the WebSocket connections. API Gateway will manage the connections and route messages to the appropriate backend services.
  • For processing personalized JSON files and executing TypeScript/JavaScript/Python code, use AWS Lambda. Lambda allows you to run functions in response to events — in this case, messages coming from API Gateway. This provides fast spin-up, auto-scaling, and cost efficiency.
  • To reduce the latency of reading and writing JSON files, use Amazon ElastiCache (either Redis or Memcached) as an in-memory data store. When a user connects, load the JSON file from S3 into ElastiCache. Perform updates in ElastiCache and persist the data back to S3 when the user disconnects. This approach will significantly reduce latency compared to loading the JSON directly from S3.
  • Store the personalized JSON files in an Amazon S3 bucket.
  • To reduce latency globally, use Amazon CloudFront with Lambda@Edge. With Lambda@Edge, you can run Lambda functions closer to the users, reducing latency.
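The connect → update → disconnect lifecycle described above can be sketched as follows. This is a minimal, self-contained simulation: `fakeS3` and `cache` are in-memory stand-ins for the real S3 and ElastiCache clients (the function names and document shape are assumptions; in production you would swap in the AWS SDK and a Redis client).

```typescript
// Minimal sketch of the per-user session lifecycle.
// fakeS3 and cache stand in for S3 and ElastiCache.
type UserDoc = { history: string[]; replyCount: number };

const fakeS3 = new Map<string, UserDoc>(); // stand-in for the S3 bucket
const cache = new Map<string, UserDoc>();  // stand-in for ElastiCache

// $connect: pull the user's JSON from S3 into the in-memory cache
function onConnect(userId: string): void {
  const doc = fakeS3.get(userId) ?? { history: [], replyCount: 0 };
  cache.set(userId, doc);
}

// message: run the personalized function against the cached document
function onMessage(userId: string, input: string): string {
  const doc = cache.get(userId);
  if (!doc) throw new Error(`no session for ${userId}`);
  doc.history.push(input); // update the JSON-like structure
  doc.replyCount += 1;
  return `reply #${doc.replyCount} to "${input}"`; // the reply string
}

// $disconnect: persist the cached document back to S3 and evict it
function onDisconnect(userId: string): void {
  const doc = cache.get(userId);
  if (doc) fakeS3.set(userId, doc);
  cache.delete(userId);
}
```

Every message only touches the in-memory copy; S3 is hit exactly twice per session (once on connect, once on disconnect), which is what keeps per-keystroke latency low.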

In summary, the architecture would look like this:

Users connect to API Gateway (WebSocket), which triggers Lambda functions. The Lambda functions read/write data from/to ElastiCache (in-memory storage): the JSON is loaded from S3 into ElastiCache when the user connects and written back to S3 when the user disconnects. CloudFront and Lambda@Edge reduce global latency.
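The Lambda entry point for this architecture dispatches on the WebSocket route. API Gateway really does set `requestContext.routeKey` to `$connect`, `$disconnect`, or `$default` on each event; the handler bodies below are placeholders for the load/run/persist steps rather than a working implementation:

```typescript
// Sketch of a Lambda entry point for API Gateway WebSocket events.
// API Gateway sets event.requestContext.routeKey to "$connect",
// "$disconnect", or "$default" (regular messages).
interface WsEvent {
  requestContext: { routeKey: string; connectionId: string };
  body?: string;
}

function handler(event: WsEvent): { statusCode: number; body?: string } {
  switch (event.requestContext.routeKey) {
    case "$connect":
      // here: load the user's JSON from S3 into the cache
      return { statusCode: 200 };
    case "$disconnect":
      // here: persist the cached JSON back to S3
      return { statusCode: 200 };
    default: // "$default": a message from the user
      // here: run the personalized function against the cached JSON
      return { statusCode: 200, body: `echo:${event.body ?? ""}` };
  }
}
```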

answered a year ago
  • Many thanks for getting back to me. Is it really possible to essentially have one Lambda instance per user? Also, couldn't we just store the JSON in the Lambda and persist it back to S3 upon disconnect? Is there some CloudFront/Terraform configuration for setting it up?

  • Lambda functions are stateless and there is no way to assign a specific user to a specific instance. For that reason, as @sdtslmn mentioned, you should save the interim objects outside of the function memory.

    I would also look at using DynamoDB as the interim store instead of ElastiCache, for a fully serverless solution.
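    One caveat with DynamoDB as the interim store: a single item is limited to 400 KB, so a 60-100 MB per-user document cannot be stored as one item and would have to be split across multiple items (or kept in S3 with only the hot keys in DynamoDB). A rough sketch of such chunking, with an in-memory `Map` as a stand-in for the table (the key scheme and helper names are illustrative, not an AWS API):

```typescript
// DynamoDB items max out at 400 KB, so a large JSON blob must be split
// across items keyed by (userId, chunkIndex). The Map below is an
// in-memory stand-in for the table; swap in the AWS SDK for real use.
const CHUNK_SIZE = 400 * 1024; // stay at/under the 400 KB item limit
const table = new Map<string, string>(); // key: `${userId}#${index}`

function putDocument(userId: string, json: string): number {
  let chunks = 0;
  for (let i = 0; i < json.length; i += CHUNK_SIZE) {
    table.set(`${userId}#${chunks}`, json.slice(i, i + CHUNK_SIZE));
    chunks++;
  }
  table.set(`${userId}#meta`, String(chunks)); // remember the chunk count
  return chunks;
}

function getDocument(userId: string): string {
  const chunks = Number(table.get(`${userId}#meta`) ?? 0);
  let json = "";
  for (let i = 0; i < chunks; i++) {
    json += table.get(`${userId}#${i}`) ?? "";
  }
  return json;
}
```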
