How do I troubleshoot high latency with my Amazon SageMaker endpoint?

I want to troubleshoot high latency with my Amazon SageMaker endpoint.

Short description

Your SageMaker endpoint might experience the following types of latency:

  • Model latency - This is the time the model takes to respond to an inference request. It includes the local communication time to send the request to the model container and fetch the response, and the time the container takes to complete the inference.
  • Overhead latency - This is the time SageMaker takes to respond to an invocation request and excludes model latency.
  • Network latency - This is the time the request takes to travel back and forth between the client and the SageMaker endpoint. Network latency occurs outside the AWS infrastructure.

If your SageMaker endpoint serves a single model, then the following Amazon CloudWatch metrics are available:

  • Model latency
  • Overhead latency

If your SageMaker endpoint is a multi-model endpoint, then the following CloudWatch metrics are also available:

  • Model loading wait time - This metric shows the time an invocation request waits for the target model to be downloaded or loaded before inference is performed.
  • Model downloading time - This metric shows the time the model takes to download from Amazon Simple Storage Service (Amazon S3).
  • Model loading time - This metric shows the time the model takes to load into the container.
  • Model cache hit - This metric shows the number of InvokeEndpoint requests that are sent to the endpoint for which the target model was already loaded.

Note: Multi-model endpoints load and unload models throughout their lifetime. To view the number of loaded models for an endpoint, use the LoadedModelCount CloudWatch metric.
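
To check these latency metrics programmatically, query CloudWatch. The following is a minimal sketch that uses boto3; the endpoint name my-endpoint and the variant name AllTraffic are hypothetical placeholders, so replace them with your own values.

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical names; replace with your endpoint and variant.
ENDPOINT_NAME = "my-endpoint"
VARIANT_NAME = "AllTraffic"

now = datetime.datetime.now(datetime.timezone.utc)

for metric in ("ModelLatency", "OverheadLatency"):
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": ENDPOINT_NAME},
            {"Name": "VariantName", "Value": VARIANT_NAME},
        ],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,  # 5-minute buckets
        Statistics=["Average", "Maximum"],
        Unit="Microseconds",  # both metrics are reported in microseconds
    )
    for point in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```

Both metrics are reported in microseconds, so divide the values by 1,000 before you compare them with millisecond-level client-side measurements. To retrieve percentile statistics such as p99, use the ExtendedStatistics parameter in place of Statistics.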

Resolution

Troubleshoot your high latency based on the following types of latency:

Model latency

To reduce high model latency, take the following actions:

  • To test the model performance, benchmark the model outside of a SageMaker endpoint.
  • If SageMaker Neo supports your model, then compile the model. SageMaker Neo optimizes models to run up to twice as fast with a smaller memory footprint and no loss in accuracy.
  • If AWS Inferentia supports your model, then compile the model for Inferentia. This allows higher throughput at a lower cost per inference.
  • If you use a CPU instance and the model supports GPU acceleration, then move to a GPU instance to accelerate inference.
    Note: The inference code can also affect model latency, depending on how the code handles requests. Any delays in the code increase latency.
  • To dynamically increase and decrease the number of available instances for an endpoint, add auto scaling to the endpoint, as in the sketch after this list. An overloaded endpoint can cause higher model latency.
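
The following is a minimal sketch of the auto scaling setup that uses boto3 and the Application Auto Scaling API. The endpoint name, variant name, capacity limits, and target value are hypothetical placeholders; tune them for your workload.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical names; replace with your endpoint and variant.
RESOURCE_ID = "endpoint/my-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking adds or removes instances to keep the number of
# invocations per instance per minute near the target value.
autoscaling.put_scaling_policy(
    PolicyName="InvocationsPerInstanceTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,  # seconds to wait before scaling in
        "ScaleOutCooldown": 60,  # seconds to wait before scaling out
    },
)
```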

Overhead latency

Factors that contribute to high overhead latency are:

  • Payload size for requests and responses
  • How frequently or infrequently requests are sent
  • Authentication or authorization of the request

Also, the first invocation of an endpoint might have increased latency because of a cold start. To avoid high latency on the first production request, send test requests to pre-warm the endpoint, as in the following sketch.
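
The following is a minimal pre-warming sketch that uses boto3. The endpoint name and JSON payload are hypothetical placeholders; send requests that are representative of your production traffic.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical name and payload; replace with your own.
ENDPOINT_NAME = "my-endpoint"
SAMPLE_PAYLOAD = b'{"instances": [[0.5, 1.2, 3.4]]}'

# Send a few representative requests so that the first production
# request doesn't pay the cold-start cost.
for _ in range(5):
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=SAMPLE_PAYLOAD,
    )
    response["Body"].read()  # drain the response stream
```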

Network latency

To reduce high network latency, take the following actions:

  • Deploy the client application closer to the AWS Region where the SageMaker endpoint is hosted.
  • Optimize the client-side network configurations and internet connectivity.
  • To bring the inference requests closer to the client, use a content delivery network (CDN) or an edge computing solution.

Note: SageMaker can't directly influence network latency. For applications that use SageMaker endpoints, optimize the overall inference latency based on your use case.
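
To estimate the network latency component, measure the round-trip time on the client, and then subtract the ModelLatency and OverheadLatency values that CloudWatch reports for the same period. The following is a minimal sketch; the endpoint name and payload are hypothetical placeholders.

```python
import time

import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical name and payload; replace with your own.
ENDPOINT_NAME = "my-endpoint"
SAMPLE_PAYLOAD = b'{"instances": [[0.5, 1.2, 3.4]]}'

# Measure the total round-trip time from the client. The difference
# between this time and ModelLatency + OverheadLatency in CloudWatch
# approximates the network latency between client and endpoint.
timings_ms = []
for _ in range(20):
    start = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=SAMPLE_PAYLOAD,
    )
    response["Body"].read()
    timings_ms.append((time.perf_counter() - start) * 1000)

timings_ms.sort()
print(f"median round trip: {timings_ms[len(timings_ms) // 2]:.1f} ms")
print(f"p95 round trip:    {timings_ms[int(len(timings_ms) * 0.95)]:.1f} ms")
```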
