Step Functions StartExecution latency via AWS SDK


Hi,

I have a Step Functions Express state machine for which I start executions with AWS SDK for PHP (StartExecution API). My code is running on an EC2 instance (Docker container on t3.micro) in a load balanced Beanstalk application. For the API call to start an execution, the total time it takes (everything included) is between 155ms and 500ms. The average is around 200ms. This is quite high and is a problem for us. My first question is if this is unusually high, or if this is normal?

I tried starting the same workflow through API Gateway and saw roughly the same response times (or maybe slightly lower). I also tried using the PutItem API for a DynamoDB table and saw an average of around 200ms. Am I correct in assuming that these numbers should be lower?

If my assumption is correct, I am thinking that maybe this is caused by the network path from my EC2 instance to the AWS API. My Beanstalk application is not using a VPC (though the EC2 instance is in the default VPC). Perhaps things could be improved by using a VPC and PrivateLink (VPC interface endpoint)? https://docs.aws.amazon.com/step-functions/latest/dg/vpc-endpoints.html

So:

  1. Is an average of 200ms unusually high or is this to be expected?
  2. If it is unusually high, should I expect using VPC/PrivateLink to improve this?
  3. Which response times (everything included) should I expect (roughly/ballpark)?

Thanks a lot!

1 Answer
Accepted Answer

Using a VPC endpoint here isn't going to affect the latency in any noticeable way so I'd not recommend it unless you would like to use a private endpoint for other reasons (for example, reducing traffic through a NAT Gateway).

We don't provide any guidance as to the latency (shortest, longest, average) on API calls to AWS services. If you believe there is a problem the best suggestion I have is to raise a support case because the team can look at the Step Function as configured and determine if there are any problems with the service itself.

I'd also add that all AWS services are multi-tenant (in some way or another), and there will be variances in the time it takes to perform actions (API calls are a good example of this). We do our absolute best to ensure that all customers have "fair" access to the services and to reduce the effect of "noisy neighbours", but it is inevitable in large-scale systems that the latency of some operations will vary sometimes.

You say "this (latency) is quite high and is a problem for us". What sort of latency do you require? And what is the Step Function doing? It might be the case that this isn't the best solution for the problem.

Answered a year ago by AWS (EXPERT)
  • Thank you for the response. We need to ingest analytics data from many browsers and start an SFN execution for each event that is received. Currently that would add ~200ms response time to our API, which would make it slow. We also need to access DynamoDB, which is also slow (~150ms). The SFN workflow just does a bit of processing with Lambda functions. I was hoping for <100ms response times on average. I also tried testing with the AWS CLI but saw the same response times.

  • Curiously, running a DynamoDB query within AWS Console (which uses AWS JS SDK) is fast (30-40ms), but running the exact same query with AWS CLI gives a response time of ~150ms. The Step Functions API seems slow even in the AWS Console (inspecting my browser's network tab).

  • There are a lot of things that can add up to create latency. For example: When calling AWS services the underlying libraries have to set up a TCP connection (so there's a three-way handshake there) and then perform TLS negotiation (quite a few packets) and only then can the request be sent. You might consider writing the code such that it has those connections open all the time (this might be done by calling a "read" API when initialised) so that your other calls are then faster.

  • Yes, that is actually the conclusion I have come to as well. Because the AWS SDK is on a web server (event loop), a new TCP connection is created for each call. I will try to optimize this by keeping the HTTP connections alive after the first request by using an Envoy proxy or similar. Hopefully that will noticeably improve the time spent sending the API requests. Thanks a lot for the input, it is much appreciated!
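The connection-reuse point from the comments above can be illustrated with a minimal sketch. It is written in Python rather than with the AWS SDK for PHP, it talks to a throwaway local HTTP server instead of an AWS endpoint, and it uses plain HTTP, so the TLS negotiation mentioned above is not reproduced. All names here (`CountingHandler`, `call_with_fresh_connections`, `call_with_reused_connection`, `count_connections`) are hypothetical and exist only for this illustration. Rather than timing anything, it counts how many TCP connections the server accepts under each strategy, since every extra connection against a real endpoint would pay for another handshake.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Local stand-in for a remote API; counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections alive
    connections = 0
    _lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        super().setup()
        with CountingHandler._lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def call_with_fresh_connections(host, port, calls):
    """Open and close a new connection for every request."""
    for _ in range(calls):
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()


def call_with_reused_connection(host, port, calls):
    """Send every request over one kept-alive connection."""
    conn = http.client.HTTPConnection(host, port)
    for _ in range(calls):
        conn.request("GET", "/")
        conn.getresponse().read()  # drain the body before the next request
    conn.close()


def count_connections(calls):
    """Return (connections used without reuse, connections used with reuse)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        CountingHandler.connections = 0
        call_with_fresh_connections(host, port, calls)
        fresh = CountingHandler.connections

        CountingHandler.connections = 0
        call_with_reused_connection(host, port, calls)
        reused = CountingHandler.connections
    finally:
        server.shutdown()
        server.server_close()
    return fresh, reused


if __name__ == "__main__":
    fresh, reused = count_connections(5)
    print(f"connections without reuse: {fresh}, with reuse: {reused}")
```

With five calls, the first strategy opens five connections and the second opens one. Whether a given SDK or runtime actually keeps connections open between calls depends on how the process and its HTTP client are managed, so it is worth confirming this for your own setup before adding a proxy.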
