Spot rebalance recommendation terminates old instance before new one is ready

We use Beanstalk to manage spot EC2 instances through an Auto Scaling Group.

Yesterday we had tens of rebalance recommendations, and they caused problems: although the documentation says the old instance is only terminated after the new one becomes healthy, that didn't seem to be the case. This is repeatable using the Fault Injection Service.

Screenshot

In this screenshot, the top instance (ending 195) is being added as a result of the second instance (ending e98) having a rebalance recommendation.

The ASG has a minimum of 4 machines, so I was expecting the second instance (ending e98) to keep accepting requests until the top instance shows as healthy. However, as the screenshot shows, it starts draining very quickly: about 30 seconds after the new instance is created, even though the new instance is still failing health checks while it sets itself up.

The activity log says "At (date) an instance was taken out of service in response to an EC2 instance rebalance recommendation" well before the new instance was healthy and serving data (that happened about 3 minutes later).

We have a terminating lifecycle hook set to 30 minutes, so I assume that's not related.

Does anyone know what we could be doing wrong?

Thanks, Ed

Ed
asked 11 days ago, 120 views
1 Answer

EDIT: The issue turned out to be that the Health Check Type on the ASG was set to "EC2" only and needed "ELB" health checks enabled as well. The original answer and comment thread are kept below for additional info/context.
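For anyone who wants to check this from a script rather than the console, here is a minimal sketch, not a definitive implementation. It assumes a boto3 Auto Scaling client is passed in; the function name `ensure_elb_health_check` and the 300-second grace period are hypothetical choices for illustration, while `describe_auto_scaling_groups` and `update_auto_scaling_group` are the real boto3 calls. On Beanstalk, a change made directly on the ASG may not survive environment updates, which is why the comment thread below mentions an EBExtension.

```python
def ensure_elb_health_check(client, asg_name, grace_period_seconds=300):
    """Enable ELB health checks on an ASG if they are not already on.

    `client` is expected to behave like boto3.client("autoscaling").
    Returns the health check type that was configured before the call.
    The 300 second grace period is an illustrative assumption; pick a
    value that covers how long your instances take to set themselves up.
    """
    response = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"No Auto Scaling group named {asg_name!r}")

    current = groups[0]["HealthCheckType"]
    # The type may be a comma-separated list, so test for membership.
    if "ELB" not in current.split(","):
        client.update_auto_scaling_group(
            AutoScalingGroupName=asg_name,
            HealthCheckType="ELB",
            HealthCheckGracePeriod=grace_period_seconds,
        )
    return current


# Usage (requires boto3 and AWS credentials; the group name is a placeholder):
#   import boto3
#   previous = ensure_elb_health_check(boto3.client("autoscaling"), "my-asg-name")
#   print("Health check type was:", previous)
```

With the type left at "EC2", the ASG never looks at the load balancer's view of the new instance, so it can treat the instance as healthy while the ELB is still failing it.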

Hi Ed,

This likely isn't caused by any config setting on your side. The one thing you should check is whether an Instance Maintenance Policy is set on the ASG. It shows up in the "Instance maintenance policy" section near the bottom of the Details tab for the ASG in the console. If the feature is enabled and the Min healthy percentage is less than 100%, it overrides the default "launch before terminate" behavior of Rebalance Recommendations. If you set the Fault Injection Service test to use a much longer time before the notification comes in, you should see the launch happen and finish before the termination workflow starts. That would confirm everything is set up correctly on your end.
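The same check can be sketched in code. This is only an illustration of the rule described above, assuming `group` is one entry from the `AutoScalingGroups` list returned by boto3's `describe_auto_scaling_groups`. The function name is hypothetical, and treating a missing policy or a value of -1 as "no policy set" is an assumption.

```python
def overrides_launch_before_terminate(group):
    """Return True if the group's Instance Maintenance Policy would override
    the default launch-before-terminate behavior, i.e. a policy is set and
    its minimum healthy percentage is below 100.

    `group` is a dict shaped like one element of
    describe_auto_scaling_groups()["AutoScalingGroups"].
    """
    policy = group.get("InstanceMaintenancePolicy") or {}
    min_healthy = policy.get("MinHealthyPercentage")
    # Assumption: an absent key or a negative value means no policy is set.
    if min_healthy is None or min_healthy < 0:
        return False
    return min_healthy < 100


# Usage (requires boto3 and AWS credentials; the group name is a placeholder):
#   import boto3
#   client = boto3.client("autoscaling")
#   group = client.describe_auto_scaling_groups(
#       AutoScalingGroupNames=["my-asg-name"])["AutoScalingGroups"][0]
#   print(overrides_launch_before_terminate(group))
```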

Otherwise, what's likely happening is that the Spot 2-minute warning is coming in relatively soon after the Rebalance Recommendation (or possibly even at the same time). As soon as the 2-minute warning comes in, the ASG starts trying to gracefully terminate the instance to reduce the impact of Spot terminating it 2 minutes later, regardless of whether the new instance has launched or is healthy.

Nothing in the ASG will prevent Spot from terminating the instance when Spot wants to reclaim it, even if the ELB is still deregistering it or the lifecycle hook is still in progress. Otherwise, you would be able to delay Spot taking the instance back for hours or days with a hook, and Spot is built around EC2 being able to get the capacity back when it needs it.

If the time to deregister the instance and complete the lifecycle action takes too long, the instance might be interrupted while Amazon EC2 Auto Scaling waits for your lifecycle action to complete before terminating the instance.


If the application always needs anywhere close to 30 minutes to gracefully drain, Spot might not be ideal for it until the application can be refactored to reduce that time.

Since you have a min of 4, you may want to look into setting the OnDemandBaseCapacity of the ASG to 4, with everything above the base being Spot, and purchasing a Compute Savings Plan (CSP) for the cost of the 4 instances. That way you know there should be at least 4 stable instances running. The instance type isn't mentioned, but I currently see many instance types in eu-west-2 that have similar (or sometimes even better) pricing on a 3-year CSP than on Spot.
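As a rough sketch of that split, the helper below builds the `MixedInstancesPolicy` fragment that boto3's `update_auto_scaling_group` accepts. The function name is hypothetical, and the sketch assumes the group already uses a mixed instances policy. Since Beanstalk manages the ASG, it is probably safer to make the equivalent change through the Beanstalk environment configuration than directly on the group.

```python
def on_demand_base_policy(base_capacity, on_demand_percent_above_base=0):
    """Build a MixedInstancesPolicy fragment with an On-Demand base.

    base_capacity: number of instances that must be On-Demand.
    on_demand_percent_above_base: share of capacity above the base that is
    On-Demand; 0 means everything above the base is Spot.
    """
    if base_capacity < 0:
        raise ValueError("base_capacity must be >= 0")
    if not 0 <= on_demand_percent_above_base <= 100:
        raise ValueError("on_demand_percent_above_base must be in 0..100")
    return {
        "InstancesDistribution": {
            "OnDemandBaseCapacity": base_capacity,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent_above_base,
        }
    }


# Usage (requires boto3 and AWS credentials; the group name is a placeholder):
#   import boto3
#   boto3.client("autoscaling").update_auto_scaling_group(
#       AutoScalingGroupName="my-asg-name",
#       MixedInstancesPolicy=on_demand_base_policy(4),
#   )
```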

AWS
answered 11 days ago
  • Thanks for the very comprehensive answer - really appreciated. Sadly the min percentage is already 100%, and when using the fault simulator, I was testing with a 15 minute warning so I could see the new instance come in, but it wasn't ready before the old one moved to terminating and/or was deregistered (I can't tell which happened first - the deregistering or terminating) and then well after that happened, the 2 minute warning came in. Feels like a bug, maybe...???

    (The 30 min lifecycle is for our on-demand ones to give them a chance to finish the bigger jobs they can do - the spot ones don't need the 30 mins but the time has to be the same, I think).

  • Hmm, interesting. With a 15 minute warning, I would expect termination to start after 13 minutes if the new instance isn't up yet. I'll need to check what exact condition the ASG uses to determine 'healthy' for the purposes of this process and get back to you. Can you check if the Health Check Type in the ASG is set to EC2 or ELB? Last time I used Beanstalk, it defaulted to EC2, and you could only change it to ELB with an EBExtension (it's been a while though). If your ASG is set to EC2 as the health check type, then it won't ever look at the ELB's health state.

  • Thanks for the reply - changing the ASG to use ELB health checks too has worked perfectly - thank you so much (and your comment about the 3-year discount is very valid also - basically identical price to spot it seems!!)

    (I'm not sure what the right thing to do RE accepting the original answer is though - do let me know!).

  • Glad to hear it worked! These comments are all linked to the original answer, so you can just accept it and they'll be visible to anyone who finds this in the future. I'm also editing the answer with the final outcome so others don't need to go through all the comments. Thanks for both accepting the answer, and coming back to confirm it worked! Always nice to close the loop on these things :D
