Design questions on ASG, backup/restore, EBS and EFS

Question

Hi experts,
We are designing a deployment of a BI application on AWS. By default we repave the EC2 instances every 14 days, which rebuilds every instance in the cluster with its services and brings it back to the last known good state. We want a solution with no or minimal downtime.
The application has different services provisioned on different EC2 instances. The first server acts as the main node, and the rest are additional nodes with different services running on them. We install all the additional nodes the same way but configure their services later, in CodeDeploy.
1. Can we use an ASG? If yes, how can we distribute the topology? That is, if one of 5 instances is repaved, the replacement should come up with the same services as the previous one. Is there a way to label an instance in the ASG so that it is configured for a certain service?
2. Each server should have its own EBS volume and store some data on it. What is the fastest way to copy or attach the EBS volume to the newly repaved server without downtime?
3. For shared data we want to use EFS.
4. For metadata from the embedded Postgres, we need to take a backup periodically and restore it after a repave (creating a new instance with the install and the same service). How can we achieve this without downtime?

We do not want to use a customized AMI, because we have a heavy process for AMI creation and we would need to change the AMI often if the install and config were baked into it.

Sorry if this is a lot to answer. Any guidance would be helpful.

Accepted Answer

Hello!

There are a few different things going on here, but before I answer your questions I'd like to point out that a lot of these are anti-patterns for the way ASGs are designed to be used. ASGs and most of their features assume a group of stateless worker nodes that are all identical and ephemeral. While it's not impossible to do what you're asking for, it's going to take a lot more work than setting up a fleet of webservers (either work in setting up customizations, or in reworking the application to be more stateless/cloud native).

1. Can we use an ASG? If yes, how can we distribute the topology? That is, if one of 5 instances is repaved, the replacement should come up with the same services as the previous one. Is there a way to label an instance in the ASG so that it is configured for a certain service?  
* ASGs launch identical instances every time (the same AMI and userdata run on every launch). If you have different services running on different servers, you would need to use multiple ASGs (or have a service like ECS/EKS running multiple container services on top of the ASG instances). If it's the same services running on every instance, then you could create a userdata script to install and/or start those services when the new instances launch.
2. Each server should have its own EBS volume and store some data on it. What is the fastest way to copy or attach the EBS volume to the newly repaved server without downtime?
* ASG instances aren't designed to store any unique data on them. The main workarounds are:
  * Store the data on secondary EBS volumes, and have the new instances re-attach those volumes when they launch as part of userdata
  * Store the data on EFS
  * Store the data in a backend data tier and keep nothing important on the instance itself (RDS, DynamoDB, etc)
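
For the first workaround, a minimal Python sketch of the re-attach step might look like the following. It assumes each data volume carries a tag naming the node role it belongs to; the tag key `NodeRole`, the device name `/dev/sdf`, and both function names are hypothetical. Keep in mind that an EBS volume can only be attached to an instance in its own Availability Zone, so the replacement instance has to launch in the same AZ as the volume.

```python
from typing import Optional


def pick_volume(volumes: list, role: str, az: str) -> Optional[str]:
    """Return the ID of an unattached volume tagged for `role` in `az`, or None.

    `volumes` has the shape of the "Volumes" list returned by EC2 DescribeVolumes.
    """
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        if (
            tags.get("NodeRole") == role            # hypothetical tag key
            and vol.get("AvailabilityZone") == az   # volumes are AZ-scoped
            and vol.get("State") == "available"     # not attached elsewhere
        ):
            return vol["VolumeId"]
    return None


def reattach(instance_id: str, role: str, az: str, device: str = "/dev/sdf") -> str:
    """Attach the role's data volume to a freshly launched instance."""
    import boto3  # imported lazily so pick_volume can be tried without AWS access

    ec2 = boto3.client("ec2")
    found = ec2.describe_volumes(
        Filters=[{"Name": "tag:NodeRole", "Values": [role]}]
    )["Volumes"]
    volume_id = pick_volume(found, role, az)
    if volume_id is None:
        raise RuntimeError(f"no available volume for role {role!r} in {az}")
    ec2.attach_volume(Device=device, InstanceId=instance_id, VolumeId=volume_id)
    return volume_id
```

The volume only shows as `available` once the old instance has detached it or been terminated, so this approach still implies a short gap for that node while the volume moves over.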

3. For shared data we want to use EFS.
* This is a good idea. Here's some [userdata for mounting the EFS volume](https://docs.aws.amazon.com/efs/latest/ug/wt2-apache-web-server.html#wt2-apache-web-server-auto-scale-group).

4. For metadata from the embedded Postgres, we need to take a backup periodically and restore it after a repave (creating a new instance with the install and the same service). How can we achieve this without downtime?
* Is there a reason you need Postgres to be running inside each instance? Generally with ASGs it's better to have separate external databases so that there isn't any accidental data loss (ASG instances should be treated as if they can be terminated at any time, for example if a health check replacement happens due to an underlying hardware failure). This is also more scalable, since you're scaling the worker instances/application separately from the database, and the two often have very different resource usage requirements.

We do not want to use a customized AMI, because we have a heavy process for AMI creation and we would need to change the AMI often if the install and config were baked into it.
* You can have a base AMI and make changes in userdata. Or, for even fewer launch template updates, you could have the userdata point at an external repository to download the applications. (Don't use launch configurations, [they're deprecated](https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/).)
* You mentioned CodeDeploy also.  CodeDeploy will start deploying almost immediately when the instance is launched, so be careful about race conditions where UserData and CodeDeploy are both trying to run things at the same time.  The two aren't going to be aware of each other.
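
To illustrate the base-AMI-plus-userdata idea combined with one ASG per node role, here is a rough Python sketch of how a bootstrap script could decide which services to configure. It assumes each ASG propagates a role tag to the instances it launches; the tag key `NodeRole`, the role names, and the service names are all made up for the example, since the real ones depend on the BI product.

```python
# Hypothetical role -> services mapping; real service names come from the BI product.
SERVICES_BY_ROLE = {
    "main": ["gateway", "repository"],
    "worker": ["backgrounder"],
}


def role_from_tags(tags: list, key: str = "NodeRole") -> str:
    """Extract the role from a list of {"Key": ..., "Value": ...} tag dicts."""
    for tag in tags:
        if tag["Key"] == key:
            return tag["Value"]
    raise LookupError(f"instance has no {key!r} tag")


def services_for(role: str) -> list:
    """Return the services a node with this role should run."""
    if role not in SERVICES_BY_ROLE:
        raise ValueError(f"unknown role {role!r}")
    return SERVICES_BY_ROLE[role]


def lookup_role(instance_id: str) -> str:
    """Read the role tag that the ASG propagated to this instance."""
    import boto3  # imported lazily; needs ec2:DescribeTags on the instance role

    ec2 = boto3.client("ec2")
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    return role_from_tags(tags)
```

Whichever tool ends up configuring the services (userdata or CodeDeploy), keeping the role decision in one place like this avoids the two stepping on each other.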

Repave the EC2 instances every 14 days, which rebuilds every instance in the cluster with its services and brings it back to the last known good state. We want a solution with no or minimal downtime.
* You can use an [Instance Refresh](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html) to replace instances in the ASG in batches
* You can set a [Max Instance Lifetime](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-max-instance-lifetime.html), and Auto Scaling will replace instances when they are close to this maximum age. This fits if you don't want any single instance to be older than 2 weeks, but aren't trying to replace them all on a set schedule.
* Have a plan to drain the existing requests running on the instances before they're terminated.  If you're using an ELB, the deregistration delay should take care of this for you for most types of applications.
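
As a rough sketch of the first two options using boto3, the requests could be built as below. The helper names, the 80% minimum healthy percentage, and the 300 second warmup are placeholder assumptions to tune for your application; the request fields themselves are the ones the Auto Scaling API accepts for `StartInstanceRefresh` and `UpdateAutoScalingGroup`.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def refresh_request(asg_name: str, min_healthy_pct: int = 80, warmup_s: int = 300) -> dict:
    """Build the arguments for a rolling instance refresh of one ASG."""
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": min_healthy_pct,  # capacity kept in service
            "InstanceWarmup": warmup_s,               # seconds before a new node counts as ready
        },
    }


def lifetime_request(asg_name: str, days: int = 14) -> dict:
    """Build the arguments that cap instance age (the API takes seconds)."""
    return {
        "AutoScalingGroupName": asg_name,
        "MaxInstanceLifetime": days * SECONDS_PER_DAY,
    }


def repave(asg_name: str) -> str:
    """Start a rolling replacement and return the refresh ID."""
    import boto3  # imported lazily so the builders above work without AWS access

    client = boto3.client("autoscaling")
    response = client.start_instance_refresh(**refresh_request(asg_name))
    return response["InstanceRefreshId"]
```

For the max-lifetime route, the second builder would be passed to `update_auto_scaling_group` once, after which Auto Scaling handles the replacements on its own.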
