EFS mount on ec2 not tolerant of zone failure


Hi, I use an EFS mount on my EC2 instances, which are part of an Auto Scaling group. Whilst doing some housekeeping I spotted that the EFS file system was spread across three subnets that I didn't intend. I used the Manage network option within the EFS console to remove a zone and then add it back with the correct subnet. This caused the mount on my EC2 instances to become unresponsive. A simple reboot resolved the issue. This got me thinking about what would happen with a zone failure, so I did some tests. It appears that if you remove a zone from an EFS file system and the EC2 mount isn't using that zone, everything is OK. However, if that zone is currently being used, the EC2 mount doesn't fail over to another zone and becomes unresponsive. Is there any way to mitigate a zone failure on an EFS mount, or is this a single point of failure in a system using EFS mounts?

My fstab entry options are: nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0

Thanks for any advice.

  • What DNS hostname is your mount using?

  • The DNS name is fs-0d2<obfuscated>5b.efs.eu-west-1.amazonaws.com

asked 2 years ago, 449 views
3 Answers

Hi, perhaps this document would provide more clarity: https://docs.aws.amazon.com/efs/latest/ug/how-it-works.html#how-it-works-conceptual In particular, see the part "We recommend that you access the file system from a mount target within the same Availability Zone for performance and cost reasons." In addition, cross-AZ mounts (EC2 in one AZ and mount target in another) would also reduce availability: if either of those AZs goes down, your application's availability will be impacted. That's the reason the EC2 instance and mount target should be in the same AZ (static), and why any change to the network settings of the mount target would make the mount unresponsive on EC2. For any zonal failure, EC2 instances and mount targets in another AZ should be able to pick up the additional load.

answered 2 years ago

The challenge here is the interaction between DNS names, IP addresses and applications - in this case, the application is the NFS client.

The intention behind using a DNS name rather than an IP address for a resource (and this is regardless of application) is that the DNS lookup can return multiple IP addresses. The application then chooses one and connects to it. The choice is normally random, although the application might have some mechanism for preferring something "closer" to itself, such as an address on the local network; most of the time this doesn't happen though. If the chosen IP address doesn't respond, the application can try another from the list that was returned in the first place.

Note that this choice and the retries (in the event of a failure) might be done at an operating system level rather than within the application. But either way, it works pretty much the same.
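To make the pick-one-then-fall-back idea concrete, here is a minimal Python sketch. It is a generic illustration only, not how the NFS client or EFS behaves: the function names (`resolve_ipv4`, `tcp_probe`, `first_responsive`) are made up for this example, and a plain TCP connection to port 2049 (the standard NFS port) stands in for "the address responds".

```python
import socket


def resolve_ipv4(hostname, port):
    """Return the distinct IPv4 addresses a DNS lookup yields for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def tcp_probe(address, port, timeout_seconds=3.0):
    """Return True if a TCP connection to address:port opens within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def first_responsive(addresses, port, probe=tcp_probe):
    """Try each address in order and return the first one the probe accepts."""
    for address in addresses:
        if probe(address, port):
            return address
    return None  # nothing in the list answered
```

Something like `first_responsive(resolve_ipv4(name, 2049), 2049)` would then pick a usable address at connection time. Note that this only covers the initial choice; it says nothing about what happens when an address that was working stops responding later on.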

But what happens when the application connects to an IP that works; but later on that IP stops working? The application has to use one of the other IP addresses; it might have them cached or it might do a DNS lookup again. That's not really the issue - the challenge is, how long does it take for the application to figure out that the IP address it is communicating with is not responding? And in that case, does it do another DNS lookup to try another IP address; or does it just retry the existing IP address forever? At what point does it give up?

In this case, the NFS client might eventually figure out that the original IP address isn't responding and try another. But it also might not. From this distance, it's not possible to say because I don't know what NFS client is being used; and even if I did it's more a question for the developers/designers of that client. The timeout might even be adjustable - again, a question for the developers of the client.

You might find that rather than a reboot it would be possible to reset just the NFS client by sending a signal/interrupt of some sort. But again, you'd need to reliably detect that the endpoint wasn't responding.
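As a rough idea of what that detection could look like, the sketch below checks whether a path under the mount answers within a time limit. It is an assumption-laden example rather than a tested recipe: `path_responds` is a name invented here, the probe is just `os.stat`, and the check runs in a daemon thread because a call against a hung `hard` mount may block indefinitely.

```python
import os
import threading


def path_responds(path, timeout_seconds=5.0, probe=os.stat):
    """Return True if probe(path) completes without error within the timeout."""
    result = {"ok": False}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            # Path missing or I/O error: treat as not healthy.
            result["ok"] = False

    # Daemon thread so a probe stuck on a hung mount does not block interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    return (not thread.is_alive()) and result["ok"]
```

A cron job or health check script could call this against the mount point (whatever path you mount the file system on) and mark the instance unhealthy, or trigger a remount, when it returns False. The right timeout depends on your mount options and would need testing.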

AWS
answered 2 years ago

Thanks to both of you for taking the time to give me your thoughts. This EFS file system is shared across EC2 instances in an ASG. The instances have Apache2 running, so I think I'll move the health.html file for the health check from the AMI and put it on the EFS mount. This should mitigate against an EC2 instance that isn't fully functional remaining in the ASG.

answered 2 years ago
