Site-to-Site VPN - Both Tunnels Flapping Simultaneously

We have an AWS Site-to-Site VPN connection to our on-premises data center. For the past week, we've been experiencing intermittent outages where both VPN tunnels go down at the same time and recover 25-45 minutes later.

CloudWatch TunnelState metric shows both tunnels dropping to 0 simultaneously and recovering together. This happens 2-3 times per day, usually lasting 25-45 minutes.

Our setup:

  • VPN connection with 2 tunnels (as standard)
  • Customer Gateway: Cisco ASA 5525-X
  • DPD (Dead Peer Detection) is enabled
  • Both tunnels are configured as active/active
  • IKEv2 with AES-256-GCM

We initially suspected AWS maintenance, but AWS support confirmed there were no maintenance events during the outage windows. We also checked with our ISP, which reported no outages.

What could cause BOTH tunnels to fail simultaneously? They terminate on different AWS endpoints, right?

2 Answers
Accepted Answer

You're correct that the two tunnels terminate on different AWS endpoint IPs. This is the key diagnostic clue: if both tunnels fail simultaneously despite terminating on separate AWS infrastructure, the single common point of failure is the Customer Gateway (CGW) side.

Both tunnels originate from the same CGW IP (your Cisco ASA's public interface). Anything that disrupts connectivity from that single IP to the internet will take down both tunnels at the same time, even though the AWS endpoints are independent.
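
To confirm that the drops really coincide rather than merely landing close together, one option is to export the TunnelState datapoints for each tunnel and intersect the down periods. A minimal Python sketch, assuming each series has already been pulled from CloudWatch as a sorted list of (timestamp, value) pairs where 0 means down; the function names are illustrative and not part of any AWS SDK:

```python
def down_intervals(samples):
    """Collapse sorted (timestamp, state) samples into (start, end) spans where state == 0.

    An outage still open at the last sample is closed at that sample's timestamp.
    """
    intervals, start = [], None
    for ts, state in samples:
        if state == 0 and start is None:
            start = ts                      # outage begins
        elif state != 0 and start is not None:
            intervals.append((start, ts))   # outage ends at first "up" sample
            start = None
    if start is not None:
        intervals.append((start, samples[-1][0]))
    return intervals


def shared_outages(tunnel_a, tunnel_b):
    """Return the time ranges during which both tunnels were down at once."""
    overlaps = []
    for a_start, a_end in down_intervals(tunnel_a):
        for b_start, b_end in down_intervals(tunnel_b):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps
```

If nearly every down interval of one tunnel shows up in the output of shared_outages, the outages share a cause, which supports looking at the CGW side first.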

Most likely causes:

  1. The CGW is failing to respond to AWS DPD (Dead Peer Detection) messages on both tunnels. AWS sends DPD probes every 10 seconds. After 3 consecutive missed responses, AWS declares the peer dead and takes action based on the configured DPD timeout action. If your ASA experiences a brief CPU spike, memory pressure, or interface flap, it may miss DPD responses on both tunnels simultaneously.

  2. An upstream network issue between your CGW and the internet (ISP micro-outage, NAT device session table exhaustion, or firewall rate-limiting UDP 500/4500) that is not severe enough to appear on ISP status pages but enough to drop the DPD packets.

The 25-45 minute recovery time suggests your DPD timeout action is set to "Clear" (the default) and the startup action is set to "Add" (the default, meaning AWS acts as responder only). With this combination, when a DPD timeout occurs, AWS ends the IKE session and does not attempt to re-initiate it. The tunnel stays down until the CGW initiates a new IKE session. If your ASA has a backoff timer or does not actively retry, the tunnel remains down until the ASA's retry timer fires.
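
A quick way to check whether your tunnels actually use this Clear/Add combination is to inspect the tunnel options returned by the EC2 DescribeVpnConnections API. The sketch below is an assumption-laden illustration: the audit function works on the response dictionary, the field names (DpdTimeoutAction, StartupAction, OutsideIpAddress) are as I understand the boto3 response shape, missing fields are treated as the defaults described above, and the function names are my own:

```python
def slow_recovery_tunnels(response):
    """List tunnels where AWS clears the session on DPD timeout and never initiates IKE.

    Returns (vpn_connection_id, outside_ip, dpd_action, startup_action) tuples.
    """
    findings = []
    for vpn in response.get("VpnConnections", []):
        for tunnel in vpn.get("Options", {}).get("TunnelOptions", []):
            # Assumption: an absent key means the default ("clear" / "add").
            dpd_action = tunnel.get("DpdTimeoutAction", "clear")
            startup = tunnel.get("StartupAction", "add")
            if dpd_action == "clear" and startup == "add":
                findings.append(
                    (vpn.get("VpnConnectionId"), tunnel.get("OutsideIpAddress"),
                     dpd_action, startup)
                )
    return findings


def fetch_vpn_connections(vpn_connection_id):
    """Fetch the live configuration; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the audit above runs without AWS access
    ec2 = boto3.client("ec2")
    return ec2.describe_vpn_connections(VpnConnectionIds=[vpn_connection_id])
```

Any tunnel this reports depends entirely on the ASA to re-initiate IKE after a DPD timeout, which matches the long recovery times you are seeing.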

Troubleshooting steps:

  1. Enable VPN tunnel logging on the AWS side and check whether the tunnels went down due to DPD timeout (CGW not responding) or due to the CGW sending a DELETE notification (CGW intentionally tearing down). This distinction tells you whether the CGW went unresponsive or actively disconnected.

  2. Check your ASA logs during the exact outage timestamps for CPU/memory alerts, IKE errors, or interface flaps:

   show cpu usage
   show memory
   show crypto ikev2 sa detail
   show logging | include VPN|IKE|DPD

  3. Run continuous pings to both AWS tunnel endpoint IPs from a device other than the CGW (e.g., a separate host on your network that uses the same internet path). If pings fail during the outage, the issue is upstream of the CGW (ISP/firewall/NAT), not the CGW itself.

  4. Check if your firewall or NAT device has aggressive UDP session timeouts. Some devices time out UDP 4500 (NAT-T) sessions after as little as 30 seconds, which can cause both tunnels to appear unreachable from AWS's perspective.
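
For the continuous ping step, timestamps matter more than the pings themselves, since you need to line failures up against the CloudWatch outage windows. A rough Python sketch, assuming a Linux host (the ping flags -c and -W are the iputils ones) and placeholder endpoint IPs; the probe function and its parameters are hypothetical names of mine:

```python
import subprocess
import time
from datetime import datetime, timezone


def ping_once(ip, timeout_s=2):
    """Send one ICMP echo with the system ping (Linux flags) and report success."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def probe(targets, rounds, interval_s=5, ping=ping_once, sleep=time.sleep):
    """Ping every target once per round and return (utc_timestamp, ip) for each failure."""
    failures = []
    for round_no in range(rounds):
        for ip in targets:
            if not ping(ip):
                failures.append((datetime.now(timezone.utc), ip))
        if round_no < rounds - 1:
            sleep(interval_s)  # no pause needed after the final round
    return failures
```

Run it with both tunnel outside IPs as targets and a large rounds value, then compare the failure timestamps with the TunnelState drops. Failures to both endpoints during an outage point upstream of the ASA; clean pings during an outage point back at the ASA itself.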

Recommended changes:

  1. Change the DPD timeout action from "Clear" to "Restart". With "Restart", AWS will actively attempt to re-establish the IKE session after a DPD timeout rather than waiting for the CGW to initiate. This should significantly reduce your recovery time.

  2. Change the IKE startup action from "Add" (responder) to "Start" (initiator). This ensures AWS actively tries to bring the tunnel up rather than waiting for the CGW.

  3. On the ASA side, configure aggressive IKE retry timers so the CGW attempts to re-establish immediately after a failure rather than waiting 25-45 minutes.

  4. For long-term redundancy, consider adding a second VPN connection from a different CGW device (different public IP, ideally different ISP). This eliminates the single point of failure at the CGW level.
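
The first two changes can be made in the console, or scripted against the EC2 ModifyVpnTunnelOptions API. A hedged sketch of the latter with boto3: the parameter names are as I understand the API, the IDs and IPs you pass in are your own, and the helper names are illustrative. My understanding is that modifying tunnel options interrupts that tunnel for a short time, so it seems prudent to change one tunnel, wait for it to come back, then change the other:

```python
def restart_and_start_request(vpn_connection_id, tunnel_outside_ip):
    """Build keyword arguments that set DPD timeout action to restart and startup action to start."""
    return {
        "VpnConnectionId": vpn_connection_id,
        "VpnTunnelOutsideIpAddress": tunnel_outside_ip,  # selects which of the two tunnels
        "TunnelOptions": {
            "DPDTimeoutAction": "restart",
            "StartupAction": "start",
        },
    }


def apply_request(request):
    """Send the change to AWS; needs boto3 and credentials with EC2 permissions."""
    import boto3  # imported lazily so the builder can be inspected offline
    return boto3.client("ec2").modify_vpn_tunnel_options(**request)
```

Printing the result of restart_and_start_request before calling apply_request gives you a chance to confirm the tunnel IP and options first.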

AWS
SUPPORT ENGINEER
answered 16 days ago
EXPERT
reviewed 16 days ago
  • Ah, nice pointing out the Startup and DPD actions on the AWS side. Combined with checking the ASA crypto logs, that pretty much covers it.

AWS tunnels terminate in different Availability Zones. A simultaneous drop almost always points to the single common denominator on your end - your Cisco ASA or the ISP.

Since both tunnels drop together for 25-45 minutes, check your ASA syslogs just before the outage for these three common culprits:

  1. SA Rekeying Failures: The specific outage duration strongly suggests an IPsec lifetime mismatch. Ensure your ASA perfectly matches AWS defaults: Phase 1 (8 hours / 28800s) and Phase 2 (1 hour / 3600s). Look for IKEv2 rekeying errors.

  2. Resource Exhaustion: A CPU spike, memory leak, or crypto-engine hang on the ASA will kill both tunnels at once. Try to run show crypto accelerator statistics during the next outage.

  3. Silent ISP Packet Loss: ISPs rarely detect transient UDP (500/4500) drops or upstream BGP flaps. Set up continuous pings to the AWS tunnel public IPs to catch micro-outages that trigger DPD (Dead Peer Detection) disconnects.
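
For the lifetime check in item 1, a tiny comparison helper avoids eyeballing the numbers. This is only a sketch: the expected values are the AWS defaults quoted above (your tunnels may have been customized), the ASA values have to be read off your own configuration by hand, and the function name is made up for illustration:

```python
# AWS default lifetimes quoted above, in seconds.
AWS_DEFAULT_LIFETIMES_S = {"phase1": 28800, "phase2": 3600}


def lifetime_mismatches(asa_lifetimes_s, expected=AWS_DEFAULT_LIFETIMES_S):
    """Return {phase: (asa_value, expected_value)} for every phase that differs.

    A phase missing from asa_lifetimes_s is reported with an ASA value of None.
    """
    return {
        phase: (asa_lifetimes_s.get(phase), want)
        for phase, want in expected.items()
        if asa_lifetimes_s.get(phase) != want
    }
```

An empty result means the lifetimes match the defaults, in which case the rekeying theory is less likely and the other two items deserve more attention.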

The exact root cause will be in your ASA's IKE/IPsec logs in the 5 minutes leading up to the disconnect.

P.S.: I normally download the official Cisco ASA configuration template directly from the AWS VPC console (selecting ASA 9.x+ and IKEv2). Cross-check your running configuration line-by-line against this fresh template, especially the crypto ikev2 policy, the IPsec proposal (the IKEv2 counterpart of an IKEv1 transform-set), and the lifetimes. Even a minor discrepancy, or an outdated proposal left over from an older deployment, often triggers exactly this kind of rekeying instability.

EXPERT
answered 16 days ago
