Office365 not reachable from gateway load balancer

Hello, we have a Gateway Load Balancer fronting a couple of firewall appliances in a central VPC. It inspects all traffic from the spoke VPCs as well as from several on-prem remote locations connected through S2S VPN. We have noticed that some clients cannot reach the Microsoft Office365 endpoint https://www.office.com, which resolves to 13.107.6.156.

We have checked that our clients can reach resources in the spoke VPCs without any issues. We consulted our firewall vendor, who confirmed that the firewall sees the packets egressing to the internet and is not blocking them. To understand the cause, we turned on VPC flow logs, but we couldn't find any pattern that distinguishes the clients that can reach Office365 from the ones that cannot.

Our setup is:

Remote Offices <....> S2S VPN <....> Transit Gateway <....> Gateway Load Balancer <....> Firewalls <....> Internet

On a side note, we had a similar issue in the past for some random clients in the spoke VPCs, which was resolved by enabling Transit Gateway appliance mode. We are not sure whether it is related to the current issue, as we tried disabling it again and that didn't change anything.

Any hint would be appreciated.

asked 2 years ago, 7149 views
7 Answers
1
Accepted Answer

Thanks for providing the outputs.

In the traceroute from the non-working machine, the output ends at hop 17 without an entry for the final destination [13.107.6.156], while the traceroute from the working machine shows 19 hops, the last of which is the final destination.
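As a rough way to compare the two traces, here is a minimal Python sketch that reports the highest hop number in a traceroute output and whether the destination address shows up on any hop line. The function name `last_hop_and_reached` is hypothetical; the sample lines are the header and the last two hops copied from the traceroutes posted in this thread.

```python
import re

HEADER = "traceroute to www.office.com (13.107.6.156), 30 hops max, 60 byte packets\n"

# Last two hops of each traceroute posted in this thread.
NON_WORKING_TAIL = HEADER + (
    "16  13.104.176.13 (13.104.176.13)  3.785 ms ae26-0.icr01.dub07.ntwk.msn.net (104.44.239.33)  3.854 ms ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  3.720 ms\n"
    "17  * ae25-0.icr02.dub07.ntwk.msn.net (104.44.239.35)  3.749 ms ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  4.201 ms\n"
)
WORKING_TAIL = HEADER + (
    "18  ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  4.278 ms * *\n"
    "19  * 13.107.6.156 (13.107.6.156)  5.667 ms *\n"
)


def last_hop_and_reached(traceroute_output, destination_ip):
    """Return (highest hop number listed, True if a hop line shows destination_ip)."""
    last_hop = 0
    reached = False
    for line in traceroute_output.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if not match:
            continue  # skips the "traceroute to ..." header, which also names the destination
        last_hop = max(last_hop, int(match.group(1)))
        if f"({destination_ip})" in match.group(2):
            reached = True
    return last_hop, reached


print(last_hop_and_reached(NON_WORKING_TAIL, "13.107.6.156"))  # (17, False)
print(last_hop_and_reached(WORKING_TAIL, "13.107.6.156"))      # (19, True)
```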

So reaching [13.107.6.156] takes more than 17 hops. With that in mind, I correlated the packet with sequence number 2805856178 seen at the VPN router with the same packet at the firewall [the inner packet under the Geneve encapsulation]. Its TTL at the firewall is 17, which isn't enough to reach the final destination.
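The same correlation can be sketched in Python, assuming you have the text output of `tcpdump -n -v`. The helper name `ttls_for_seq` is hypothetical, and the sample lines are abbreviated from the captures posted in this thread (TCP options and checksums dropped). It relies on `tcpdump -v` printing each IP header line before the TCP line it belongs to, so the most recent `ttl` seen before a matching `seq` is the TTL of that packet, which for Geneve traffic is the inner header.

```python
import re

TTL_RE = re.compile(r"\bttl (\d+)")
SEQ_RE = re.compile(r"\bseq (\d+)")

# Abbreviated from the captures in this thread.
ROUTER_CAPTURE = """\
20:21:09.541736 IP (tos 0x0, ttl 19, id 13550, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.10.244.43212 > 13.107.6.156.https: Flags [S], seq 2805856178, win 62727, length 0
"""
FIREWALL_CAPTURE = """\
20:21:09.542414 IP (tos 0x0, ttl 255, id 0, offset 0, flags [none], proto UDP (17), length 128)
    10.30.200.93.60308 > 10.30.200.227.6081: Geneve, Flags [none], vni 0x0
        IP (tos 0x0, ttl 17, id 13550, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.10.244.43212 > 13.107.6.156.https: Flags [S], seq 2805856178, win 62727, length 0
"""


def ttls_for_seq(tcpdump_text, seq):
    """Return the TTL of the nearest preceding IP header for each TCP line with this seq."""
    ttls = []
    last_ttl = None
    for line in tcpdump_text.splitlines():
        ttl_match = TTL_RE.search(line)
        if ttl_match:
            last_ttl = int(ttl_match.group(1))  # remember the most recent IP header
        seq_match = SEQ_RE.search(line)
        if seq_match and int(seq_match.group(1)) == seq and last_ttl is not None:
            ttls.append(last_ttl)
    return ttls


print(ttls_for_seq(ROUTER_CAPTURE, 2805856178))    # [19]
print(ttls_for_seq(FIREWALL_CAPTURE, 2805856178))  # [17]
```

The outer Geneve header (TTL 255) is ignored because the inner IP header line comes later and overwrites it before the TCP line is reached.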

May I know the default TTL value currently used by the clients?

You can check it by running the command below, or by running a packet capture on the client itself and looking at the IP header.

sudo sysctl -a | grep "net.ipv4.ip_default_ttl"

You need to check the TTL across your network and make sure it is large enough to reach the destination.

On many Linux systems the default is 64, and you can raise it to 255 (the maximum) by running the command:

sudo sysctl -w "net.ipv4.ip_default_ttl=255"

In most cases, you should see an ICMP Time Exceeded message in the capture when the TTL reaches zero.
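To make the arithmetic concrete, here is a small Python sketch. It assumes the client started with a TTL of 64, which the thread does not confirm, and uses the TTL of 19 seen at the VPN router in the capture. The function names are hypothetical.

```python
def hops_consumed(observed_ttl, initial_ttl):
    """Number of TTL decrements between the sender and the capture point."""
    if observed_ttl > initial_ttl:
        raise ValueError("observed TTL cannot exceed the initial TTL")
    return initial_ttl - observed_ttl


def projected_ttl(observed_ttl, old_initial_ttl, new_initial_ttl):
    """TTL expected at the same capture point after changing the client's default."""
    return new_initial_ttl - hops_consumed(observed_ttl, old_initial_ttl)


# Assumption: the client's initial TTL was 64; 19 is the value seen at the VPN router.
print(hops_consumed(19, 64))       # 45
print(projected_ttl(19, 64, 255))  # 210
```

Under that assumption, roughly 45 decrements happen before the packet even reaches the VPN router, and raising the default to 255 would leave 210 at the same point.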

AWS
mml
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
1

Keep in mind that Gateway Load Balancer is just a bump in the wire; it doesn't contribute to the TTL. The main issue is within your on-prem network, as there seem to be a lot of hops before you hit the AWS edge. Since you mentioned that you're using an MPLS backbone to serve your remote branches, you can use the global command below to stop the TTL of the inner packet (the client packet) from being decremented across the LSP path, which makes the entire MPLS backbone look like a single hop. It will also hide all the LSRs along the path when you traceroute to any destination.

no mpls ip propagate-ttl

Usually it doesn't incur any downtime, but it's good practice to schedule a maintenance window just to be on the safe side, and to commit the change on all the P and PE routers that make up the MPLS backbone.
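A deliberately simplified Python model of the effect: with TTL propagation, every router on the LSP consumes one unit of the client packet's TTL; with propagation disabled, the whole backbone consumes one. The function name and the router count of 12 are hypothetical, and the exact accounting at the ingress and egress routers varies by platform.

```python
def ip_ttl_after_backbone(ip_ttl, routers_on_lsp, propagate_ttl):
    """Client packet TTL after crossing the MPLS backbone (simplified model)."""
    # With propagation, each router on the LSP costs one; without it, the
    # backbone is treated as a single hop, as described above.
    cost = routers_on_lsp if propagate_ttl else 1
    return max(ip_ttl - cost, 0)


# Hypothetical backbone with 12 routers on the path, client TTL of 64.
print(ip_ttl_after_backbone(64, 12, propagate_ttl=True))   # 52
print(ip_ttl_after_backbone(64, 12, propagate_ttl=False))  # 63
```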

AWS
mml
answered 2 years ago
0

Hello, I've got some questions for you:

  • Is the connectivity issue persistent with Office365?
  • Is it only seen with Office365? What about other websites?
  • Can you provide the output of a TCP traceroute to Office365 from a working machine and from a non-working one?
  • Are you egressing to the internet through the firewalls only, or do you have NAT gateways fronting them? This is to understand whether some public IPs are being blocked on the internet.

For the TCP traceroute, use this command:

traceroute -T -p443 www.office.com

  • Can you keep one healthy firewall behind the Gateway Load Balancer and run packet captures on a non-working client and on the firewall at the same time? You can de-register the other firewalls from the GWLB target group and keep only one healthy node.
AWS
mml
answered 2 years ago
0

On top of user mml's questions, are you doing SNAT on the FWs?

As for appliance mode, it is recommended to enable it (only on the TGW attachment towards the inspection VPC, not on the spoke VPC attachments).

There are a few additional best practices for GWLB deployments; take a look if you haven't already:

https://aws.amazon.com/blogs/networking-and-content-delivery/best-practices-for-deploying-gateway-load-balancer/

Also, are the route tables set up according to this design:

https://docs.aws.amazon.com/whitepapers/latest/building-scalable-secure-multi-vpc-network-infrastructure/using-nat-gateway-and-gwlb-with-ec2.html

GWLB is just a bump in the wire; it does not alter anything in the traffic.

AWS
EXPERT
answered 2 years ago
0

We've looked at these guides and practices and also validated the route table entries at every hop. The issue is persistent with Office365 for the non-working stations. Office365 is the most important website for us, but we noticed the same issue with some other websites as well. We do have three NAT gateways (one in each AZ) fronting the firewall appliances in the same central VPC. We tried creating new ones to change the egress public IPs, but that didn't help.

I'll provide the outputs and captures shortly.

answered 2 years ago
0

Traceroute from a non-working station:

[root@99.103.11.203~]# traceroute -T -p443 www.office.com
traceroute to www.office.com (13.107.6.156), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * (10.30.200.93) (10.30.200.93)  2.148 ms  *
 4  * (10.30.70.150) (10.30.70.150)  3.072 ms  *
 5  * * *
 6  * 240.1.116.14 (240.1.116.14)  2.666 ms *
 7  * 240.1.116.24 (240.1.116.24)  2.687 ms *
 8  * 241.0.10.134 (241.0.10.134)  2.746 ms 240.1.112.8 (240.1.112.8)  3.150 ms
 9  241.0.10.72 (241.0.10.72)  2.911 ms 241.0.10.134 (241.0.10.134)  2.814 ms 240.1.112.28 (240.1.112.28)  2.872 ms
10  242.3.209.17 (242.3.209.17)  3.033 ms 242.3.200.1 (242.3.200.1)  3.391 ms 240.1.108.24 (240.1.108.24)  2.710 ms
11  100.95.19.129 (100.95.19.129)  2.999 ms 242.3.209.129 (242.3.209.129)  13.362 ms 242.3.208.1 (242.3.208.1)  2.956 ms
12  100.100.16.106 (100.100.16.106)  3.342 ms 100.100.2.32 (100.100.2.32)  16.986 ms 100.95.19.143 (100.95.19.143)  2.803 ms
13  100.100.2.42 (100.100.2.42)  9.206 ms 100.95.20.17 (100.95.20.17)  3.501 ms 100.100.2.38 (100.100.2.38)  3.390 ms
14  100.100.2.42 (100.100.2.42)  3.697 ms 99.82.178.63 (99.82.178.63)  3.642 ms 100.95.20.113 (100.95.20.113)  3.310 ms
15  100.100.2.40 (100.100.2.40)  3.561 ms ae25-0.icr02.dub07.ntwk.msn.net (104.44.239.35)  3.667 ms 99.82.178.23 (99.82.178.23)  25.308 ms
16  13.104.176.13 (13.104.176.13)  3.785 ms ae26-0.icr01.dub07.ntwk.msn.net (104.44.239.33)  3.854 ms ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  3.720 ms
17  * ae25-0.icr02.dub07.ntwk.msn.net (104.44.239.35)  3.749 ms ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  4.201 ms

Traceroute from a working station:


[root@99.103.11.204~]# traceroute -T -p443 www.office.com
traceroute to www.office.com (13.107.6.156), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * (10.30.200.93) (10.30.200.93)  2.184 ms  *
 4  * (10.30.70.150) (10.30.70.150)  3.146 ms  *
 5  * * *
 6  * 240.1.116.12 (240.1.116.12)  3.028 ms *
 7  240.1.116.27 (240.1.116.27)  2.916 ms * *
 8  240.1.112.9 (240.1.112.9)  2.791 ms 241.0.10.135 (241.0.10.135)  2.539 ms 240.1.108.9 (240.1.108.9)  2.739 ms
 9  241.0.10.137 (241.0.10.137)  2.875 ms 241.0.10.131 (241.0.10.131)  2.574 ms 240.1.112.26 (240.1.112.26)  2.869 ms
10  242.3.201.17 (242.3.201.17)  3.795 ms 240.1.112.28 (240.1.112.28)  2.817 ms 240.1.112.31 (240.1.112.31)  2.501 ms
11  100.95.19.147 (100.95.19.147)  11.709 ms 242.3.208.17 (242.3.208.17)  9.224 ms 100.95.19.155 (100.95.19.155)  3.633 ms
12  100.100.16.46 (100.100.16.46)  3.608 ms 100.100.2.54 (100.100.2.54)  4.259 ms 100.95.19.139 (100.95.19.139)  2.804 ms
13  100.95.20.65 (100.95.20.65)  4.296 ms 100.100.2.62 (100.100.2.62)  3.825 ms 100.95.20.81 (100.95.20.81)  3.450 ms
14  100.95.20.113 (100.95.20.113)  3.396 ms ae27-0.icr02.dub08.ntwk.msn.net (104.44.239.39)  3.724 ms 99.82.178.23 (99.82.178.23)  4.167 ms
15  ae25-0.icr02.dub07.ntwk.msn.net (104.44.239.35)  3.693 ms ae21-0.db3-96c-1b.ntwk.msn.net (104.44.236.70)  6.535 ms ae23-0.db3-96c-1a.ntwk.msn.net (104.44.236.68)  3.950 ms
16  ae27-0.icr02.dub08.ntwk.msn.net (104.44.239.39)  3.880 ms 13.104.176.8 (13.104.176.8)  3.605 ms ae25-0.icr02.dub07.ntwk.msn.net (104.44.239.35)  3.934 ms
17  ae27-0.icr01.dub08.ntwk.msn.net (104.44.239.37)  3.673 ms * 13.104.176.9 (13.104.176.9)  4.042 ms
18  ae21-0.db3-96c-1a.ntwk.msn.net (104.44.236.62)  4.278 ms * *
19  * 13.107.6.156 (13.107.6.156)  5.667 ms *

tcpdump from on-prem VPN router:

[root@99.103.11.204~]# tcpdump -n -v -r router.pcap
reading from file router.pcap, link-type EN10MB (Ethernet)
20:21:09.541736 IP (tos 0x0, ttl 19, id 13550, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.10.244.43212 > 13.107.6.156.https: Flags [S], cksum 0x2933 (incorrect -> 0x559f), seq 2805856178, win 62727, options [mss 1350,sackOK,TS val 3042490422 ecr 0,nop,wscale 7], length 0
	

tcpdump from the firewall:

[root@firewall-x11-d23~]# tcpdump -n -v -r firewall.pcap
reading from file firewall.pcap, link-type EN10MB (Ethernet)
20:21:09.542414 IP (tos 0x0, ttl 255, id 0, offset 0, flags [none], proto UDP (17), length 128)
    10.30.200.93.60308 > 10.30.200.227.6081: Geneve, Flags [none], vni 0x0, options [class Unknown (0x108) type 0x1 len 12, class Unknown (0x108) type 0x2 len 12, class Unknown (0x108) type 0x3 len 8]
        IP (tos 0x0, ttl 17, id 13550, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.10.244.43212 > 13.107.6.156.https: Flags [S], cksum 0x57f3 (correct), seq 2805856178, win 62727, options [mss 1350,sackOK,TS val 3042490422 ecr 0,nop,wscale 7], length 0
20:21:09.542697 IP (tos 0x0, ttl 254, id 14928, offset 0, flags [none], proto UDP (17), length 128)
    10.30.200.227.60308 > 10.30.200.93.6081: Geneve, Flags [none], vni 0x0, options [class Unknown (0x108) type 0x1 len 12, class Unknown (0x108) type 0x2 len 12, class Unknown (0x108) type 0x3 len 8]
        IP (tos 0x0, ttl 17, id 13550, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.10.244.43212 > 13.107.6.156.https: Flags [S], cksum 0x57f3 (correct), seq 2805856178, win 62727, options [mss 1350,sackOK,TS val 3042490422 ecr 0,nop,wscale 7], length 0
answered 2 years ago
0

Spot on! To test this, we created a temporary GRE tunnel over the MPLS backbone that connects our on-prem remote offices to the VPN router, so that traffic takes a shorter route than the LSP path, and it worked. The TTL didn’t expire and Office365 was reachable. However, this is not a permanent solution, and we will now explore other options to address the problem.

Is it possible to modify the Gateway Load Balancer and Transit Gateway to keep the TTL as is and not decrement it?

answered 2 years ago
