ARP resolution does not work as intended within an AWS VPC when using L2 Announcements with Cilium CNI on a Kubernetes cluster spanning EC2 instances across subnets.


VPC Configuration

  • VPC CIDR: 10.0.0.0/16
  • Availability Zone 1: 10.0.0.0/24 (public), 10.0.64.0/24 (private)
  • Availability Zone 2: 10.0.16.0/24 (public), 10.0.80.0/24 (private)
  • Availability Zone 3: 10.0.32.0/24 (public), 10.0.96.0/24 (private)
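For reference, the subnets were created along these lines (an illustrative aws CLI sketch; the VPC ID and AZ names are placeholders, not my actual values):

```bash
# One VPC per region; subnets are zonal, so each lives in one AZ.
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Example: the private subnet in AZ 1 (vpc-xxxxxxxx and the AZ name are placeholders).
aws ec2 create-subnet \
  --vpc-id vpc-xxxxxxxx \
  --cidr-block 10.0.64.0/24 \
  --availability-zone eu-west-1a
```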

EC2 Configuration

  • Instance A: Deployed in private subnet 10.0.64.0/24 in AZ 1. Acts as the control-plane node of my Kubernetes cluster.
  • Instance B: Deployed in private subnet 10.0.96.0/24 in AZ 3. Acts as a worker node in my Kubernetes cluster.
  • Instance C: Deployed in public subnet 10.0.16.0/24 in AZ 2. Acts as a worker node in my Kubernetes cluster.
  • Instance D: Deployed in public subnet 10.0.0.0/24 in AZ 1. Acts as the test machine and is not part of the cluster.

Kubernetes Setup

I've set up a Kubernetes cluster with Instance A as the control-plane node, Instance B as the private worker node, and Instance C as the public worker node. I'm using Cilium CNI with VXLAN routing and have enabled Cilium's L2 Announcements feature. I've deployed an nginx Deployment together with a Service called nginx-svc of type LoadBalancer, and created a CiliumLoadBalancerIPPool resource that assigns services of type LoadBalancer an external IP from the subnet 10.0.128.0/24.
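For context, this is roughly how Cilium was installed (a sketch using the Cilium 1.14+ Helm chart; flag names differ on older versions):

```bash
helm repo add cilium https://helm.cilium.io/

# routingMode/tunnelProtocol are the 1.14+ names for what was previously tunnel=vxlan.
# L2 Announcements requires kube-proxy replacement per the Cilium 1.14 docs.
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set kubeProxyReplacement=true \
  --set l2announcements.enabled=true
```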

I chose 10.0.128.0/24 because it was unused and wouldn't conflict with my existing VPC subnets.
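The IP pool, the L2 announcement policy, and the service looked roughly like this (a sketch; field names follow the cilium.io/v2alpha1 CRDs, and the resource names are illustrative):

```bash
# spec.blocks is the Cilium >= 1.15 field; 1.14 used spec.cidrs instead.
# An empty policy selector matches all services, which is fine for this test.
kubectl apply -f - <<EOF
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  blocks:
    - cidr: 10.0.128.0/24
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-policy
spec:
  externalIPs: true
  loadBalancerIPs: true
EOF

# Expose the nginx Deployment as a LoadBalancer service.
kubectl expose deployment nginx --name=nginx-svc --port=80 --type=LoadBalancer
```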

Problem

As expected, my nginx-svc received an external IP from the virtual subnet 10.0.128.0/24. Let's say this external IP was 10.0.128.1. When I run curl http://10.0.128.1 from Instance A, Instance B, or Instance C, I can reach nginx-svc. However, when I run curl http://10.0.128.1 on Instance D, which isn't joined to my Kubernetes cluster, the request times out. This is the problem. I've read into how the L2 Announcements feature works: a node elected as leader for the service replies to ARP requests for the virtual IP 10.0.128.1 with its own MAC address, so that other hosts on the same L2 segment (the 10.0.0.0/16 VPC in my case, i.e. Instances A, B, C, and D) resolve the virtual IP to the node running the service and send their traffic to that MAC.
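To illustrate what I mean by ARP resolution, this is how I'd verify it from Instance D (a diagnostic sketch; ens5 is a placeholder for the instance's primary network interface):

```bash
# Ask who owns the virtual IP and wait for a reply from the announcing node.
arping -I ens5 -c 3 10.0.128.1

# Check whether a neighbor (ARP) entry was ever learned for the virtual IP.
ip neigh | grep 10.0.128.1

# In a second shell, capture ARP traffic to see requests/replies on the wire.
sudo tcpdump -eni ens5 arp
```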

This exact same setup works locally when I run it on QEMU: there I can access 10.0.128.1 even from virtual machines that are not joined to the cluster. On an AWS VPC, however, the same setup fails, and I'm not entirely sure why.

The reason I want to access the service running on the virtual IP 10.0.128.1 from Instance D, which isn't joined to the cluster, is so I can create DNAT/SNAT rules on Instance D via iptables and have it forward traffic from/to its public IPv4 address to/from the private service address 10.0.128.1 reachable over the LAN/VPC, thereby simulating a public-facing service.
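For completeness, the forwarding rules I have in mind on Instance D look roughly like this (a sketch; ens5 is again a placeholder for the primary interface):

```bash
# Allow Instance D to forward packets at all.
sudo sysctl -w net.ipv4.ip_forward=1

# DNAT: traffic arriving on Instance D's port 80 is redirected to the service VIP.
sudo iptables -t nat -A PREROUTING -i ens5 -p tcp --dport 80 \
  -j DNAT --to-destination 10.0.128.1:80

# Masquerade the forwarded traffic so replies come back through Instance D.
sudo iptables -t nat -A POSTROUTING -d 10.0.128.1 -p tcp --dport 80 -j MASQUERADE
```

For this to work at all, the EC2 source/destination check would presumably also need to be disabled on Instance D, as for any NAT instance.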
