At a high level, we monitor network traffic in our EKS clusters through the container_network_transmit_bytes_total and container_network_receive_bytes_total metrics provided by cAdvisor. Recently, while investigating network usage and trying to reduce the costs for NatGateway-Bytes and DataTransfer-Regional-Bytes, I stumbled upon unusually high network usage of the aws-node and kube-proxy pods. Looking at the utilization, the following queries
sum (increase(container_network_transmit_bytes_total{pod=~"aws-node.*",cluster="prod"}[24h])) / 1024^3
sum (increase(container_network_transmit_bytes_total{pod=~"kube-proxy.*",cluster="prod"}[24h])) / 1024^3
both return the same usage (up to GB precision) of 17818 GB/day. Compared to the residual traffic of the cluster
sum (increase(container_network_transmit_bytes_total{pod!~"kube-proxy.*|aws-node.*",cluster="prod"}[24h])) / 1024^3
which returns 9817 GB/day, this seems unusually high. I could not find anything online that would justify these numbers. To my understanding, kube-proxy just creates iptables rules on the nodes to forward packets to the correct pods/services, but these findings suggest to me that the packets are actually routed through kube-proxy. Is there any way to debug this further, or can somebody please enlighten me about this high network usage of aws-node and kube-proxy? Also, why do these two pods report almost identical network usage?
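One possible way to debug this further is to break the same counter down per node and per network interface and compare the two pods series by series. The following is only a minimal sketch: the Prometheus URL is hypothetical, the helper names are made up for illustration, and it assumes the series carry the `instance` label set by the scrape config and the `interface` label that cAdvisor attaches to its network metrics.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Prometheus endpoint (an assumption); replace with your own.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_query(pod_regex: str, cluster: str = "prod", window: str = "24h") -> str:
    """Return a PromQL query for transmitted GiB, split by node and interface."""
    selector = f'pod=~"{pod_regex}",cluster="{cluster}"'
    return (
        "sum by (instance, interface) ("
        f"increase(container_network_transmit_bytes_total{{{selector}}}[{window}])"
        ") / 1024^3"
    )


def run_query(query: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result list."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]


def same_series(a: list, b: list) -> bool:
    """True if both results hold identical values for every (instance, interface) pair."""
    def index(result: list) -> dict:
        return {
            (r["metric"].get("instance"), r["metric"].get("interface")): r["value"][1]
            for r in result
        }
    return index(a) == index(b)
```

Calling `same_series(run_query(build_query("aws-node.*")), run_query(build_query("kube-proxy.*")))` would then show whether the two pods agree on every node and interface. If they do, that would suggest both pods are reporting the same underlying interface counters rather than traffic of their own.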
Best,
Sam