I am running into an issue with my MSK cluster's broker Prometheus metrics. The JMX metrics endpoint constantly returns 429 (Too Many Requests) errors when Prometheus attempts to scrape the /metrics endpoint on port 11001 (the JMX Exporter port).
This does not seem to be related to broker instance type (3 m5.large brokers): the Node Exporter metrics endpoint on port 11002 scrapes fine, and my consumers never run into any throttling either.
This is problematic, as I want to monitor OffsetLag and other broker-specific metrics, and the inconsistency of the JMX metrics scrapes makes this nearly impossible. I have found no information anywhere else of anyone running into this particular error. As mentioned, I am only seeing 429 errors on this JMX metrics endpoint, nowhere else.
I have even increased the scrape interval to more than two minutes, and this does not solve the problem.
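For reference, this is roughly the shape of the scrape config I'm describing. The broker hostnames and job name below are placeholders, not my actual cluster endpoints, and the interval/timeout values are just the backed-off settings I tested:

```yaml
# prometheus.yml fragment (sketch) -- hostnames below are placeholders
scrape_configs:
  - job_name: "msk-jmx"
    scrape_interval: 120s   # backed off well past the 15s default; still get 429s
    scrape_timeout: 60s
    static_configs:
      - targets:
          # placeholder broker DNS names; real ones come from the MSK console / CLI
          - "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:11001"
          - "b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:11001"
          - "b-3.mycluster.abc123.kafka.us-east-1.amazonaws.com:11001"
```

The equivalent job pointed at port 11002 (Node Exporter) with the same settings never throttles, which is why I don't think it's a Prometheus-side misconfiguration.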
I'm having the same issue. The throttling on the endpoint is extreme, and I'm not sure how to reliably get metrics. If the metrics are too expensive to calculate, the endpoint should serve cached metrics rather than return 429.