How do I troubleshoot primary node failure with error “502 Bad Gateway” or “504 Gateway Time-out” in Amazon EMR?

My Amazon EMR primary node fails with a "502 Bad Gateway" or "504 Gateway Time-out" error.

Short description

An Amazon EMR primary node might fail with one of the following errors:

"The master failed: Error occurred:<html>?? <head><title>502 Bad Gateway</title></head> <body>?? <center><h1>502 Bad Gateway</h1></center> <hr><center>nginx/1.20.0</center>?? </body>?? </html>??"

-or-

"The master failed: Error occurred: <html>??<head><title>504 Gateway Time-out</title></head>??<body>??<center><h1>504 Gateway Time-out</h1></center>??<hr><center>nginx/1.16.1</center>??</body>??</html>??"

You might receive these errors for one of the following reasons:

  • The instance-controller daemon is in the stopped state or is down on the primary node instance.
  • The primary node is out of memory or disk space.
  • The Amazon Elastic Compute Cloud (Amazon EC2) instance status checks fail.

Resolution

Troubleshoot primary node instance-controller daemon failures

The instance controller on the primary node communicates with the Amazon EMR control plane and the rest of the cluster. If the instance controller can't communicate with the Amazon EMR control plane, then Amazon EMR classifies the primary node as unhealthy. If termination protection is turned on, then use SSH to connect to the primary node and then restart the instance controller process.

Amazon EMR version 5.30.0 and later:

  1. To check the status of the instance controller, run the following command:

    sudo systemctl status instance-controller.service
  2. If the instance controller is stopped or down, then run the following command to start it:

    sudo systemctl start instance-controller.service
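
The two steps above can be combined into one check. The following is a minimal sketch, assuming a systemd-based release (Amazon EMR 5.30.0 and later); `ensure_instance_controller` is a hypothetical helper name for illustration, not part of Amazon EMR.

```shell
# Sketch: start instance-controller only when systemd reports it as not active.
# ensure_instance_controller is a hypothetical helper name, not an EMR tool.
ensure_instance_controller() {
  unit="instance-controller.service"
  # "systemctl is-active --quiet" exits 0 only when the unit is active.
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is not active; starting it"
    sudo systemctl start "$unit"
  fi
}
```

After you connect to the primary node with SSH, load the function and run `ensure_instance_controller`. Then run the status command again to confirm that the daemon stays up.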

Amazon EMR versions 2 to 4:

  1. To check the status of the instance controller, run the following command:

    sudo /etc/init.d/instance-controller status
  2. If the instance controller is stopped or down, then run the following command to start it:

    sudo /etc/init.d/instance-controller start

Troubleshoot memory and disk issues

Complete the following steps:

  1. If termination protection is turned on, then use SSH to connect to the primary node.
  2. Review the instance-state log file.
  3. Analyze the instance metrics, such as memory and disk usage, listed in the instance-state log. You can also run Linux commands such as free -m and df -h on the primary node to check these metrics directly.
  4. Use the log file results to determine why the primary node uses a large amount of disk or memory.
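
To narrow down disk pressure quickly, you can filter the df output for file systems that are nearly full. The following is a minimal sketch; `filter_full_filesystems` is a hypothetical helper name, and the default threshold of 90 percent is an assumption, not an Amazon EMR limit.

```shell
# Sketch: read "df -P" output on stdin and print mount points at or above
# a usage threshold. filter_full_filesystems is a hypothetical helper name;
# the default threshold of 90 percent is an assumed value.
filter_full_filesystems() {
  threshold="${1:-90}"
  # In "df -P" output, column 5 is the capacity (for example "95%") and
  # column 6 is the mount point. NR > 1 skips the header line.
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t) print $6, $5
  }'
}
```

For example, run `df -P | filter_full_filesystems 80` to list mount points that are at least 80 percent full, and run `free -m` alongside it to check memory. Mount points that contain spaces are not handled by this sketch.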

Troubleshoot primary node EC2 instance status check failures

Review the instance status check metrics to determine whether the primary instance status check fails. If the instance status check fails, then troubleshoot the instance status check failure.

Note: When you stop and start your EC2 instance, your Amazon EMR cluster stops.
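
You can also read the status checks from the AWS CLI instead of the console. The following is a minimal sketch, assuming that the AWS CLI is installed and configured with permission to call `ec2:DescribeInstanceStatus`; `show_status_checks` and `status_checks_pass` are hypothetical helper names, and the instance ID is a placeholder that you supply.

```shell
# Sketch: print the system and instance status check results for one instance.
# show_status_checks and status_checks_pass are hypothetical helper names.
show_status_checks() {
  instance_id="$1"   # placeholder: the primary node's EC2 instance ID
  aws ec2 describe-instance-status \
    --instance-ids "$instance_id" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text
}

# Succeeds only when both the system check and the instance check report "ok".
status_checks_pass() {
  set -- $(show_status_checks "$1")
  [ "${1:-}" = "ok" ] && [ "${2:-}" = "ok" ]
}
```

For example, `status_checks_pass <instance-id> || echo "status check failing"` prints a message when either check reports a value other than ok, such as impaired.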

Troubleshoot primary nodes when termination protection is turned off and the cluster is already terminated

Take the following actions:

AWS OFFICIAL | Updated 2 months ago