My Amazon EMR primary node fails with a "502 Bad Gateway" or "504 Gateway Time-out" error.
Short description
An Amazon EMR primary node might fail with one of the following errors:
"The master failed: Error occurred: <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> <hr><center>nginx/1.20.0</center> </body> </html>"
-or-
"The master failed: Error occurred: <html> <head><title>504 Gateway Time-out</title></head> <body> <center><h1>504 Gateway Time-out</h1></center> <hr><center>nginx/1.16.1</center> </body> </html>"
You might receive these errors for one of the following reasons:
- The instance-controller daemon is in the stopped state or is down on the primary node instance.
- The primary node is out of memory or disk space.
- The Amazon Elastic Compute Cloud (Amazon EC2) instance status checks fail.
Resolution
Troubleshoot primary node instance-controller daemon failures
The instance controller on the primary node communicates with the Amazon EMR control plane and the rest of the cluster. If the instance controller can't communicate with the Amazon EMR control plane, then Amazon EMR classifies the primary node as unhealthy. If termination protection is turned on, then use SSH to connect to the primary node and then restart the instance controller process.
Amazon EMR version 5.30.0 and later:
- To check the status of the instance controller, run the following command:
sudo systemctl status instance-controller.service
- If the instance controller status is down, then run the following command to restart the instance controller:
sudo systemctl start instance-controller.service
Amazon EMR versions 2.x to 4.x:
- To check the status of the instance controller, run the following command:
sudo /etc/init.d/instance-controller status
- If the instance controller status is down, then run the following command to restart the instance controller:
sudo /etc/init.d/instance-controller start
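On Amazon EMR 5.30.0 and later, the status check and the restart can be combined into one step. The following is a minimal sketch, not an official AWS script. It assumes a systemd-based primary node, and ensure_instance_controller is a hypothetical helper name chosen for illustration.

```shell
# Sketch only: assumes Amazon EMR 5.30.0 or later, where the instance
# controller runs as a systemd unit. The function name is hypothetical.
ensure_instance_controller() {
  local unit="instance-controller.service"
  # "systemctl is-active --quiet" exits 0 only when the unit is active.
  if sudo systemctl is-active --quiet "$unit"; then
    echo "running"
  else
    # Start the daemon only when it is not already running.
    sudo systemctl start "$unit" && echo "started"
  fi
}
```

After you connect to the primary node with SSH, define the function and then run ensure_instance_controller. The function prints "running" if no action was needed and "started" if it started the daemon.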
Troubleshoot memory and disk issues
Complete the following steps:
- If termination protection is turned on, then use SSH to connect to the primary node.
- Review the instance-state log file.
- Analyze the instance metrics, such as memory and disk, that are listed in the instance-state log. You can also use Linux commands such as free -m and df -h to analyze these metrics.
- Use the log file results to determine why the primary node uses a large amount of disk or memory.
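To make the free and df output easier to scan, you can filter it for values that are close to their limits. The following is a minimal sketch under stated assumptions: full_filesystems and mem_used_percent are hypothetical helper names, the 90 percent threshold is an arbitrary example, and the parsing assumes the standard column layout of df -P and free -m.

```shell
# Sketch only: helper names and the default threshold are illustrative.

# Reads "df -P" output on stdin and prints the mount point and usage of
# each file system at or above the given percentage (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from the Capacity column
    if (use + 0 >= limit) print $6, $5
  }'
}

# Reads "free -m" output on stdin and prints the percentage of memory
# reported as used (the "used" column divided by the "total" column).
mem_used_percent() {
  awk '/^Mem:/ { printf "%d\n", ($3 * 100) / $2 }'
}
```

On the primary node, run df -P | full_filesystems 90 to list file systems that are at least 90 percent full, and run free -m | mem_used_percent to print memory usage as a percentage. Mount points that contain spaces are not handled by this sketch.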
Troubleshoot primary node EC2 instance status check failures
Review the instance status check metrics to determine whether the primary instance status check fails. If the instance status check fails, then troubleshoot the instance status check failure.
Note: When you stop and start your EC2 instance, your Amazon EMR cluster stops.
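You can also read the status checks from the AWS CLI instead of the console. The following is a minimal sketch, assuming that the AWS CLI is installed and configured with permission to call describe-instance-status. The function name status_checks_ok is hypothetical, and the instance ID in the usage note is a placeholder.

```shell
# Sketch only: the function name is hypothetical.
# Returns 0 when both the system and instance status checks report "ok",
# 1 when either check reports another value, and 2 when the AWS CLI call fails.
status_checks_ok() {
  local statuses
  statuses=$(aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text) || return 2
  # Text output is whitespace-separated, for example "ok impaired".
  set -- $statuses
  [ "$1" = "ok" ] && [ "$2" = "ok" ]
}
```

For example, status_checks_ok i-0123456789abcdef0 && echo "checks pass" prints a message only when both checks pass. Replace the placeholder ID with the instance ID of your primary node.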
Troubleshoot a terminated cluster that had termination protection turned off on the primary node
Take the following actions: