How do I turn off Safemode for the NameNode service on my Amazon EMR cluster?
The NameNode service goes into Safemode when I try to run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster. I turned off Safemode, but it comes back on immediately.
Short description
When you run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster, you might receive one of the following error messages:
- "Cannot create file/user/test.txt._COPYING_. Name node is in safe mode."
- "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hadoop/.sparkStaging/application_15########_0001. Name node is in safe mode. It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:ip-###-##-##-##.ec2.internal"
Safemode for the NameNode is a read-only mode for the Hadoop Distributed File System (HDFS) cluster. In Safemode, you can't make any modifications to the file system or blocks.
After the DataNodes report that most file system blocks are available, the NameNode automatically leaves Safemode. However, the NameNode might enter Safemode again for the following reasons:
- Available space is less than the amount of space that's required for the NameNode storage directory. The parameter dfs.namenode.resource.du.reserved defines the required space for the NameNode directory.
- The NameNode can't load the FsImage and EditLog into memory.
- The NameNode didn't receive the block report from the DataNode.
- Some nodes in the cluster might be down and the blocks on the nodes become unavailable.
- Some blocks might be corrupt.
Check for the root cause of the issue in the NameNode log location, /var/log/hadoop-hdfs/.
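A quick way to triage the log location above is to filter for Safemode-related warnings. This sketch demonstrates the grep pattern against a sample warning taken from this article; the log file name glob in the comment is an assumption for EMR.

```shell
# On a live primary node you would run something like:
#   grep -iE 'safe mode|NameNodeResourceChecker' /var/log/hadoop-hdfs/*namenode*.log
# Demonstrated here against a sample warning line from this article:
SAMPLE="2020-08-28 19:14:43,540 WARN NameNodeResourceChecker: Space available on volume '/dev/xvdb2' is 76546048, which is below the configured reserved amount 104857600"
MATCHES=$(echo "$SAMPLE" | grep -icE 'safe mode|NameNodeResourceChecker')
echo "matching lines: $MATCHES"
```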
Resolution
Before you leave Safemode, confirm that you know why the NameNode is stuck in Safemode. Review the status of all DataNodes and the NameNode logs.
Important: In some cases, manually turning off Safemode can cause data loss.
To manually turn off Safemode, run the following command:
sudo -u hdfs hdfs dfsadmin -safemode leave
Depending on the root cause of the error, complete one or more of the following troubleshooting steps to turn off Safemode.
Switch to a cluster with multiple primary nodes
Checkpointing isn't automatic in clusters with a single primary node. So HDFS can't back up edit logs to a new snapshot (FsImage) and remove them automatically. HDFS uses edit logs to record filesystem changes between snapshots. It's a best practice to manually remove the edit logs from a cluster with a single primary node. If you don't manually remove the edit logs, then the logs might use all the disk space in /mnt. To resolve this issue, launch a cluster with multiple primary nodes. Clusters with multiple primary nodes support high availability for HDFS NameNode.
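If you stay on a single-primary cluster, you can force a checkpoint manually so that old edit logs can be purged. This is a minimal sketch: DRY_RUN=1 (the default here) only prints the commands, and you would set DRY_RUN=0 on the primary node of a real cluster to execute them.

```shell
# Force a manual HDFS checkpoint so old edit logs can be removed.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}
run sudo -u hdfs hdfs dfsadmin -safemode enter   # HDFS goes read-only
run sudo -u hdfs hdfs dfsadmin -saveNamespace    # write a fresh FsImage
run sudo -u hdfs hdfs dfsadmin -safemode leave   # resume normal operation
```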
Remove unnecessary files from /mnt
The parameter dfs.namenode.resource.du.reserved specifies the minimum available disk space for /mnt. When the amount of available disk space for /mnt drops to a value below the value that's set in dfs.namenode.resource.du.reserved, the NameNode enters Safemode. The default value for dfs.namenode.resource.du.reserved is 100 MB. When Safemode is on, NameNode blocks all filesystem and block modifications. To resolve this issue, you must remove the unnecessary files from /mnt.
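The comparison that the NameNode performs can be sketched in shell. MOUNT defaults to / here so the check runs anywhere; on an EMR primary node you would set MOUNT=/mnt, and you can read the configured threshold on a cluster with `hdfs getconf -confKey dfs.namenode.resource.du.reserved`.

```shell
# Compare available space on a mount point against the 100 MB default
# of dfs.namenode.resource.du.reserved. MOUNT=/ is an assumption so the
# sketch runs anywhere; use MOUNT=/mnt on an EMR primary node.
MOUNT=${MOUNT:-/}
RESERVED_BYTES=104857600                          # 100 MB default
AVAIL_KB=$(df -Pk "$MOUNT" | awk 'NR==2 {print $4}')
AVAIL_BYTES=$((AVAIL_KB * 1024))
if [ "$AVAIL_BYTES" -lt "$RESERVED_BYTES" ]; then
  STATUS="below reserved threshold: the NameNode would enter Safemode"
else
  STATUS="above reserved threshold"
fi
echo "$STATUS"
```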
To delete the files that you no longer need, complete the following steps:
- Check the NameNode logs to verify that the NameNode is in Safemode because of insufficient disk space. When the available space first drops below the reserved amount, the logs look similar to the following example:

  2020-08-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): Space available on volume '/dev/xvdb2' is 76546048, which is below the configured reserved amount 104857600

  If the NameNode remains low on disk space, the logs look similar to the following example:

  2020-09-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): NameNode low on available disk space. Already in safe mode.

- To confirm that the NameNode is still in Safemode, run the following command:

  [root@ip-###-##-##-### mnt]# hdfs dfsadmin -safemode get

  Example output:

  Safe mode is ON

- Delete unnecessary files from /mnt. If the /mnt/namenode/current directory uses a large amount of space on a cluster with one primary node, then create a new snapshot (FsImage), and then remove the old edit logs.

- Check the amount of available disk space in /mnt. If the available space is more than 100 MB, then check the status of Safemode again:

  [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

  Example output:

  Safe mode is ON

- Turn off Safemode:

  [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

  Example output:

  Safe mode is OFF
If /mnt still has less than 100 MB of available space, then perform one or more of the following actions:
- Remove more files.
- Increase the size of the /mnt volume.
Remove more files
Complete the following steps:
- Navigate to the /mnt directory:

  cd /mnt

- Determine which folders use the most disk space:

  sudo du -hsx * | sort -rh | head -10

- Check the largest subfolders within the folders that use the most disk space. For example, if the var folder uses a large amount of disk space, then check the largest subfolders in var:

  cd var
  sudo du -hsx * | sort -rh | head -10

- Delete the largest files first. Make sure that you delete only files that you no longer need. The Amazon S3 logging bucket already stores backup copies of the compressed log files from /mnt/var/log/hadoop-hdfs/ and /mnt/var/log/hadoop-yarn/, so you can safely delete those log files.

- After you delete the unnecessary files, check the status of Safemode again:

  [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

  Example output:

  Safe mode is ON

- Turn off Safemode:

  [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

  Example output:

  Safe mode is OFF
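The "delete only files that you no longer need" step can be made safer by listing candidates before removing anything. This sketch demonstrates the pattern on a temporary directory; on an EMR node you would point LOG_DIR at /mnt/var/log/hadoop-hdfs or /mnt/var/log/hadoop-yarn, since the S3 logging bucket already holds copies of the rotated, compressed logs.

```shell
# List rotated, compressed logs that are candidates for deletion.
# Demonstrated on a temporary directory (an assumption) so the sketch
# runs without a cluster; substitute a real log directory on EMR.
LOG_DIR=$(mktemp -d)
touch "$LOG_DIR/hadoop-hdfs-namenode.log.1.gz" "$LOG_DIR/hadoop-hdfs-namenode.log"
# List only the compressed (already rotated) files; append -delete
# only after you have reviewed the list.
find "$LOG_DIR" -name '*.gz' -type f
```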
Check for corrupt or missing blocks and files
Complete the following steps:
- To check the health of the cluster, run the following command:

  hdfs fsck /

  Note: The output report also provides you with a percentage of under-replicated blocks and a count of missing replicas.

- To locate the DataNode for each block of a file, run the following command for each file in the list:

  hdfs fsck example_file_name -locations -blocks -files

  Note: Replace example_file_name with your file name.

  Example output:

  0. BP-762523015-192.168.0.2-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
  1. BP-762523015-192.168.0.2-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
  2. BP-762523015-192.168.0.2-1480061879099:blk_1073741832_1008 len=70846464 MISSING!

  The preceding example output shows which DataNode stores each block, in this example 192.168.0.2. You can check that DataNode's logs for errors related to a specific block ID (blk_##).

  Note: Missing blocks often occur because nodes terminate unexpectedly.

- To delete the corrupted files, exit Safemode and run the following command for each file:

  hdfs dfs -rm example_file_name

  Note: Replace example_file_name with your file name.
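When many files are corrupted, it helps to extract the affected paths from a saved fsck report so that each one can be reviewed before deletion. This sketch builds a small sample report as a stand-in for the output of `hdfs fsck / -list-corruptfileblocks`; the exact line layout of the real report is an assumption, so adjust the parsing to match your output.

```shell
# Build a sample fsck report (stand-in for:
#   hdfs fsck / -list-corruptfileblocks > fsck.out)
FSCK_OUT=$(mktemp)
printf "The list of corrupt blocks under path '/':\n" > "$FSCK_OUT"
printf 'blk_1073741830\t/user/hadoop/part-00000\n' >> "$FSCK_OUT"
printf 'blk_1073741831\t/user/hadoop/part-00001\n' >> "$FSCK_OUT"
printf "The filesystem under path '/' has 2 CORRUPT blocks\n" >> "$FSCK_OUT"
# Each block line is "blk_<id><TAB><path>"; print the unique paths.
awk -F'\t' '/^blk_/ {print $2}' "$FSCK_OUT" | sort -u
```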
Use CloudWatch metrics to monitor the health of HDFS
Use the following Amazon CloudWatch metrics to identify why the NameNode enters Safemode:
- To identify the percentage of HDFS storage that’s used, review HDFSUtilization.
- To identify the number of blocks where HDFS has no replicas, review MissingBlocks. These might be corrupt blocks.
- To identify the number of blocks that need replication, review UnderReplicatedBlocks.
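The metrics above can be pulled with the AWS CLI. This sketch prints the commands rather than executing them; CLUSTER_ID and the time window are placeholders that you replace with your own values.

```shell
# Print the CloudWatch CLI calls for the three HDFS metrics above.
# CLUSTER_ID and the start/end times are placeholders, not real values.
CLUSTER_ID="j-XXXXXXXXXXXXX"
for METRIC in HDFSUtilization MissingBlocks UnderReplicatedBlocks; do
  echo aws cloudwatch get-metric-statistics \
    --namespace AWS/ElasticMapReduce \
    --metric-name "$METRIC" \
    --dimensions Name=JobFlowId,Value="$CLUSTER_ID" \
    --start-time 2020-09-28T00:00:00Z --end-time 2020-09-28T06:00:00Z \
    --period 300 --statistics Average
done
```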
Related information
HDFS Users Guide (from the Apache Hadoop website)