
How do I turn off Safemode for the NameNode service on my Amazon EMR cluster?


The NameNode service goes into Safemode when I try to run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster. I turned off Safemode, but it comes back on immediately.

Short description

When you run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster, you might receive one of the following error messages:

  • "Cannot create file/user/test.txt._COPYING_. Name node is in safe mode."
  • "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hadoop/.sparkStaging/application_15########_0001. Name node is in safe mode. It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:ip-###-##-##-##.ec2.internal"

Safemode for the NameNode is a read-only mode for the Hadoop Distributed File System (HDFS) cluster. In Safemode, you can't make any modifications to the file system or blocks.

After the DataNodes report that most file system blocks are available, the NameNode automatically leaves Safemode. However, the NameNode might enter Safemode again for the following reasons:

  • Available space is less than the amount of space that's required for the NameNode storage directory. The parameter dfs.namenode.resource.du.reserved defines the required space for the NameNode directory.
  • The NameNode can't load the FsImage and EditLog into memory.
  • The NameNode didn't receive the block report from the DataNode.
  • Some nodes in the cluster might be down and the blocks on the nodes become unavailable.
  • Some blocks might be corrupt.

Check for the root cause of the issue in the NameNode log location, /var/log/hadoop-hdfs/.
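A quick way to surface the relevant log lines is to search the NameNode logs for Safemode-related messages. The following is a minimal sketch, assuming bash on the primary node; the `check_safemode_logs` helper name and the `*.log` file pattern are illustrative, so adjust them to match the files you see in /var/log/hadoop-hdfs/:

```shell
# Search a NameNode log directory for Safemode-related messages.
# The directory argument defaults to the EMR location from this article.
check_safemode_logs() {
  local dir="${1:-/var/log/hadoop-hdfs}"
  # Match both "safe mode" phrases and the low-disk-space warning,
  # then show the most recent 20 hits.
  grep -iE "safe mode|low on available disk space" "$dir"/*.log 2>/dev/null | tail -20
}
```

For example, `check_safemode_logs` run with no arguments searches /var/log/hadoop-hdfs/.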

Resolution

Before you leave Safemode, confirm that you know why the NameNode is stuck in Safemode. Review the status of all DataNodes and the NameNode logs.

Important: In some cases, manually turning off Safemode can result in data loss.

To manually turn off Safemode, run the following command:

sudo -u hdfs hadoop dfsadmin -safemode leave

Depending on the root cause of the error, complete one or more of the following troubleshooting steps to turn off Safemode.

Switch to a cluster with multiple primary nodes

Checkpointing isn't automatic in clusters with a single primary node, so HDFS can't merge the edit logs into a new snapshot (FsImage) and remove the old logs automatically. HDFS uses edit logs to record filesystem changes between snapshots. It's a best practice to manually remove the edit logs from a cluster with a single primary node. If you don't manually remove the edit logs, then the logs might use all the disk space in /mnt. To resolve this issue, launch a cluster with multiple primary nodes. Clusters with multiple primary nodes support high availability for the HDFS NameNode.
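On a cluster with a single primary node, you can also force a checkpoint manually with hdfs dfsadmin -saveNamespace, which writes a new FsImage so that old edit logs can be purged. The following is a hedged sketch, not a definitive procedure: the `checkpoint_namenode` helper and its `DRY_RUN` switch are illustrative, and the commands assume that you run them as the hdfs user on the primary node:

```shell
# Force a checkpoint so old edit logs can be purged.
# Set DRY_RUN=1 to print the hdfs commands instead of running them.
checkpoint_namenode() {
  local run="hdfs"
  [ "${DRY_RUN:-0}" = "1" ] && run="echo hdfs"
  $run dfsadmin -safemode enter    # saveNamespace requires Safemode
  $run dfsadmin -saveNamespace     # write a new FsImage, roll the edits
  $run dfsadmin -safemode leave    # resume normal operation
}
```

Note that this briefly blocks writes while the checkpoint runs, so schedule it outside of job activity.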

Remove unnecessary files from /mnt

The parameter dfs.namenode.resource.du.reserved specifies the minimum available disk space for /mnt. When the available disk space for /mnt drops below the value that's set in dfs.namenode.resource.du.reserved, the NameNode enters Safemode. The default value for dfs.namenode.resource.du.reserved is 100 MB (104857600 bytes). When Safemode is on, the NameNode blocks all filesystem and block modifications. To resolve this issue, remove the unnecessary files from /mnt.
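You can script the same comparison that the NameNode's resource checker performs. The `mnt_space_ok` helper below is illustrative, not part of Hadoop: it compares the available space that df reports against a threshold in KB, and the 102400 KB default mirrors the 100 MB reserved amount. On the cluster itself, you could read the configured value with `hdfs getconf -confKey dfs.namenode.resource.du.reserved`.

```shell
# Return success if the directory's available space exceeds the
# threshold (in KB). Defaults mirror /mnt and the 100 MB reserve.
mnt_space_ok() {
  local dir="${1:-/mnt}" reserved_kb="${2:-102400}"
  local avail_kb
  # df -Pk prints POSIX-format output in 1 KB blocks; column 4 of the
  # second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -gt "$reserved_kb" ]
}
```

For example, `mnt_space_ok /mnt || echo "low on space"` flags the condition that sends the NameNode into Safemode.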

To delete the files that you no longer need, complete the following steps:

  1. Use SSH to connect to the primary node.

  2. Check the NameNode logs to verify that the NameNode is in Safemode because of insufficient disk space. If the disk space is sufficient, then the logs look similar to the following example:

    2020-08-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): Space available on volume '/dev/xvdb2' is 76546048, which is below the configured reserved amount 104857600

    If the disk space is insufficient, then the logs look similar to the following example:

    2020-09-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): NameNode low on available disk space. Already in safe mode.
  3. To confirm that the NameNode is still in Safemode, run the following command:

    [root@ip-###-##-##-### mnt]# hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  4. Delete unnecessary files from /mnt. If the /mnt/namenode/current directory uses a large amount of space on a cluster with one primary node, then create a new snapshot (FsImage). Then, remove the old edit logs.

  5. Check the amount of available disk space in /mnt. If the available space is more than 100 MB, then check the status of Safemode again.

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  6. Turn off Safemode:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

    Example output:

    Safe mode is OFF

If /mnt still has less than 100 MB of available space, then perform one or more of the following actions:

Remove more files

Complete the following steps:

  1. Use SSH to connect to the primary node.

  2. Navigate to the /mnt directory:

    cd /mnt
  3. Determine which folders use the most disk space:

    sudo du -hsx * | sort -rh | head -10
  4. Check the largest subfolders within the folders that use the most disk space. For example, if the var folder uses a large amount of disk space, then check the largest subfolders in var:

    cd var
    sudo du -hsx * | sort -rh | head -10
  5. Delete the largest files first. Make sure that you delete only files that you no longer need. The Amazon S3 logging bucket already stores backup copies of the compressed log files from /mnt/var/log/hadoop-hdfs/ and /mnt/var/log/hadoop-yarn/, so you can safely delete these log files.

  6. After you delete the unnecessary files, check the status of Safemode again:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  7. Turn off Safemode:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

    Example output:

    Safe mode is OFF
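Step 5 above notes that compressed logs under /mnt/var/log are already backed up to the S3 logging bucket. The following is a minimal sketch of removing old compressed log files by age; the `purge_old_logs` helper and the 7-day default are illustrative, and you should confirm that the files are archived to S3 before you delete them:

```shell
# Delete compressed log files older than a given number of days.
# Prefix the call with sudo on the cluster if the files are root-owned.
purge_old_logs() {
  local dir="$1" days="${2:-7}"
  # -print shows each file as -delete removes it.
  find "$dir" -name "*.gz" -type f -mtime +"$days" -print -delete
}
# Example (run on the primary node):
#   purge_old_logs /mnt/var/log/hadoop-hdfs 7
```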

Check for corrupt or missing blocks and files

Complete the following steps:

  1. To check the health of the cluster, run the following command:
    hdfs fsck /
    Note: The output report also provides you with a percentage of under replicated blocks and a count of missing replicas.
  2. To locate the DataNode for each block of the file, run the following command for each file in the list:
    hdfs fsck example_file_name -locations -blocks -files
    Note: Replace example_file_name with your file name.
    Example output:
    0. BP-762523015-192.168.0.2-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
    1. BP-762523015-192.168.0.2-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
    2. BP-762523015-192.168.0.2-1480061879099:blk_1073741832_1008 len=70846464 MISSING!
    The preceding example output shows which DataNode stores the block. For example, 192.168.0.2. You can check the DataNode's logs for any errors related to the specific block ID (blk_##).
    Note: Missing blocks often occur because nodes terminate unexpectedly.
  3. To delete the corrupted files, exit Safemode and run the following command:
    hdfs dfs -rm example_file_name
    Note: Replace example_file_name with your file name.
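To check the DataNode logs for errors related to a specific block (step 2 above), it helps to extract just the block IDs from the fsck output. The `missing_block_ids` helper below is illustrative; it reads fsck output on stdin and prints one block ID per MISSING block:

```shell
# Extract block IDs (blk_##) from lines that fsck marks as MISSING.
# Reads fsck output on stdin, e.g.:
#   hdfs fsck example_file_name -locations -blocks -files | missing_block_ids
missing_block_ids() {
  grep 'MISSING!' | grep -oE 'blk_[0-9]+'
}
```

You can then grep each DataNode's logs for the printed IDs.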

Use CloudWatch metrics to monitor the health of HDFS

Use the following Amazon CloudWatch metrics to identify why the NameNode enters Safemode:

  • To identify the percentage of HDFS storage that’s used, review HDFSUtilization.
  • To identify the number of blocks where HDFS has no replicas, review MissingBlocks. These might be corrupt blocks.
  • To identify the number of blocks that need replication, review UnderReplicatedBlocks.
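You can also retrieve these metrics from the command line. The following is a sketch with the AWS CLI; the `fetch_missing_blocks` helper, its `DRY_RUN` switch, and the cluster ID are illustrative, and the date invocation assumes GNU date (as on Amazon Linux):

```shell
# Fetch the last hour of the MissingBlocks metric for an EMR cluster.
# Set DRY_RUN=1 to print the aws command instead of calling CloudWatch.
fetch_missing_blocks() {
  local cluster_id="$1" run="aws"
  [ "${DRY_RUN:-0}" = "1" ] && run="echo aws"
  $run cloudwatch get-metric-statistics \
    --namespace AWS/ElasticMapReduce \
    --metric-name MissingBlocks \
    --dimensions Name=JobFlowId,Value="$cluster_id" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Maximum
}
```

Swap the metric name for HDFSUtilization or UnderReplicatedBlocks to track the other indicators.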

Related information

HDFS Users Guide (from the Apache Hadoop website)

AWS OFFICIAL | Updated a month ago