In the Hadoop Distributed File System (HDFS), the Namenode manages the file system namespace and maintains metadata about the files and directories stored in the system. Each Datanode sends periodic heartbeats to the Namenode, so when a Datanode fails, the Namenode detects the failure by the absence of those heartbeats. After a configurable timeout, the Namenode marks the Datanode as dead and begins re-replicating the blocks that were stored on it, copying them from their surviving replicas on other healthy Datanodes. This process, known as "replication management," helps ensure the availability and durability of the data stored in HDFS. Once re-replication is complete, the dead Datanode can be safely removed from the system.
For example, suppose a Datanode named "DN1" holds replicas of three blocks, labeled "A", "B", and "C," and that two other Datanodes, "DN2" and "DN3," hold the remaining replicas of those blocks. If DN1 fails, the Namenode detects the failure, marks DN1 as dead, and sees that blocks "A", "B", and "C" have fallen below their target replication factor. It then schedules new replicas: the surviving copies on the healthy nodes are copied between DN2 and DN3 until each block is fully replicated again. (The Namenode cannot copy anything from DN1 itself, since it is dead; re-replication always reads from a surviving replica.) Once re-replication is complete, DN1 can be safely removed from the system. The blocks are again stored on multiple nodes, so if another node fails, the data can still be served from the remaining replicas.
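The re-replication decision described above can be sketched in a few lines. This is an illustrative toy model, not HDFS source code: the block map, node names, and replication factor are all hypothetical, and a real Namenode also weighs rack placement and node load when choosing targets.

```python
# Toy sketch of replication management: after DN1 dies, find blocks that have
# fallen below the target replication factor and plan a copy for each one,
# reading from a surviving replica. All names here are hypothetical.

REPLICATION_FACTOR = 2  # replicas per block in this toy example

# block -> set of Datanodes currently holding a replica (DN1 has just died)
block_locations = {
    "A": {"DN1", "DN2"},
    "B": {"DN1", "DN3"},
    "C": {"DN1", "DN2"},
}
live_nodes = {"DN2", "DN3"}

def plan_re_replication(block_locations, live_nodes, target=REPLICATION_FACTOR):
    """Return {block: (source_node, new_target_node)} for under-replicated blocks."""
    plan = {}
    for block, holders in block_locations.items():
        survivors = holders & live_nodes        # replicas on dead nodes are gone
        if len(survivors) < target:
            source = sorted(survivors)[0]       # copy FROM a surviving replica
            candidates = sorted(live_nodes - survivors)
            if candidates:                      # pick a node that lacks this block
                plan[block] = (source, candidates[0])
    return plan

print(plan_re_replication(block_locations, live_nodes))
# {'A': ('DN2', 'DN3'), 'B': ('DN3', 'DN2'), 'C': ('DN2', 'DN3')}
```

Note that every source in the plan is a healthy node: the failed node contributes nothing to its own recovery.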
To check the list of dead Datanodes in HDFS, run the command hdfs dfsadmin -report. Its output lists the live and dead Datanodes in the cluster.
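If you want to pull just the dead nodes out of that report programmatically, a small parser will do. The sample text below is an assumed, abbreviated shape of the `hdfs dfsadmin -report` output; on a real cluster you would feed in the actual command output instead.

```python
# Extract the entries listed under the "Dead datanodes" header of an
# `hdfs dfsadmin -report` style report. SAMPLE_REPORT is an assumed,
# trimmed-down example of the format, used here so the sketch is runnable.

SAMPLE_REPORT = """\
Live datanodes (2):

Name: 10.0.0.2:9866 (dn2)
Decommission Status : Normal

Name: 10.0.0.3:9866 (dn3)
Decommission Status : Normal

Dead datanodes (1):

Name: 10.0.0.1:9866 (dn1)
"""

def dead_datanodes(report_text):
    """Return the addresses on Name: lines under the 'Dead datanodes' header."""
    dead, in_dead_section = [], False
    for line in report_text.splitlines():
        if line.startswith("Dead datanodes"):
            in_dead_section = True
        elif line.startswith("Live datanodes"):
            in_dead_section = False
        elif in_dead_section and line.startswith("Name:"):
            dead.append(line.split()[1])
    return dead

print(dead_datanodes(SAMPLE_REPORT))  # ['10.0.0.1:9866']
```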
To decommission a Datanode gracefully, add its hostname to the exclude file that the Namenode reads (the file pointed to by the dfs.hosts.exclude property in hdfs-site.xml), and then run the command hdfs dfsadmin -refreshNodes. This prompts the Namenode to re-read the include and exclude files, move the listed node into the Decommissioning state, and replicate all of the data blocks stored on it to other Datanodes before marking it Decommissioned. The same hdfs dfsadmin -refreshNodes command also removes from the cluster any node that is no longer present in the include file, which is how a dead Datanode is ultimately dropped.
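The exclude-file step of that workflow can be sketched as follows. The exclude file path is an assumption here (it is whatever dfs.hosts.exclude points to in your hdfs-site.xml), and the hostname is hypothetical; the actual hdfs dfsadmin -refreshNodes call must still be run by the operator on the cluster.

```python
# Sketch of scheduling a decommission: append the host to the Namenode's
# exclude file (path comes from dfs.hosts.exclude; the default used below is
# an assumption), then re-read it with `hdfs dfsadmin -refreshNodes`.
from pathlib import Path

def schedule_decommission(hostname, exclude_file="/etc/hadoop/conf/dfs.exclude"):
    """Append the host to the exclude file if it is not already listed."""
    path = Path(exclude_file)
    listed = path.read_text().split() if path.exists() else []
    if hostname not in listed:
        with path.open("a") as f:
            f.write(hostname + "\n")
    # After updating the file, the operator runs:
    #   hdfs dfsadmin -refreshNodes
    # which makes the Namenode begin decommissioning the listed node.
    return path.read_text().split()

# Example against a temporary file, so the sketch runs anywhere:
import os, tempfile
tmp = os.path.join(tempfile.mkdtemp(), "dfs.exclude")
print(schedule_decommission("dn1.example.com", tmp))  # ['dn1.example.com']
```

Writing the file is idempotent, so re-running the function for a host that is already excluded changes nothing.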
It is important to note that decommissioning or removing a Datanode from the cluster should be done with caution and proper planning, as it can lead to data loss if not done correctly.