When properly configured, HDFS is much more robust against metadata corruption than a local filesystem, because it stores multiple copies of everything. Even so, a robust system should offer a way to repair damage when it does occur, so we added the capability for an administrator to recover a partial or corrupted edit log. This new functionality is called manual NameNode recovery.
Similar to fsck, NameNode recovery is an offline process. It can be very helpful for getting a corrupted filesystem back on its feet.

NameNode Recovery in Action

Let’s test out recovery mode. To activate recovery mode, you start the NameNode with the -recover flag, like so:
./bin/hadoop namenode -recover
At this point, the NameNode will ask you whether you want to continue.
You have selected Metadata Recovery mode.  This mode is intended to recover
lost metadata on a corrupt filesystem.  Metadata recovery mode often
permanently deletes data from your HDFS filesystem.  Please back up your edit
log and fsimage before trying this!

Are you ready to proceed? (Y/N)
 (Y or N) 
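Before answering, it is worth following the prompt's advice and backing up the metadata. The snippet below is a minimal sketch of one way to do that, not an official Hadoop tool. The function name backup_name_dir and the backup location are hypothetical, and it assumes the NameNode is stopped and that the name directory contains a current subdirectory holding the fsimage and edit log files.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): archive a NameNode metadata
# directory before running recovery mode.
# Usage: backup_name_dir <name_dir> <backup_dir>
backup_name_dir() {
    name_dir=$1
    backup_dir=$2

    # Assumption: fsimage and edit log files live under <name_dir>/current.
    if [ ! -d "$name_dir/current" ]; then
        echo "no current/ directory under $name_dir" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    archive="$backup_dir/$(basename "$name_dir")-$(date +%Y%m%d%H%M%S).tar.gz"

    # -C keeps the paths inside the archive relative to the name directory.
    tar -czf "$archive" -C "$name_dir" current || return 1

    # Print the archive path so the caller can record it.
    echo "$archive"
}
```

For instance, backup_name_dir /opt/hadoop/run4/name1 /path/to/backups would archive one of the name directories that appears in the log output later in this post. If more than one name directory is configured, run it once per directory.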
Once you answer yes, the recovery process will read as much of the edit log as possible. When there is an error or an ambiguity, it will ask you how to proceed.
In this example, we encounter an error when trying to read transaction ID 3:
11:10:41,443 ERROR FSImage:147 - Error replaying edit log at offset 71.  Expected transaction ID was 3
Recent opcode offsets: 17 71
org.apache.hadoop.fs.ChecksumException: Transaction is corrupt. Calculated checksum is -1642375052 but read checksum -6897
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.validateChecksum(FSEditLogOp.java:2356)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:2341)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:2247)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:110)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:74)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:140)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:74)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:138)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:683)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:639)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1033)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1103)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1164)
11:10:41,444 ERROR MetaRecoveryContext:96 - We failed to read txId 3
11:10:41,444  INFO MetaRecoveryContext:64 -
Enter 'c' to continue, skipping the bad section in the log
Enter 's' to stop reading the edit log here, abandoning any later edits
Enter 'q' to quit without saving
Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)
There are four options here: continue, stop, quit, and always.
Continue will try to skip over the bad section in the log. If the problem is just a stray byte or two, or a few bad sectors, this option will let you bypass it.
Stop stops reading the edit log and saves the current contents of the FSImage. In this case, all the edits that still haven’t been read will be permanently lost.
Quit exits the NameNode process without saving a new FSImage.
Always selects continue and suppresses this prompt from then on. Once you select always, recovery mode stops asking and picks the first choice at every later prompt.
In this case, I’m going to select continue, because I think there may be more edits following the corrupt region that I want to salvage. The next prompt informs me that an edit is missing, which is to be expected, considering the previous one was corrupt.
12:22:38,829  INFO MetaRecoveryContext:105 - Continuing.
12:22:38,860 ERROR MetaRecoveryContext:96 - There appears to be a gap in the edit log.  We expected txid 3, but got txid 4.
12:22:38,860  INFO MetaRecoveryContext:64 -
Enter 'c' to continue, ignoring missing  transaction IDs
Enter 's' to stop reading the edit log here, abandoning any later edits
Enter 'q' to quit without saving
Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)
Again I enter ‘c’ to continue.
Finally, recovery completes.
12:22:42,205  INFO MetaRecoveryContext:105 - Continuing.
12:22:42,207  INFO FSEditLogLoader:199 - replaying edit log: 4/5 transactions completed. (80%)
12:22:42,208  INFO FSImage:95 - Edits file /opt/hadoop/run4/name1/current/edits_0000000000000000001-0000000000000000005 of size 1048580 edits # 4 loaded in 4 seconds.
12:22:42,212  INFO FSImage:504 - Saving image file /opt/hadoop/run4/name2/current/fsimage.ckpt_0000000000000000005 using no compression
12:22:42,213  INFO FSImage:504 - Saving image file /opt/hadoop/run4/name1/current/fsimage.ckpt_0000000000000000005 using no compression
Then, the NameNode exits. Now, I can restart the NameNode and resume normal operation. The corruption has been fixed, although we have lost a small amount of metadata.

When Manual Recovery is the Best Choice

If there is another valid copy of the edit log somewhere else, it is preferable to use that copy rather than trying to recover the corrupted one. This is a case where high availability can help a lot. If there is a standby NameNode ready to take over, there should be no need to recover the edit log on the primary. Manual recovery is a good choice when there is no other copy of the edit log available.
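One quick way to see whether two copies of the edit log agree is to compare them byte for byte. The following is a rough sketch under some assumptions: the function name compare_edit_logs is hypothetical, the NameNode is stopped, and both name directories keep their segments under current with names beginning with edits_, as in the log output above. A difference only shows that the copies have diverged. It does not say which copy is the valid one.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): report which edit log segments
# differ between two NameNode name directories.
# Usage: compare_edit_logs <name_dir_a> <name_dir_b>
compare_edit_logs() {
    dir_a=$1
    dir_b=$2
    status=0

    for file_a in "$dir_a"/current/edits_*; do
        # If the glob matched nothing, the pattern itself is returned; skip it.
        [ -e "$file_a" ] || continue

        name=$(basename "$file_a")
        file_b="$dir_b/current/$name"

        if [ ! -e "$file_b" ]; then
            echo "MISSING $name"
            status=1
        elif cmp -s "$file_a" "$file_b"; then
            # cmp -s is silent and exits 0 only when the files are identical.
            echo "SAME $name"
        else
            echo "DIFFERS $name"
            status=1
        fi
    done

    # Non-zero exit status means at least one segment is missing or differs.
    return $status
}
```

With the directories from the example run, this would be called as compare_edit_logs /opt/hadoop/run4/name1 /opt/hadoop/run4/name2. Segments that exist only in the second directory are not reported, so it can be worth running the comparison in both directions.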