Changes between Version 2 and Version 3 of BadBlockHowto


Timestamp: Mar 25, 2017, 1:45:01 PM
Author: Gabriele Pohl
Comment: ext2/ext3 first example

  • BadBlockHowto

    v2 v3  
    2222The authors would like to thank Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others for explaining this approach. The authors would like to add text showing how to do this for other file systems, in particular XFS, and JFS: please email if you can provide this information.
    2323
     24=== ext2/ext3 first example ===
     25
     26In this example, the disk is failing self-tests at Logical Block Address `LBA = 0x016561e9 = 23421417`. The LBA counts sectors in units of 512 bytes, and starts at zero.
     27
     28{{{
     29root]# smartctl -l selftest /dev/hda
     30SMART Self-test log structure revision number 1
     31Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     32# 1  Extended offline    Completed: read failure       90%       217         0x016561e9
     33}}}
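
The self-test log reports the LBA in hexadecimal, while the later steps work in decimal. As a minimal sketch, shell arithmetic can do the conversion; the variable names `lba_hex` and `lba_dec` are only illustrative:

```shell
# Convert the hexadecimal LBA from the self-test log to decimal.
# 0x016561e9 is the LBA_of_first_error value shown in the log above.
lba_hex=0x016561e9
lba_dec=$(( $lba_hex ))
echo "$lba_dec"    # prints 23421417
```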
     34
     35Another sign of a bad sector on the disk is a non-zero raw value of the `Current_Pending_Sector` count:
     36
     37{{{
     38root]# smartctl -A /dev/hda
     39ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     40  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
     41196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
     42197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
     43198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
     44}}}
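
To check this attribute from a script, one possible approach is to filter the `smartctl -A` output with `awk`. The sketch below assumes the column layout shown above, with the raw value in the last column; the function name `pending_sectors` is hypothetical, and the sample line is copied from the output above so that the snippet runs without a disk:

```shell
# Print the raw value of Current_Pending_Sector from `smartctl -A` output on stdin.
# Assumes the attribute name is in column 2 and the raw value is the last column.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $NF }'
}

# Sample line copied from the output above. On a live system one would
# pipe the output of `smartctl -A /dev/hda` into the function instead.
pending_sectors <<'EOF'
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
EOF
```

A result other than 0 suggests that at least one sector is waiting to be reallocated.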
     45
     46First Step: we need to locate the partition that contains this sector of the disk:
     47
     48{{{
     49root]# fdisk -lu /dev/hda
     50Disk /dev/hda: 123.5 GB, 123522416640 bytes
     51255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
     52Units = sectors of 1 * 512 = 512 bytes
     53   Device Boot    Start       End    Blocks   Id  System
     54/dev/hda1   *        63   4209029   2104483+  83  Linux
     55/dev/hda2       4209030   5269319    530145   82  Linux swap
     56/dev/hda3       5269320 238227884 116479282+  83  Linux
     57/dev/hda4     238227885 241248104   1510110   83  Linux
     58}}}
     59
     60The partition `/dev/hda3` starts at `LBA 5269320` and extends past the ''problem'' LBA. The ''problem'' LBA is offset `23421417 - 5269320` = `18152097` sectors into the partition `/dev/hda3`.
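
The lookup and subtraction above can also be sketched as a small filter over the partition lines. This is only an illustration, assuming the old `fdisk -lu` column layout shown above (an optional `*` boot flag, then Start and End); the function name `find_partition` is hypothetical:

```shell
# Usage: find_partition LBA   (partition lines from `fdisk -lu` on stdin)
# Prints the partition containing the LBA and the sector offset into it.
find_partition() {
    awk -v lba="$1" '
        /^\/dev\// {
            i = ($2 == "*") ? 3 : 2          # skip the boot flag column if present
            start = $i + 0
            end = $(i + 1) + 0
            if (lba + 0 >= start && lba + 0 <= end)
                print $1, lba - start
        }'
}

# Partition lines copied from the fdisk output above.
find_partition 23421417 <<'EOF'
/dev/hda1   *        63   4209029   2104483+  83  Linux
/dev/hda2       4209030   5269319    530145   82  Linux swap
/dev/hda3       5269320 238227884 116479282+  83  Linux
/dev/hda4     238227885 241248104   1510110   83  Linux
EOF
```

For the example disk this prints `/dev/hda3 18152097`, matching the offset worked out by hand.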
     61
     62To verify the type of the file system and the mount point, look in `/etc/fstab`:
     63
     64{{{
     65root]# grep hda3 /etc/fstab
     66/dev/hda3 /data ext2 defaults 1 2
     67}}}
     68
     69You can see that this is an `ext2` file system, mounted at `/data`.
     70
     71Second Step: we need to find the block size of the file system (normally 4096 bytes for `ext2`):
     72
     73{{{
     74root]# tune2fs -l /dev/hda3 | grep Block
     75Block count:              29119820
     76Block size:               4096
     77}}}
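
If the block size is needed in a script, a sketch along the following lines could extract it from the `tune2fs -l` output. It assumes the `Block size:` label shown above; the function name `fs_block_size` is hypothetical, and the sample input is copied from the output above:

```shell
# Print the file system block size from `tune2fs -l` output on stdin.
fs_block_size() {
    awk -F: '$1 == "Block size" { gsub(/[ \t]/, "", $2); print $2 }'
}

# Sample lines copied from the output above. On a live system one would
# pipe the output of `tune2fs -l /dev/hda3` into the function instead.
fs_block_size <<'EOF'
Block count:              29119820
Block size:               4096
EOF
```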
     78
     79In this case the block size is 4096 bytes. Third Step: we need to determine which File System Block contains this LBA. The formula is:
     80
     81{{{
     82  b = (int)((L-S)*512/B)
     83where:
     84b = File System block number
     85B = File system block size in bytes
     86L = LBA of bad sector
     87S = Starting sector of partition as shown by fdisk -lu
     88and (int) denotes the integer part.
     89}}}
     90
     91In our example, `L=23421417`, `S=5269320`, and `B=4096`. Hence the ''problem'' LBA is in block number
     92
     93{{{
     94   b = (int)(18152097*512/4096) = (int)(2269012.125)
     95so b=2269012.
     96}}}
     97
     98Note: the fractional part of `0.125` indicates that this problem LBA is actually the second of the eight sectors that make up this file system block.
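
The formula and the note about the fractional part can be checked with integer shell arithmetic. This is a minimal sketch; the function names `lba_to_fs_block` and `sector_in_block` are hypothetical, and the block size is assumed to be a multiple of 512 bytes:

```shell
# Usage: lba_to_fs_block LBA PARTITION_START BLOCK_SIZE
# Integer division implements the (int) in b = (int)((L-S)*512/B).
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Usage: sector_in_block LBA PARTITION_START BLOCK_SIZE
# Zero-based index of the bad sector within its file system block.
sector_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

lba_to_fs_block 23421417 5269320 4096    # prints 2269012
sector_in_block 23421417 5269320 4096    # prints 1, i.e. the second of eight sectors
```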
     99
    100Fourth Step: we use `debugfs` to check whether this block is in use and, if it is, to locate the inode that owns it and the file that the inode belongs to:
     101
     102{{{
     103root]# debugfs
     104debugfs 1.32 (09-Nov-2002)
     105debugfs:  open /dev/hda3
     106debugfs:  testb 2269012
     107Block 2269012 not in use
     108}}}
     109
     110If the block is not in use, as in the above example, then you can skip the rest of this step and go ahead to Step Five.
     111
     112If, on the other hand, the block is in use, we want to identify the file that uses it:
     113
     114{{{
     115debugfs:  testb 2269012
     116Block 2269012 marked in use
     117debugfs:  icheck 2269012
     118Block   Inode number
     1192269012 41032
     120debugfs:  ncheck 41032
     121Inode   Pathname
     12241032   /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
     123}}}
     124
     125In this example, you can see that the problematic file (with the mount point included in the path) is: `/data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf`
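
The `icheck` and `ncheck` results can also be post-processed in a script. The sketch below works on the sample output copied from above, so it runs without touching a disk; on a live system the input would presumably come from a non-interactive call such as `debugfs -R "icheck 2269012" /dev/hda3`. The function names are hypothetical, and the path handling assumes that the path contains no whitespace:

```shell
# Extract the inode number from `icheck` output on stdin (skips the header row).
icheck_inode() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Usage: ncheck_full_path MOUNT_POINT   (`ncheck` output on stdin)
# Prepends the mount point to the path reported by ncheck.
# Assumes the path contains no whitespace.
ncheck_full_path() {
    awk -v mnt="$1" 'NR > 1 && NF >= 2 { print mnt $2 }'
}

icheck_inode <<'EOF'
Block   Inode number
2269012 41032
EOF

ncheck_full_path /data <<'EOF'
Inode   Pathname
41032   /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
EOF
```

With the sample input this prints the inode number `41032` followed by the full path under `/data`.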
     126
     127When we are working with an `ext3` file system, it may happen that the affected file is the journal itself. Generally, if this is the case, the inode number will be very small. In any case, `debugfs` will not be able to get the file name:
     128
     129{{{
     130debugfs:  testb 2269012
     131Block 2269012 marked in use
     132debugfs:  icheck 2269012
     133Block   Inode number
     1342269012 8
     135debugfs:  ncheck 8
     136Inode   Pathname
     137debugfs:
     138}}}
     139
     140To get around this situation, we can remove the journal altogether:
     141
     142{{{
     143tune2fs -O ^has_journal /dev/hda3
     144}}}
     145
    146and then start again with Step Four: this time we should see that the bad block is no longer in use. If we removed the journal, we should remember to rebuild it at the end of the whole procedure:
     147
     148{{{
     149tune2fs -j /dev/hda3
     150}}}
     151
    152Fifth Step. NOTE: '''This last step will permanently and irretrievably destroy the contents of the file system block that is damaged''': if the block was allocated to a file, some of the data in this file is going to be overwritten with zeros. You will not be able to recover that data unless you can replace the file with a fresh or correct version.
     153
     154To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:
     155
     156{{{
     157root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
     158root]# sync
     159}}}
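
Because a wrong `seek` value overwrites a healthy block, it may help to print the command and compare it with the numbers from the earlier steps before running it. The following dry-run helper is only a sketch; the function name `zero_block_cmd` is hypothetical, and it never executes `dd` itself:

```shell
# Usage: zero_block_cmd DEVICE BLOCK_SIZE BLOCK_NUMBER
# Prints (does not run) the dd command that would zero one file system block.
zero_block_cmd() {
    echo "dd if=/dev/zero of=$1 bs=$2 count=1 seek=$3"
}

zero_block_cmd /dev/hda3 4096 2269012
# prints: dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
```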
     160
     161Now everything is back to normal: the sector has been reallocated. Compare the output just below to similar output near the top of this article:
     162
     163{{{
     164root]# smartctl -A /dev/hda
     165ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     166  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
     167196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
     168197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
     169198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
     170}}}
     171
    172Note: for some disks it may be necessary to update the SMART Attribute values by running `smartctl -t offline /dev/hda`.
     173
    174We have corrected the first bad block. If more than one block was bad, we should repeat all the steps for each of the remaining ones. After we do that, the disk will pass its self-tests again:
     175
     176{{{
     177root]# smartctl -t long /dev/hda  [wait until test completes, then]
     178root]# smartctl -l selftest /dev/hda
     179SMART Self-test log structure revision number 1
     180Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     181# 1  Extended offline    Completed without error       00%       239         -
     182# 2  Extended offline    Completed: read failure       90%       217         0x016561e9
     183# 3  Extended offline    Completed: read failure       90%       212         0x016561e9
     184# 4  Extended offline    Completed: read failure       90%       181         0x016561e9
     185# 5  Extended offline    Completed without error       00%        14         -
     186# 6  Extended offline    Completed without error       00%         4         -
     187}}}
     188
     189and no longer shows any offline uncorrectable sectors:
     190
     191{{{
     192root]# smartctl -A /dev/hda
     193ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     194  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
     195196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
     196197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
     197198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
     198}}}
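
The result of the most recent self-test can also be checked from a script. The sketch below assumes the log format shown above, where entry `# 1` is the newest test; the function name `latest_selftest_ok` is hypothetical, and the sample lines are copied from the log above:

```shell
# Reads `smartctl -l selftest` output on stdin.
# Exit status is 0 only if entry "# 1" exists and completed without error.
latest_selftest_ok() {
    awk '/^# *1 / { found = 1; ok = /Completed without error/ }
         END { exit !(found && ok) }'
}

if latest_selftest_ok <<'EOF'
# 1  Extended offline    Completed without error       00%       239         -
# 2  Extended offline    Completed: read failure       90%       217         0x016561e9
EOF
then
    echo "latest self-test passed"
else
    echo "latest self-test failed"
fi
```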
     199
     200== Footnotes ==
     201
    24202[=#footnote1 [1]] Self-Monitoring, Analysis and Reporting Technology -> SMART
    25203