Changes between Version 3 and Version 4 of BadBlockHowto


Timestamp:
Mar 25, 2017, 1:55:25 PM
Author:
Gabriele Pohl
Comment:

ext2/ext3 second example

  • BadBlockHowto

    v3 v4  
    33This article describes what actions might be taken when smartmontools detects a bad block on a disk. It demonstrates how to identify the file associated with an unreadable disk sector, and how to force that sector to reallocate.
    44
    5 [[PageOutline(2,Table of Contents, inline)]]
     5[[PageOutline(2-3,Table of Contents, inline, unnumbered)]]
    66
    77
     
    198198}}}
    199199
     200=== ext2/ext3 second example ===
     201
     202On this drive, the first sign of trouble was this email from `smartd`:
     203
     204{{{
     205    To: ballen
     206    Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu
     207    This email was generated by the smartd daemon running on host:
     208    medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis
     209    The following warning/error was logged by the smartd daemon:
     210    Device: /dev/hda, Self-Test Log error count increased from 0 to 1
     211}}}
     212
     213Running `smartctl -a /dev/hda` confirmed the problem:
     214
     215{{{
     216Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     217# 1  Extended offline    Completed: read failure       80%       682         0x021d9f44
     218Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10)
     219ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     220  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
     221196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
     222197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       3
     223198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3
     224}}}
     225
     226The attribute table above shows 3 sectors on the pending list: sectors that the disk cannot read but would like to reallocate.
     227
     228The device also shows errors in the SMART error log:
     229
     230{{{
     231Error 212 occurred at disk power-on lifetime: 690 hours
     232  After command completion occurred, registers were:
     233  ER ST SC SN CL CH DH
     234  -- -- -- -- -- -- --
     235  40 51 12 46 9f 1d e2  Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750
     236  Commands leading to the command that caused the error were:
     237  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
     238  -- -- -- -- -- -- -- --   ---------  --------------------
     239  25 00 12 46 9f 1d e0 00 2485545.000  READ DMA EXT
     240}}}
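Both logs report the failing addresses in hexadecimal, while the later steps use decimal sector numbers. A minimal sketch of the conversion with the shell's `printf` follows; the helper name `hex_to_dec` is only illustrative and is not part of smartmontools.

```shell
# hex_to_dec HEXVALUE
# Print a hexadecimal value (with 0x prefix) as a decimal number.
# Illustrative helper, not a smartmontools command.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x021d9f44   # LBA_of_first_error from the self-test log -> 35495748
hex_to_dec 0x021d9f46   # LBA from the SMART error log              -> 35495750
```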
     241
     242Signs of trouble at this LBA may also be found in `SYSLOG`:
     243
     244{{{
     245[root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq
     246 LBAsect=35495748
     247 LBAsect=35495750
     248}}}
     249
     250A quick check shows how many bad sectors there really are. Using the `bash` shell, we read the 70 sectors around the trouble area one at a time:
     251
     252{{{
     253[root]# export i=35495730
     254[root]# while [ $i -lt 35495800 ]
     255        > do echo $i
     256        > dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i
     257        > let i+=1
     258        > done
     259<SNIP>
     26035495734
     2611+0 records in
     2621+0 records out
     26335495735
     264dd: reading `/dev/hda': Input/output error
     2650+0 records in
     2660+0 records out
     267<SNIP>
     26835495751
     269dd: reading `/dev/hda': Input/output error
     2700+0 records in
     2710+0 records out
     27235495752
     2731+0 records in
     2741+0 records out
     275<SNIP>
     276}}}
     277
     278which shows that the seventeen sectors `35495735-35495751` (inclusive) are not readable.
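When the range is larger, reading the full `dd` output by eye becomes tedious. The sketch below wraps the same one-sector `dd` read in a function that prints only the sectors that fail and then a total. The function name `scan_sectors` is hypothetical, and the sketch assumes that `dd` exits with a non-zero status on a read error, as the `Input/output error` messages above suggest.

```shell
# scan_sectors DEVICE FIRST LAST
# Try to read every 512-byte sector in [FIRST, LAST] and report the ones
# dd cannot read, followed by a summary line. Hypothetical helper.
scan_sectors() {
    dev=$1
    sector=$2
    last=$3
    bad=0
    while [ "$sector" -le "$last" ]; do
        # One sector per dd call, so a failure identifies exactly one LBA.
        if ! dd if="$dev" of=/dev/null bs=512 count=1 skip="$sector" 2>/dev/null; then
            echo "unreadable: $sector"
            bad=$((bad + 1))
        fi
        sector=$((sector + 1))
    done
    echo "total unreadable: $bad"
}

# Example call (needs root); same device and range as the loop above:
# scan_sectors /dev/hda 35495730 35495799
```

The function only reads from the device, so it is safe to run on a mounted disk.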
     279
     280Next, we identify the files at those locations. The partitioning information on this disk is identical to the first example above, and as in that case the problem sectors are on the third partition `/dev/hda3`. So we have:
     281
     282{{{
     283     L=35495735 to 35495751
     284     S=5269320
     285     B=4096
     286}}}
     287
     288so that `b=3778301` to `3778303` are the three bad blocks in the file system.
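The block numbers can be checked with shell integer arithmetic. This sketch assumes the same relation as in the first example, `b = (int)((L-S)*512/B)`, where `L` is the LBA of the bad sector, `S` the starting sector of the partition and `B` the file system block size; the helper name `fs_block` is illustrative.

```shell
# fs_block LBA PARTITION_START BLOCK_SIZE
# File system block that contains the given 512-byte disk sector.
# Integer division truncates, which matches the (int) cast in the formula.
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

S=5269320   # starting sector of /dev/hda3
B=4096      # file system block size in bytes

fs_block 35495735 "$S" "$B"   # first unreadable sector -> 3778301
fs_block 35495751 "$S" "$B"   # last unreadable sector  -> 3778303
```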
     289
     290{{{
     291[root]# debugfs
     292debugfs 1.32 (09-Nov-2002)
     293debugfs:  open /dev/hda3
     294debugfs:  icheck 3778301
     295Block   Inode number
     2963778301 45192
     297debugfs:  icheck 3778302
     298Block   Inode number
     2993778302 45192
     300debugfs:  icheck 3778303
     301Block   Inode number
     3023778303 45192
     303debugfs:  ncheck 45192
     304Inode   Pathname
     30545192   /S1/R/H/714979488-714985279/H-R-714979984-16.gwf
     306debugfs:  quit
     307}}}
     308
     309Note that the first few steps of this procedure could also be done with a single command, which is very helpful if there are many bad blocks (thanks to Danie Marais for pointing this out):
     310
     311{{{
     312debugfs: icheck 3778301 3778302 3778303
     313}}}
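For a long run of consecutive bad blocks, the `icheck` request can be generated rather than typed. The sketch below builds the request string with `seq`; the helper name `icheck_request` is hypothetical. The commented line shows one way the string might be passed to `debugfs` non-interactively through its `-R` option.

```shell
# icheck_request FIRST_BLOCK LAST_BLOCK
# Build a single debugfs "icheck" request covering a range of blocks.
# Hypothetical helper.
icheck_request() {
    blocks=$(seq "$1" "$2" | tr '\n' ' ')
    # Strip the trailing space left by tr.
    echo "icheck ${blocks% }"
}

icheck_request 3778301 3778303   # -> icheck 3778301 3778302 3778303

# Possible non-interactive use (needs root):
# debugfs -R "$(icheck_request 3778301 3778303)" /dev/hda3
```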
     314
     315To confirm that this is really the damaged file (the paths below suggest the file system is mounted on `/data`):
     316
     317{{{
     318[root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf
     319md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error
     320}}}
     321
     322Finally we force the disk to reallocate the three bad blocks:
     323
     324{{{
     325[root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
     326[root]# sync
     327}}}
     328
     329Writing the 17 unreadable sectors directly on the whole-disk device would probably work as well:
     330
     331{{{
     332[root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735
     333}}}
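The two `dd` commands do not overwrite exactly the same area. The sketch below works out which disk sectors the block-based write covers, using the values of `S` and `B` given earlier; the helper name `block_sector_range` is illustrative.

```shell
# block_sector_range PARTITION_START BLOCK_SIZE FIRST_BLOCK BLOCK_COUNT
# Print the first and last 512-byte disk sector covered by a run of
# file system blocks. Illustrative helper.
block_sector_range() {
    per_block=$(( $2 / 512 ))          # sectors per file system block
    first=$(( $1 + $3 * per_block ))
    last=$(( first + $4 * per_block - 1 ))
    echo "$first $last"
}

block_sector_range 5269320 4096 3778301 3   # -> 35495728 35495751
```

The three-block write therefore zeroes 24 sectors, 35495728 through 35495751, including the seven still-readable sectors 35495728 through 35495734 that share the first block. The 512-byte variant touches only the 17 unreadable sectors.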
     334
     335At this point the SMART attributes read:
     336
     337{{{
     338ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     339  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
     340196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
     341197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
     342198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
     343}}}
     344
     345which is encouraging, since the pending sector count is now zero. Note that the reallocation count has not increased: after the rewrite, the drive may have regained confidence in these sectors and decided not to reallocate them.
     346
     347A device self test:
     348
     349{{{
     350[root]# smartctl -t long /dev/hda
     351(then wait about an hour) shows no unreadable sectors or errors:
     352Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     353# 1  Extended offline    Completed without error       00%       692         -
     354# 2  Extended offline    Completed: read failure       80%       682         0x021d9f44
     355}}}
     356
    200357== Footnotes ==
    201358