Changes between Version 10 and Version 11 of BadBlockHowto


Timestamp:
Mar 25, 2017, 9:55:20 PM
Author:
Gabriele Pohl
Comment:

Example

  • BadBlockHowto

    v10 v11  
    664664Here is an alternate brute force technique to consider: if the data on the SCSI or ATA disk has all been backed up (e.g. is held on the other disks in a RAID 5 enclosure), then simply reformatting the disk may be the least cumbersome approach.
    665665
     666==== Example ====
     667
     668Given a ''bad block'', it may still be useful to use the `fdisk` command (if the disk has multiple partitions) to find out which partition is involved, and then use `debugfs` (or a similar tool for the file system in question) to find out which file or other part of the file system, if any, may have been damaged. This is discussed in section [#Repairsinafilesystem Repairs in a file system].
     669
     670Then a program that can execute the SCSI `REASSIGN BLOCKS` command is required. In Linux (2.4 and 2.6 series), FreeBSD, Tru64 (OSF) and Windows, the author's `sg_reassign` utility in the `sg3_utils` package can be used. That package also contains `sg_verify`, which can be used to check that a block is readable.
     671
     672Assume that logical block address `1193046` (which is `123456` in hex) is corrupt [#footnote10 [10]] on the disk at `/dev/sdb`. A long self-test started with `smartctl -t long /dev/sdb` may produce log entries like this:
     673
     674{{{
     675# smartctl -l selftest /dev/sdb
     676smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
     677Home page is http://smartmontools.sourceforge.net/
     678SMART Self-test log
     679Num  Test              Status            segment  LifeTime  LBA_first_err [SK ASC ASQ]
     680     Description                         number   (hours)
     681# 1  Background long   Failed in segment      -     354           1193046 [0x3 0x11 0x0]
     682# 2  Background short  Completed              -     323                 - [-   -    -]
     683# 3  Background short  Completed              -     194                 - [-   -    -]
     684}}}
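
The failing address can be pulled out of a log like the one above with standard text tools. The following is a minimal sketch, assuming the log layout shown here (failed entries contain the word `Failed`, and the LBA is the field just before the bracketed sense data). The function name `first_failed_lba` is hypothetical, and other `smartctl` versions may format the log differently.

```shell
#!/bin/sh
# Hypothetical helper: print the LBA_first_err of the first failed self-test.
# Assumes the column layout shown in the log above.
first_failed_lba() {
    awk '/Failed/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\[/) { print $(i - 1); exit }
    }'
}

# Sample input copied from the log above (no disk access needed).
log='# 1  Background long   Failed in segment      -     354           1193046 [0x3 0x11 0x0]
# 2  Background short  Completed              -     323                 - [-   -    -]'

lba=$(printf '%s\n' "$log" | first_failed_lba)

# Show the address in both notations used in this section.
printf 'decimal: %d  hex: 0x%x\n' "$lba" "$lba"
```

On a live system the same function could be fed directly, for example `smartctl -l selftest /dev/sdb | first_failed_lba`.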
     685
     686The `sg_verify` utility can be used to confirm that there is a problem at that address:
     687
     688{{{
     689# sg_verify --lba=1193046 /dev/sdb
     690verify (10):  Fixed format, current;  Sense key: Medium Error
     691 Additional sense: Unrecovered read error
     692  Info fld=0x123456 [1193046]
     693  Field replaceable unit code: 228
     694  Actual retry count: 0x008b
     695medium or hardware error, reported lba=0x123456
     696}}}
     697
     698Now the GLIST length is checked before the block reassignment:
     699
     700{{{
     701# sg_reassign --grown /dev/sdb
     702>> Elements in grown defect list: 0
     703}}}
     704
     705And now for the actual reassignment followed by another check of the GLIST length:
     706
     707{{{
     708# sg_reassign --address=1193046 /dev/sdb
     709# sg_reassign --grown /dev/sdb
     710>> Elements in grown defect list: 1
     711}}}
     712
     713The GLIST length has grown by one, as expected. If the disk was unable to recover any data, then the ''new'' block at LBA `0x123456` has vendor-specific data in it. The `sg_reassign` utility can also do bulk reassigns; see `man sg_reassign` for more information.
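
When this check is scripted, the before and after counts can be compared automatically. The following is a rough sketch, assuming the `>> Elements in grown defect list: N` line format shown in the examples above. The helper name `glist_count` is hypothetical, and the sample strings stand in for real `sg_reassign --grown` output.

```shell
#!/bin/sh
# Hypothetical helper: extract the element count from `sg_reassign --grown` output.
glist_count() {
    sed -n 's/^>> Elements in grown defect list: \([0-9][0-9]*\)$/\1/p'
}

# Sample outputs copied from the examples above (no disk access needed).
before=$(printf '%s\n' '>> Elements in grown defect list: 0' | glist_count)
after=$(printf '%s\n' '>> Elements in grown defect list: 1' | glist_count)

# A single successful reassignment should add exactly one entry.
if [ $((after - before)) -eq 1 ]; then
    echo "GLIST grew by one"
else
    echo "unexpected GLIST change: $before -> $after"
fi
```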
     714
     715The `dd` command could be used to read the contents of the ''new'' block:
     716
     717{{{
     718# dd if=/dev/sdb iflag=direct skip=1193046 of=blk.img bs=512 count=1
     719}}}
     720
     721and a hex editor [#footnote11 [11]] used to view and potentially change the `blk.img` file. An altered `blk.img` file (or `/dev/zero`) could be written back with:
     722
     723{{{
     724# dd if=blk.img of=/dev/sdb seek=1193046 oflag=direct bs=512 count=1
     725}}}
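
Because a wrong `seek` or `skip` value on a real disk can destroy good data, it may help to rehearse the two `dd` commands on a scratch file first. The following is a sketch under stated assumptions: all file names are hypothetical, a small stand-in LBA keeps the scratch file tiny (on the disk above it would be `1193046`), and `iflag=direct`/`oflag=direct` are left out because not every file system holding the scratch file supports them.

```shell
#!/bin/sh
# Rehearse the dd read and write-back on a scratch file; no real disk is touched.
work=$(mktemp -d)
img="$work/disk.img"   # hypothetical stand-in for /dev/sdb
lba=2048               # stand-in LBA, chosen small on purpose

# Sparse stand-in for the disk, a few blocks larger than the target LBA.
dd if=/dev/zero of="$img" bs=512 seek=$((lba + 8)) count=0 2>/dev/null

# Put recognisable bytes at the target block so the read-back can be checked.
printf 'BADBLOCK' | dd of="$img" bs=512 seek="$lba" conv=notrunc 2>/dev/null

# Read one block out, as in the first dd command above.
dd if="$img" of="$work/blk.img" bs=512 skip="$lba" count=1 2>/dev/null

# Overwrite the block with zeros. conv=notrunc matters for a regular file,
# where dd would otherwise truncate the output after the written block.
dd if=/dev/zero of="$img" bs=512 seek="$lba" count=1 conv=notrunc 2>/dev/null

# The saved copy still holds the marker; the image keeps its full size.
head -c 8 "$work/blk.img"; echo
size=$(wc -c < "$img")
echo "image size: $size bytes"
```

Note that `skip` applies to the input and `seek` to the output, both counted in units of `bs`, which is why the block size must match the disk's logical block size.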
     726
     727More work may be needed at the file system level, especially if the reassigned block held critical file system information such as a superblock or a directory.
     728
     729Even if a full backup of the disk is available, or the disk has been ''ejected'' from a RAID, it may still be worthwhile to reassign the bad block(s) that caused the problem, or simply to format the disk (see `sg_format` in the `sg3_utils` package), and re-use the disk later, much as a replacement disk from a manufacturer might be used.
     730
     731
    666732== Footnotes ==
    667733
     
    683749
    684750[=#footnote9 [9]] Often disks inside a hardware RAID have the ARRE and AWRE bits cleared (disabled) so the RAID controller can do things manually or flag the disk for replacement.
     751
     752[=#footnote10 [10]] In this case the corruption was manufactured by using the SCSI `WRITE LONG` command. See `sg_write_long` in `sg3_utils`.
     753
     754[=#footnote11 [11]] Most window managers have a handy calculator that will do hex to decimal conversions.