Changes between Version 11 and Version 12 of BadBlockHowto


Timestamp:
Mar 26, 2017, 1:17:50 PM
Author:
Gabriele Pohl
Comment:

superscript footnote anchors

  • BadBlockHowto

    v11 v12  
    1010Handling bad blocks is a difficult problem as it often involves decisions about losing information. Modern storage devices tend to handle the simple cases automatically, for example by writing a disk sector that was read with difficulty to another area on the media. Even though such a remapping can be done by a disk drive transparently, there is still a lingering worry about media deterioration and the disk running out of spare sectors to remap.
    1111
    12 Can smartmontools help? As the SMART [#footnote1 [1]] acronym suggests, the `smartctl` command and the `smartd` daemon concentrate on monitoring and analysis. So apart from changing some reporting settings, smartmontools will not modify the raw data in a device. Also, smartmontools only works with physical devices; it does not know about partitions and file systems, so other tools are needed for those. The job of smartmontools is to alert the user that something is wrong and that user intervention may be required.
     12Can smartmontools help? As the SMART ^[#footnote1 [1]]^ acronym suggests, the `smartctl` command and the `smartd` daemon concentrate on monitoring and analysis. So apart from changing some reporting settings, smartmontools will not modify the raw data in a device. Also, smartmontools only works with physical devices; it does not know about partitions and file systems, so other tools are needed for those. The job of smartmontools is to alert the user that something is wrong and that user intervention may be required.
    1313
    1414When a bad block is reported, one approach is to work out the mapping between the logical block address used by a storage device and a file or some other component of a file system using that device. Note that there may be no such mapping, which indicates that the bad block lies at a location not currently used by the file system. A user may want to do this analysis to localize and minimize the number of replacement files that are retrieved from some backup store. This approach requires knowledge of the file system involved, and this document uses the Linux ext2/ext3 and ReiserFS file systems for examples. The type of content may also come into play. For example, if an area storing video has a corrupted sector, it may be easiest to accept that a frame or two might be corrupted and instruct the disk not to retry, as retrying may turn a momentary blank into a 1 second pause (while the disk retries the faulty sector, often accompanied by a telltale clicking sound).
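As a rough illustration of the mapping described above, the following Python sketch converts a device logical block address into a file system block number within a partition. The function name and the sample values (a partition starting at sector 63, 4096-byte file system blocks) are assumptions made for illustration only, and 512-byte sectors are assumed throughout; on a real system the partition start and block size would have to be read from the partition table and the file system itself.

```python
SECTOR_SIZE = 512  # bytes per logical sector (assumed; some disks use 4096)


def lba_to_fs_block(lba, part_start_lba, fs_block_size, sector_size=SECTOR_SIZE):
    """Map a device LBA to a file system block index inside a partition.

    Returns None when the LBA lies before the start of the partition,
    i.e. when the partition has no block that contains it.
    """
    if lba < part_start_lba:
        return None
    # Byte offset of the sector inside the partition, rounded down to
    # the file system block that contains it.
    return (lba - part_start_lba) * sector_size // fs_block_size


if __name__ == "__main__":
    # Hypothetical values: bad LBA 1193046, partition starts at sector 63,
    # file system uses 4096-byte blocks.
    print(lba_to_fs_block(1193046, 63, 4096))
```

The resulting block number is only meaningful to file system tools for the partition in question; whether that block belongs to a file, to metadata, or to free space still has to be determined with those tools.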
     
    1818== Repairs in a file system ==
    1919
    20 This section contains examples of what to do at the file system level when smartmontools reports a bad block. These examples assume the Linux operating system and either the ext2/ext3 or ReiserFS file system. The various Linux commands shown have man pages and the reader is encouraged to examine these. Of note is the `dd` command which is often used in repair work [#footnote2 [2]] and has a unique command line syntax.
     20This section contains examples of what to do at the file system level when smartmontools reports a bad block. These examples assume the Linux operating system and either the ext2/ext3 or ReiserFS file system. The various Linux commands shown have man pages and the reader is encouraged to examine these. Of note is the `dd` command which is often used in repair work ^[#footnote2 [2]]^ and has a unique command line syntax.
    2121
    2222The authors would like to thank Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others for explaining this approach. The authors would like to add text showing how to do this for other file systems, in particular XFS, and JFS: please email if you can provide this information.
     
    420420So it looks like we have the right (i.e. faulty) block address.
    421421
    422 [Step 4] Then try to find the affected file [#footnote3 [3]]:
     422[Step 4] Then try to find the affected file ^[#footnote3 [3]]^:
    423423
    424424{{{
     
    428428If you do not find any unreadable files, then the block may be free or located in some metadata of the file system.
    429429
    430 [Step 5] Try your luck: bang the affected block with `badblocks -n` (non-destructive read-write mode; do unmount first). If you are very lucky, the failure is transient and you can provoke reallocation [#footnote4 [4]]:
     430[Step 5] Try your luck: bang the affected block with `badblocks -n` (non-destructive read-write mode; do unmount first). If you are very lucky, the failure is transient and you can provoke reallocation ^[#footnote4 [4]]^:
    431431
    432432{{{
     
    465465Some software failures can lead to zeroes or random data being written on the first block of a disk. For disks that use a DOS-based partitioning scheme this will overwrite the partition table, which is found at the end of the first block. This is a single point of failure: after the damage, tools like `fdisk` have no alternate data to use, so they report no partitions or a damaged partition table.
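To make the layout concrete: in the DOS scheme the first 512-byte block holds four 16-byte partition entries starting at byte offset 446, followed by the two signature bytes `0x55 0xAA` at offset 510. The Python sketch below decodes such a block. The function name `parse_mbr` and the dictionary keys are hypothetical, and the demonstration uses a synthetic block built in memory rather than a real disk.

```python
import struct

MBR_SIZE = 512
TABLE_OFFSET = 446          # partition table starts here
ENTRY_SIZE = 16             # four entries of 16 bytes each
SIGNATURE = b"\x55\xaa"     # last two bytes of the block


def parse_mbr(block):
    """Decode the primary partition entries of a DOS-style first block.

    Returns None if the signature is missing (e.g. the block was zeroed),
    otherwise a list of dicts for the non-empty entries.
    """
    if len(block) < MBR_SIZE:
        raise ValueError("need at least 512 bytes")
    if block[510:512] != SIGNATURE:
        return None
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = block[start:start + ENTRY_SIZE]
        # status byte, 3 CHS bytes (skipped), type byte, 3 CHS bytes
        # (skipped), start LBA and sector count as little-endian uint32
        status, ptype, start_lba, sectors = struct.unpack("<B3xB3xII", raw)
        if ptype != 0:  # type 0 marks an unused entry
            entries.append({
                "index": i,
                "bootable": status == 0x80,
                "type": ptype,
                "start_lba": start_lba,
                "sectors": sectors,
            })
    return entries


if __name__ == "__main__":
    # Synthetic example: one bootable Linux (type 0x83) partition at LBA 63.
    demo = bytearray(MBR_SIZE)
    demo[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = struct.pack(
        "<B3xB3xII", 0x80, 0x83, 63, 1000)
    demo[510:512] = SIGNATURE
    print(parse_mbr(bytes(demo)))
    print(parse_mbr(bytes(MBR_SIZE)))  # zeroed block: no signature
```

A block that has been overwritten with zeroes fails the signature check, which corresponds to the situation where partitioning tools report no partitions.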
    466466
    467 One utility that may help is `testdisk` [#footnote6 [6]] which can scan a disk looking for partitions and recreate a partition table if requested.
     467One utility that may help is `testdisk` ^[#footnote6 [6]]^ which can scan a disk looking for partitions and recreate a partition table if requested.
    468468
    469469Programs that create DOS partitions often place the first partition at logical block address `63`. In Linux a loop back mount can be attempted at the appropriate offset of a disk with a damaged partition table. This approach may involve placing the disk with the damaged partition table in a working computer or perhaps an external USB enclosure. Assuming the disk with the damaged partition table is `/dev/hdb`, the following read-only loop back mount could be tried:
     
    487487}}}
    488488
    489 Since the above command is potentially destructive, it takes a copy of the block(s) holding the partition table(s) and puts it in `part_block_prior.img` before making any changes. Then it changes the partition tables as indicated by `my_disk_partition_info.txt`. For what it is worth the author did test this on his system! [#footnote7 [7]]
     489Since the above command is potentially destructive, it takes a copy of the block(s) holding the partition table(s) and puts it in `part_block_prior.img` before making any changes. Then it changes the partition tables as indicated by `my_disk_partition_info.txt`. For what it is worth the author did test this on his system! ^[#footnote7 [7]]^
    490490
    491491For creating, destroying, resizing, checking and copying partitions, and the file systems on them, GNU's `parted` is worth examining. The [http://www.tldp.org/HOWTO/Large-Disk-HOWTO.html Large Disk HOWTO] is also a useful resource.
     
    654654Other things can go wrong, typically associated with the transport, and they will be reported using a term other than ''medium error''. For example, a disk may decide a read operation was successful, but a computer's host bus adapter (HBA) checking the incoming data detects a CRC error due to a bad cable or termination.
    655655
    656 Depending on the disk vendor, recoverable errors can be ignored. After all, some disks have up to 68 bytes of ECC above the payload size of 512 bytes, so why use up spare sectors, which are limited in number [#footnote8 [8]]? If the disk can recover the data and does decide to re-allocate (reassign) a sector, it first checks the settings of the `ARRE` and `AWRE` bits in the read-write error recovery mode page. Usually these bits are set [#footnote9 [9]], enabling automatic (read or write) re-allocation. The automatic re-allocation may also fail if the zone (or disk) has run out of spare sectors.
     656Depending on the disk vendor, recoverable errors can be ignored. After all, some disks have up to 68 bytes of ECC above the payload size of 512 bytes, so why use up spare sectors, which are limited in number ^[#footnote8 [8]]^? If the disk can recover the data and does decide to re-allocate (reassign) a sector, it first checks the settings of the `ARRE` and `AWRE` bits in the read-write error recovery mode page. Usually these bits are set ^[#footnote9 [9]]^, enabling automatic (read or write) re-allocation. The automatic re-allocation may also fail if the zone (or disk) has run out of spare sectors.
    657657
    658658Another consideration with RAIDs, and applications that require a high data rate without pauses, is that the controller logic may not want a disk to spend too long trying to recover an error.
     
    670670Then a program that can execute the SCSI `REASSIGN BLOCKS` command is required. In Linux (2.4 and 2.6 series), FreeBSD, Tru64(OSF) and Windows the author's `sg_reassign` utility in the `sg3_utils` package can be used. Also found in that package is `sg_verify` which can be used to check that a block is readable.
    671671
    672 Assume that logical block address `1193046` (which is `123456` in hex) is corrupt [#footnote10 [10]] on the disk at `/dev/sdb`. A long self-test command like `smartctl -t long /dev/sdb` may result in log results like this:
     672Assume that logical block address `1193046` (which is `123456` in hex) is corrupt ^[#footnote10 [10]]^ on the disk at `/dev/sdb`. A long self-test command like `smartctl -t long /dev/sdb` may result in log results like this:
    673673
    674674{{{
     
    719719}}}
    720720
    721 and a hex editor [#footnote11 [11]] used to view and potentially change the `blk.img` file. An altered `blk.img` file (or `/dev/zero`) could be written back with:
     721and a hex editor ^[#footnote11 [11]]^ used to view and potentially change the `blk.img` file. An altered `blk.img` file (or `/dev/zero`) could be written back with:
    722722
    723723{{{