Changes between Version 11 and Version 12 of BadBlockHowto
Timestamp: Mar 26, 2017, 1:17:50 PM (8 years ago)
BadBlockHowto
v11 v12
10 10 Handling bad blocks is a difficult problem as it often involves decisions about losing information. Modern storage devices tend to handle the simple cases automatically, for example by writing a disk sector that was read with difficulty to another area on the media. Even though such a remapping can be done by a disk drive transparently, there is still a lingering worry about media deterioration and the disk running out of spare sectors to remap.
11 11
12 Can smartmontools help? As the SMART [#footnote1 [1]]acronym suggests, the `smartctl` command and the `smartd` daemon concentrate on monitoring and analysis. So apart from changing some reporting settings, smartmontools will not modify the raw data in a device. Also smartmontools only works with physical devices, it does not know about partitions and file systems. So other tools are needed. The job of smartmontools is to alert the user that something is wrong and user intervention may be required.
12 Can smartmontools help? As the SMART ^[#footnote1 [1]]^ acronym suggests, the `smartctl` command and the `smartd` daemon concentrate on monitoring and analysis. So apart from changing some reporting settings, smartmontools will not modify the raw data in a device. Also smartmontools only works with physical devices, it does not know about partitions and file systems. So other tools are needed. The job of smartmontools is to alert the user that something is wrong and user intervention may be required.
13 13
14 14 When a bad block is reported one approach is to work out the mapping between the logical block address used by a storage device and a file or some other component of a file system using that device. Note that there may not be such a mapping reflecting that a bad block has been found at a location not currently used by the file system. A user may want to do this analysis to localize and minimize the number of replacement files that are retrieved from some backup store.
This approach requires knowledge of the file system involved and this document uses the Linux ext2/ext3 and ReiserFS file systems for examples. Also the type of content may come into play. For example if an area storing video has a corrupted sector, it may be easiest to accept that a frame or two might be corrupted and instruct the disk not to retry as that may have the visual effect of causing a momentary blank into a 1 second pause (while the disk retries the faulty sector, often accompanied by a telltale clicking sound).
… …
18 18 == Repairs in a file system ==
19 19
20 This section contains examples of what to do at the file system level when smartmontools reports a bad block. These examples assume the Linux operating system and either the ext2/ext3 or ReiserFS file system. The various Linux commands shown have man pages and the reader is encouraged to examine these. Of note is the `dd` command which is often used in repair work [#footnote2 [2]]and has a unique command line syntax.
20 This section contains examples of what to do at the file system level when smartmontools reports a bad block. These examples assume the Linux operating system and either the ext2/ext3 or ReiserFS file system. The various Linux commands shown have man pages and the reader is encouraged to examine these. Of note is the `dd` command which is often used in repair work ^[#footnote2 [2]]^ and has a unique command line syntax.
21 21
22 22 The authors would like to thank Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others for explaining this approach. The authors would like to add text showing how to do this for other file systems, in particular XFS, and JFS: please email if you can provide this information.
… …
420 420 So it looks like we have the right (i.e. faulty) block address.
421 421
422 [Step 4] Try then to find the affected file [#footnote3 [3]]:
422 [Step 4] Try then to find the affected file ^[#footnote3 [3]]^:
423 423
424 424 {{{
… …
428 428 If you do not find any unreadable files, then the block may be free or located in some metadata of the file system.
429 429
430 [Step 5] Try your luck: bang the affected block with `badblocks -n` (non-destructive read-write mode, do unmount first), if you are very lucky the failure is transient and you can provoke reallocation [#footnote4 [4]]:
430 [Step 5] Try your luck: bang the affected block with `badblocks -n` (non-destructive read-write mode, do unmount first), if you are very lucky the failure is transient and you can provoke reallocation ^[#footnote4 [4]]^:
431 431
432 432 {{{
… …
465 465 Some software failures can lead to zeroes or random data being written on the first block of a disk. For disks that use a DOS-based partitioning scheme this will overwrite the partition table which is found at the end of the first block. This is a single point of failure so after the damage tools like `fdisk` have no alternate data to use so they report no partitions or a damaged partition table.
466 466
467 One utility that may help is `testdisk` [#footnote6 [6]]which can scan a disk looking for partitions and recreate a partition table if requested.
467 One utility that may help is `testdisk` ^[#footnote6 [6]]^ which can scan a disk looking for partitions and recreate a partition table if requested.
468 468
469 469 Programs that create DOS partitions often place the first partition at logical block address `63`. In Linux a loop back mount can be attempted at the appropriate offset of a disk with a damaged partition table. This approach may involve placing the disk with the damaged partition table in a working computer or perhaps an external USB enclosure. Assuming the disk with the damaged partition is `/dev/hdb`.
Then the following read-only loop back mount could be tried:
… …
487 487 }}}
488 488
489 Since the above command is potentially destructive it takes a copy of the block(s) holding the partition table(s) and puts it in part_block_prior.img prior to any changes. Then it changes the partition tables as indicated by `my_disk_partition_info.txt`. For what it is worth the author did test this on his system! [#footnote7 [7]]
489 Since the above command is potentially destructive it takes a copy of the block(s) holding the partition table(s) and puts it in part_block_prior.img prior to any changes. Then it changes the partition tables as indicated by `my_disk_partition_info.txt`. For what it is worth the author did test this on his system! ^[#footnote7 [7]]^
490 490
491 491 For creating, destroying, resizing, checking and copying partitions, and the file systems on them, GNU's `parted` is worth examining. The [http://www.tldp.org/HOWTO/Large-Disk-HOWTO.html Large Disk HOWTO] is also a useful resource.
… …
654 654 Other things can go wrong, typically associated with the transport and they will be reported using a term other than ''medium error''. For example a disk may decide a read operation was successful but a computer's host bus adapter (HBA) checking the incoming data detects a CRC error due to a bad cable or termination.
655 655
656 Depending on the disk vendor, recoverable errors can be ignored. After all, some disks have up to 68 bytes of ECC above the payload size of 512 bytes so why use up spare sectors which are limited in number [#footnote8 [8]] ? If the disk can recover the data and does decide to re-allocate (reassign) a sector, then first it checks the settings of the `ARRE` and `AWRE` bits in the read-write error recovery mode page. Usually these bits are set [#footnote9 [9]]enabling automatic (read or write) re-allocation.
The automatic re-allocation may also fail if the zone (or disk) has run out of spare sectors.
656 Depending on the disk vendor, recoverable errors can be ignored. After all, some disks have up to 68 bytes of ECC above the payload size of 512 bytes so why use up spare sectors which are limited in number ^[#footnote8 [8]]^ ? If the disk can recover the data and does decide to re-allocate (reassign) a sector, then first it checks the settings of the `ARRE` and `AWRE` bits in the read-write error recovery mode page. Usually these bits are set ^[#footnote9 [9]]^ enabling automatic (read or write) re-allocation. The automatic re-allocation may also fail if the zone (or disk) has run out of spare sectors.
657 657
658 658 Another consideration with RAIDs, and applications that require a high data rate without pauses, is that the controller logic may not want a disk to spend too long trying to recover an error.
… …
670 670 Then a program that can execute the `REASSIGN BLOCKS SCSI` command is required. In Linux (2.4 and 2.6 series), FreeBSD, Tru64(OSF) and Windows the author's `sg_reassign` utility in the `sg3_utils` package can be used. Also found in that package is `sg_verify` which can be used to check that a block is readable.
671 671
672 Assume that `logical block address 1193046` (which is `123456` in hex) is corrupt [#footnote10 [10]]on the disk at `/dev/sdb`. A long selftest command like `smartctl -t long /dev/sdb` may result in log results like this:
672 Assume that `logical block address 1193046` (which is `123456` in hex) is corrupt ^[#footnote10 [10]]^ on the disk at `/dev/sdb`. A long selftest command like `smartctl -t long /dev/sdb` may result in log results like this:
673 673
674 674 {{{
… …
719 719 }}}
720 720
721 and a hex editor [#footnote11 [11]]used to view and potentially change the `blk.img` file.
An altered `blk.img` file (or `/dev/zero`) could be written back with:
721 and a hex editor ^[#footnote11 [11]]^ used to view and potentially change the `blk.img` file. An altered `blk.img` file (or `/dev/zero`) could be written back with:
722 722
723 723 {{{