| 24 | === ext2/ext3 first example === |
| 25 | |
| 26 | In this example, the disk is failing self-tests at Logical Block Address `LBA = 0x016561e9 = 23421417`. The LBA counts sectors in units of 512 bytes, and starts at zero. |
| 27 | |
| 28 | {{{ |
| 29 | root]# smartctl -l selftest /dev/hda |
| 30 | SMART Self-test log structure revision number 1 |
| 31 | Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error |
| 32 | # 1 Extended offline Completed: read failure 90% 217 0x016561e9 |
| 33 | }}} |
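The hexadecimal-to-decimal conversion quoted above can be checked with a short Python sketch. The helper name `first_error_lba` is hypothetical. It assumes the log line has the layout shown in this example, with the LBA printed in hexadecimal as the last whitespace-separated field; other `smartctl` versions may format this column differently.

```python
def first_error_lba(log_line):
    """Return the LBA from the last column of a self-test log line, or None.

    Assumption: the final whitespace-separated field is either a hexadecimal
    LBA such as 0x016561e9, or "-" when the test completed without error.
    """
    fields = log_line.split()
    if not fields or fields[-1] == "-":
        return None
    return int(fields[-1], 16)  # base 16 accepts the leading "0x"


line = "# 1 Extended offline Completed: read failure 90% 217 0x016561e9"
print(first_error_lba(line))
```

For the log line in this example the result is the decimal LBA used in the rest of the walkthrough.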
| 34 | |
| 35 | Another sign of a bad sector on the disk is the non-zero raw value of the `Current_Pending_Sector` attribute: |
| 36 | |
| 37 | {{{ |
| 38 | root]# smartctl -A /dev/hda |
| 39 | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
| 40 | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 |
| 41 | 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 |
| 42 | 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1 |
| 43 | 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1 |
| 44 | }}} |
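When checking several disks, the raw values can be pulled out of saved `smartctl -A` output with a few lines of Python. This is only a sketch: the function name `raw_values` is hypothetical, and it assumes attribute rows look like the ones above (a numeric ID first, the raw value in the tenth column). Rows whose raw value does not start with a plain integer are skipped.

```python
def raw_values(smartctl_output):
    """Map attribute name -> integer raw value for rows shaped like the table above."""
    values = {}
    for row in smartctl_output.splitlines():
        fields = row.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row starts with "ID#" and is therefore ignored.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return values


sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
"""

attrs = raw_values(sample)
if attrs.get("Current_Pending_Sector", 0) > 0:
    print("pending sectors:", attrs["Current_Pending_Sector"])
```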
| 45 | |
| 46 | First Step: We need to locate the partition on which this sector of the disk lives: |
| 47 | |
| 48 | {{{ |
| 49 | root]# fdisk -lu /dev/hda |
| 50 | Disk /dev/hda: 123.5 GB, 123522416640 bytes |
| 51 | 255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors |
| 52 | Units = sectors of 1 * 512 = 512 bytes |
| 53 | Device Boot Start End Blocks Id System |
| 54 | /dev/hda1 * 63 4209029 2104483+ 83 Linux |
| 55 | /dev/hda2 4209030 5269319 530145 82 Linux swap |
| 56 | /dev/hda3 5269320 238227884 116479282+ 83 Linux |
| 57 | /dev/hda4 238227885 241248104 1510110 83 Linux |
| 58 | }}} |
| 59 | |
| 60 | The partition `/dev/hda3` starts at `LBA 5269320` and extends past the ''problem'' LBA. The ''problem'' LBA is offset `23421417 - 5269320` = `18152097` sectors into the partition `/dev/hda3`. |
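The partition lookup and the subtraction above can be reproduced with a minimal Python sketch. The table `PARTITIONS` and the function `locate` are hypothetical names; the start and end sectors are copied by hand from the `fdisk -lu` output in this example.

```python
# (device, start sector, end sector), copied from the fdisk -lu output above.
PARTITIONS = [
    ("/dev/hda1", 63, 4209029),
    ("/dev/hda2", 4209030, 5269319),
    ("/dev/hda3", 5269320, 238227884),
    ("/dev/hda4", 238227885, 241248104),
]


def locate(lba, partitions=PARTITIONS):
    """Return (device, sector offset into the partition), or None if unpartitioned."""
    for device, start, end in partitions:
        if start <= lba <= end:
            return device, lba - start
    return None


print(locate(23421417))
```

A result of `None` would mean the LBA lies outside every partition, for example in the first 63 sectors of this disk.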
| 61 | |
| 62 | To verify the type of the file system and the mount point, look in `/etc/fstab`: |
| 63 | |
| 64 | {{{ |
| 65 | root]# grep hda3 /etc/fstab |
| 66 | /dev/hda3 /data ext2 defaults 1 2 |
| 67 | }}} |
| 68 | |
| 69 | You can see that this is an `ext2` file system, mounted at `/data`. |
| 70 | |
| 71 | Second Step: we need to find the block size of the file system (normally 4096 bytes for `ext2`): |
| 72 | |
| 73 | {{{ |
| 74 | root]# tune2fs -l /dev/hda3 | grep Block |
| 75 | Block count: 29119820 |
| 76 | Block size: 4096 |
| 77 | }}} |
| 78 | |
| 79 | In this case the block size is 4096 bytes. Third Step: we need to determine which File System Block contains this LBA. The formula is: |
| 80 | |
| 81 | {{{ |
| 82 | b = (int)((L-S)*512/B) |
| 83 | where: |
| 84 | b = File System block number |
| 85 | B = File system block size in bytes |
| 86 | L = LBA of bad sector |
| 87 | S = Starting sector of partition as shown by fdisk -lu |
| 88 | and (int) denotes the integer part. |
| 89 | }}} |
| 90 | |
| 91 | In our example, `L=23421417`, `S=5269320`, and `B=4096`. Hence the ''problem'' LBA is in block number |
| 92 | |
| 93 | {{{ |
| 94 | b = (int)(18152097*512/4096) = (int)(2269012.125) |
| 95 | so b=2269012. |
| 96 | }}} |
| 97 | |
| 98 | Note: the fractional part of `0.125` indicates that this problem LBA is actually the second of the eight sectors that make up this file system block. |
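The formula and the note about the fractional part can be combined in a small Python sketch. The function name `fs_block` is hypothetical, and the sketch assumes that the file system block size is a whole multiple of the 512-byte sector size, in which case integer division gives the same result as `(int)((L-S)*512/B)`.

```python
SECTOR_SIZE = 512  # bytes per LBA sector, as stated at the top of this example


def fs_block(lba, part_start, block_size):
    """Return (file system block number, zero-based sector index inside that block)."""
    offset = lba - part_start                      # L - S, in sectors
    sectors_per_block = block_size // SECTOR_SIZE  # 8 for 4096-byte blocks
    return divmod(offset, sectors_per_block)


block, sector_in_block = fs_block(23421417, 5269320, 4096)
print(block, sector_in_block)
```

The zero-based index 1 corresponds to the fractional part 0.125 = 1/8, that is, the second of the eight sectors in the block.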
| 99 | |
| 100 | Fourth Step: we use `debugfs` to check whether this block is in use and, if it is, to find the inode that owns it and the file name that belongs to that inode: |
| 101 | |
| 102 | {{{ |
| 103 | root]# debugfs |
| 104 | debugfs 1.32 (09-Nov-2002) |
| 105 | debugfs: open /dev/hda3 |
| 106 | debugfs: testb 2269012 |
| 107 | Block 2269012 not in use |
| 108 | }}} |
| 109 | |
| 110 | If the block is not in use, as in the above example, then you can skip the rest of this step and go ahead to Step Five. |
| 111 | |
| 112 | If, on the other hand, the block is in use, we want to identify the file that uses it: |
| 113 | |
| 114 | {{{ |
| 115 | debugfs: testb 2269012 |
| 116 | Block 2269012 marked in use |
| 117 | debugfs: icheck 2269012 |
| 118 | Block Inode number |
| 119 | 2269012 41032 |
| 120 | debugfs: ncheck 41032 |
| 121 | Inode Pathname |
| 122 | 41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf |
| 123 | }}} |
| 124 | |
| 125 | In this example, you can see that the problematic file (with the mount point included in the path) is: `/data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf` |
| 126 | |
| 127 | When we are working with an `ext3` file system, it may happen that the affected file is the journal itself. Generally, if this is the case, the inode number will be very small. In any case, `debugfs` will not be able to get the file name: |
| 128 | |
| 129 | {{{ |
| 130 | debugfs: testb 2269012 |
| 131 | Block 2269012 marked in use |
| 132 | debugfs: icheck 2269012 |
| 133 | Block Inode number |
| 134 | 2269012 8 |
| 135 | debugfs: ncheck 8 |
| 136 | Inode Pathname |
| 137 | debugfs: |
| 138 | }}} |
| 139 | |
| 140 | To get around this situation, we can remove the journal altogether: |
| 141 | |
| 142 | {{{ |
| 143 | tune2fs -O ^has_journal /dev/hda3 |
| 144 | }}} |
| 145 | |
| 146 | and then start again with Step Four: this time the bad block should no longer be in use. If we removed the journal, we should remember to rebuild it at the end of the whole procedure: |
| 147 | |
| 148 | {{{ |
| 149 | tune2fs -j /dev/hda3 |
| 150 | }}} |
| 151 | |
| 152 | Fifth Step NOTE: '''This last step will permanently and irretrievably destroy the contents of the file system block that is damaged''': if the block was allocated to a file, some of the data that is in this file is going to be overwritten with zeros. You will not be able to recover that data unless you can replace the file with a fresh or correct version. |
| 153 | |
| 154 | To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk: |
| 155 | |
| 156 | {{{ |
| 157 | root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012 |
| 158 | root]# sync |
| 159 | }}} |
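Because a `dd` command like the one above is destructive, it can be worth double-checking the `bs` and `seek` values before running it. The following Python sketch uses a hypothetical helper, `dd_range_covers`, which does no disk access. It only verifies arithmetically that the byte range written to the partition contains the bad sector.

```python
def dd_range_covers(lba, part_start, block, block_size, sector_size=512):
    """True if writing one block of block_size bytes at seek=block covers the bad sector."""
    bad_start = (lba - part_start) * sector_size  # byte offset of the bad sector in the partition
    write_start = block * block_size              # first byte written by dd
    write_end = write_start + block_size          # one past the last byte written
    return write_start <= bad_start and bad_start + sector_size <= write_end


# Values from this example: bs=4096, seek=2269012 on /dev/hda3.
print(dd_range_covers(23421417, 5269320, 2269012, 4096))
```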
| 160 | |
| 161 | Now everything is back to normal: the sector has been reallocated. Compare the output just below to similar output near the top of this article: |
| 162 | |
| 163 | {{{ |
| 164 | root]# smartctl -A /dev/hda |
| 165 | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
| 166 | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1 |
| 167 | 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1 |
| 168 | 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 |
| 169 | 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1 |
| 170 | }}} |
| 171 | |
| 172 | Note: for some disks it may be necessary to update the SMART Attribute values by using `smartctl -t offline /dev/hda` |
| 173 | |
| 174 | We have corrected the first bad block. If more than one block was affected, repeat all the steps for each remaining one. After that, the disk will pass its self-tests again: |
| 175 | |
| 176 | {{{ |
| 177 | root]# smartctl -t long /dev/hda [wait until test completes, then] |
| 178 | root]# smartctl -l selftest /dev/hda |
| 179 | SMART Self-test log structure revision number 1 |
| 180 | Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error |
| 181 | # 1 Extended offline Completed without error 00% 239 - |
| 182 | # 2 Extended offline Completed: read failure 90% 217 0x016561e9 |
| 183 | # 3 Extended offline Completed: read failure 90% 212 0x016561e9 |
| 184 | # 4 Extended offline Completed: read failure 90% 181 0x016561e9 |
| 185 | # 5 Extended offline Completed without error 00% 14 - |
| 186 | # 6 Extended offline Completed without error 00% 4 - |
| 187 | }}} |
| 188 | |
| 189 | and no longer shows any offline uncorrectable sectors: |
| 190 | |
| 191 | {{{ |
| 192 | root]# smartctl -A /dev/hda |
| 193 | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
| 194 | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1 |
| 195 | 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1 |
| 196 | 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 |
| 197 | 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 |
| 198 | }}} |
| 199 | |
| 200 | == Footnotes == |
| 201 | |