| 200 | === ext2/ext3 second example === |
| 201 | |
| 202 | On this drive, the first sign of trouble was this email from `smartd`: |
| 203 | |
| 204 | {{{ |
| 205 | To: ballen |
| 206 | Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu |
| 207 | This email was generated by the smartd daemon running on host: |
| 208 | medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis |
| 209 | The following warning/error was logged by the smartd daemon: |
| 210 | Device: /dev/hda, Self-Test Log error count increased from 0 to 1 |
| 211 | }}} |
| 212 | |
| 213 | Running `smartctl -a /dev/hda` confirmed the problem: |
| 214 | |
| 215 | {{{ |
| 216 | Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error |
| 217 | # 1 Extended offline Completed: read failure 80% 682 0x021d9f44 |
| 218 | Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10) |
| 219 | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
| 220 | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 |
| 221 | 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 |
| 222 | 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 3 |
| 223 | 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 3 |
| 224 | }}} |
| 225 | |
| 226 | The attribute table above shows 3 sectors on the pending list: sectors that the disk cannot read but would like to reallocate. |
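The hexadecimal-to-decimal conversion quoted in the self-test output can be checked directly in the shell. This is a minimal sketch; the function name `hex_to_dec` is made up for illustration.

```shell
# Convert a hexadecimal LBA (as printed in the smartctl self-test log)
# to decimal. printf accepts C-style constants with a 0x prefix.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x021d9f44   # LBA of first error in the self-test log
hex_to_dec 0x021d9f46   # LBA reported in the SMART error log
```

The decimal values are the ones to use with `dd` and in the block arithmetic later on.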
| 227 | |
| 228 | The device also shows errors in the SMART error log: |
| 229 | |
| 230 | {{{ |
| 231 | Error 212 occurred at disk power-on lifetime: 690 hours |
| 232 | After command completion occurred, registers were: |
| 233 | ER ST SC SN CL CH DH |
| 234 | -- -- -- -- -- -- -- |
| 235 | 40 51 12 46 9f 1d e2 Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750 |
| 236 | Commands leading to the command that caused the error were: |
| 237 | CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name |
| 238 | -- -- -- -- -- -- -- -- --------- -------------------- |
| 239 | 25 00 12 46 9f 1d e0 00 2485545.000 READ DMA EXT |
| 240 | }}} |
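The LBA printed in the error log can be cross-checked against the register bytes on the same line. The sketch below assumes the conventional 28-bit ATA register layout (bits 27:24 in the low nibble of DH, then CH, CL, SN); the function name `lba_from_registers` is hypothetical.

```shell
# Rebuild a 28-bit LBA from the ATA register bytes shown in the SMART
# error log. Arguments are the hex bytes SN, CL, CH, DH (no 0x prefix).
# Assumption: 28-bit layout, LBA bits 27:24 live in the low nibble of DH.
lba_from_registers() {
    sn=$1 cl=$2 ch=$3 dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Register values from the error entry above: SN=46 CL=9f CH=1d DH=e2
lba_from_registers 46 9f 1d e2
```

For the entry above this yields 35495750, which agrees with the LBA that `smartctl` prints.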
| 241 | |
| 242 | Signs of trouble at this LBA may also be found in `SYSLOG`: |
| 243 | |
| 244 | {{{ |
| 245 | [root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq |
| 246 | LBAsect=35495748 |
| 247 | LBAsect=35495750 |
| 248 | }}} |
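The `awk '{print $12}'` step depends on the field position of the `LBAsect=` token, which can differ between kernel versions and log formats. A variant that matches the token itself might look like the following sketch; the function name `failing_lbas` is an illustrative assumption.

```shell
# List the distinct LBAsect=... values found in a log file.
# Matching the token itself avoids depending on its field position.
failing_lbas() {
    grep -o 'LBAsect=[0-9]*' "$1" | sort -u
}

# Example:
#   failing_lbas /var/log/messages
```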
| 249 | |
| 250 | So I decided to do a quick check to see how many bad sectors there really were. Using the `bash` shell, I checked 70 sectors around the trouble area: |
| 251 | |
| 252 | {{{ |
| 253 | [root]# export i=35495730 |
| 254 | [root]# while [ $i -lt 35495800 ] |
| 255 | > do echo $i |
| 256 | > dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i |
| 257 | > let i+=1 |
| 258 | > done |
| 259 | <SNIP> |
| 260 | 35495734 |
| 261 | 1+0 records in |
| 262 | 1+0 records out |
| 263 | 35495735 |
| 264 | dd: reading `/dev/hda': Input/output error |
| 265 | 0+0 records in |
| 266 | 0+0 records out |
| 267 | <SNIP> |
| 268 | 35495751 |
| 269 | dd: reading `/dev/hda': Input/output error |
| 270 | 0+0 records in |
| 271 | 0+0 records out |
| 272 | 35495752 |
| 273 | 1+0 records in |
| 274 | 1+0 records out |
| 275 | <SNIP> |
| 276 | }}} |
| 277 | |
| 278 | which shows that the seventeen sectors `35495735-35495751` (inclusive) are not readable. |
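When the range is larger, reading through the `dd` output by eye becomes tedious. The scan above can be sketched as a function that prints only the sector numbers `dd` fails to read, relying on `dd` returning a non-zero exit status on a read error. The function name `unreadable_sectors` is hypothetical.

```shell
# Print the number of every 512-byte sector in [first, last] that dd
# cannot read. Arguments: device (or file), first sector, last sector.
unreadable_sectors() {
    dev=$1 n=$2 last=$3
    while [ "$n" -le "$last" ]; do
        # dd exits with a non-zero status when the read fails
        if ! dd if="$dev" of=/dev/null bs=512 count=1 skip="$n" 2>/dev/null; then
            echo "$n"
        fi
        n=$((n + 1))
    done
}

# Example (read-only; run as root against the real disk):
#   unreadable_sectors /dev/hda 35495730 35495799
```

Piping the result through `wc -l` gives the count of unreadable sectors directly.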
| 279 | |
| 280 | Next, we identify the files at those locations. The partitioning information on this disk is identical to the first example above, and as in that case the problem sectors are on the third partition `/dev/hda3`. So we have: |
| 281 | |
| 282 | {{{ |
| 283 | L=35495735 to 35495751 |
| 284 | S=5269320 |
| 285 | B=4096 |
| 286 | }}} |
| 287 | |
| 288 | Converting these sectors to file system blocks, as in the first example, gives `b=3778301` to `3778303`: these are the three bad blocks in the file system. |
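The arithmetic behind these block numbers can be reproduced in the shell. The sketch assumes the conversion used in the first example is `b = (int)((L-S)*512/B)`, with `L` the disk sector, `S` the partition's starting sector and `B` the file system block size in bytes. The function name `fs_block` is illustrative.

```shell
# File system block containing disk sector L, for a partition starting
# at sector S with block size B bytes: b = (int)((L - S) * 512 / B).
# Shell integer division truncates, which is the (int) cast we need.
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

fs_block 35495735 5269320 4096   # first unreadable sector
fs_block 35495751 5269320 4096   # last unreadable sector
```

The two results, 3778301 and 3778303, bound the range of affected blocks.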
| 289 | |
| 290 | {{{ |
| 291 | [root]# debugfs |
| 292 | debugfs 1.32 (09-Nov-2002) |
| 293 | debugfs: open /dev/hda3 |
| 294 | debugfs: icheck 3778301 |
| 295 | Block Inode number |
| 296 | 3778301 45192 |
| 297 | debugfs: icheck 3778302 |
| 298 | Block Inode number |
| 299 | 3778302 45192 |
| 300 | debugfs: icheck 3778303 |
| 301 | Block Inode number |
| 302 | 3778303 45192 |
| 303 | debugfs: ncheck 45192 |
| 304 | Inode Pathname |
| 305 | 45192 /S1/R/H/714979488-714985279/H-R-714979984-16.gwf |
| 306 | debugfs: quit |
| 307 | }}} |
| 308 | |
| 309 | Note that the first few steps of this procedure could also be done with a single command, which is very helpful if there are many bad blocks (thanks to Danie Marais for pointing this out): |
| 310 | |
| 311 | {{{ |
| 312 | debugfs: icheck 3778301 3778302 3778303 |
| 313 | }}} |
| 314 | |
| 315 | To confirm that this is really the damaged file: |
| 316 | |
| 317 | {{{ |
| 318 | [root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf |
| 319 | md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error |
| 320 | }}} |
| 321 | |
| 322 | Finally, we force the disk to reallocate the three bad blocks by overwriting them with zeros: |
| 323 | |
| 324 | {{{ |
| 325 | [root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301 |
| 326 | [root]# sync |
| 327 | }}} |
| 328 | |
| 329 | Writing zeros to just the 17 unreadable sectors on the whole-disk device would probably also work: |
| 330 | |
| 331 | {{{ |
| 332 | [root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735 |
| 333 | }}} |
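The two `dd` commands do not touch exactly the same sectors. The following sketch maps a run of file system blocks back to disk sectors, under the same assumptions as before (partition start `S=5269320`, block size `B=4096`); the function name `block_sector_range` is hypothetical.

```shell
# First and last disk sectors covered by `count` file system blocks
# starting at block b, for a partition starting at sector S with
# block size B bytes.
block_sector_range() {
    b=$1 count=$2 s=$3 bsize=$4
    per_block=$((bsize / 512))
    echo "$((s + b * per_block)) $((s + (b + count) * per_block - 1))"
}

block_sector_range 3778301 3 5269320 4096
```

This prints `35495728 35495751`: the block-level write covers 24 sectors, that is the 17 unreadable ones plus the 7 readable sectors that share block 3778301 with the first bad sector, while the sector-level write covers only the 17.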
| 334 | |
| 335 | At this point we now have: |
| 336 | |
| 337 | {{{ |
| 338 | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
| 339 | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 |
| 340 | 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 |
| 341 | 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 |
| 342 | 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 |
| 343 | }}} |
| 344 | |
| 345 | which is encouraging, since the pending sector count is now zero. Note that the reallocation count has not increased: after the successful write, the drive may now have confidence in these sectors and have decided not to reallocate them. |
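To watch a single counter without reading the whole table, the raw value can be pulled out with `awk`. This sketch assumes the raw value is the last column, which holds for the four attributes shown here but not for attributes whose raw field carries extra text. The function name `smart_raw_value` is illustrative.

```shell
# Print the RAW_VALUE (last column) of one SMART attribute, selected by
# its ID, from attribute-table output supplied on standard input.
# Assumption: the raw value is a single token in the last column.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Example:
#   smartctl -A /dev/hda | smart_raw_value 197
```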
| 346 | |
| 347 | A long device self-test confirms the repair: |
| 348 | |
| 349 | {{{ |
| 350 | [root]# smartctl -t long /dev/hda |
| 351 | (then wait about an hour) shows no unreadable sectors or errors: |
| 352 | Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error |
| 353 | # 1 Extended offline Completed without error 00% 692 - |
| 354 | # 2 Extended offline Completed: read failure 80% 682 0x021d9f44 |
| 355 | }}} |
| 356 | |