Opened 5 years ago
Last modified 22 months ago
#1245 reopened defect
smartd continues to write identical disk data for a failed drive
Reported by: | Ulrich | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | undecided |
Component: | smartd | Version: | |
Keywords: | scsi | Cc: |
Description
We had a SCSI drive failure in an HP Smart Array (cciss) during a long self-test. The start of the test was logged, and a temperature change was logged too, but from then on each S.M.A.R.T. request failed (because the disk died during the self-test).
Amazingly, smartd continues to write identical SMART values to the CSV file, and there is no indication that the drive is actually dead.
The version of smartmontools being used is 6.6 from SLES 12 SP4.
Attachments (2)
Change History (11)
comment:1 by , 5 years ago
Keywords: | scsi added |
---|---|
Milestone: | → undecided |
comment:2 by , 5 years ago
For the easier part, I'm afraid this is not the information you are after:
"smartd -r ioctl,2 -q onecheck" seems to output an endless number of zeros:
smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-95.32-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Opened configuration file /etc/smartd.conf
Configuration file /etc/smartd.conf parsed.
===== [LUN DATA] DATA START (BASE-16) =====
000-015: 00 00 00 18 00 00 00 00 00 00 00 c0 00 00 00 01
016-031: 00 00 00 c0 00 00 01 01 00 00 00 c0 00 00 fa 01
032-047: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
048-063: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
064-079: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
080-095: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
096-111: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
112-127: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
128-143: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
144-159: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
160-175: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
176-191: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
192-207: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
208-223: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
224-239: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
240-255: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
256-271: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
272-287: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
288-303: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
304-319: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
320-335: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
336-351: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
352-367: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
368-383: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
384-399: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
400-415: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
416-431: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
432-447: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
448-463: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
464-479: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
480-495: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
496-511: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
512-527: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
528-543: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
544-559: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
560-575: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
7184-7199: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7200-7215: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7216-7231: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7232-7247: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7248-7263: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7264-7279: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7280-7295: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7296-7311: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7312-7327: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7328-7343: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7344-7359: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
Note on the CSV: The failure occurred on 2019-10-01 between 5 and 6 o'clock. After a reboot of the server on the morning of 2019-10-08, the disk became "alive" again.
Syslog:
2019-10-01T05:48:25.750970+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], self-test in progress
2019-10-01T05:48:25.751243+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], Temperature changed +2 Celsius to 28 Celsius (Min/Max 22/29)
2019-10-01T06:18:25.793639+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read SMART values
2019-10-01T06:18:25.793958+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read Temperature
2019-10-01T06:48:26.020312+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read Temperature
# after reboot
2019-10-08T09:57:26.433922+02:00 h06 smartd[2932]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], initial Temperature is 24 Celsius (Min/Max 22/29)
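Since smartd itself gives no other indication, the "failed to read" messages in syslog are currently the only sign that the drive is gone. As a possible workaround, the following is a minimal Python sketch that scans syslog lines for those messages and reports the earliest one per device. It assumes one message per line in the RFC 3339 timestamp layout shown in the excerpt; the names `FAILED_READ` and `first_read_failures` are made up for this illustration and are not part of smartmontools.

```python
import re
from datetime import datetime

# Matches smartd syslog lines such as:
#   2019-10-01T06:18:25.793639+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [...] [SCSI], failed to read SMART values
# The layout is an assumption taken from the excerpt in this ticket.
FAILED_READ = re.compile(
    r"^(?P<ts>\S+) \S+ smartd\[\d+\]: Device: (?P<dev>\S+) .*, (?P<msg>failed to read .+)$"
)


def first_read_failures(lines):
    """Return {device: (timestamp, message)} for the earliest read failure per device."""
    first = {}
    for line in lines:
        match = FAILED_READ.match(line.strip())
        if not match:
            continue  # not a smartd read-failure message
        ts = datetime.fromisoformat(match.group("ts"))
        dev = match.group("dev")
        if dev not in first or ts < first[dev][0]:
            first[dev] = (ts, match.group("msg"))
    return first


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < /var/log/messages
    for dev, (ts, msg) in first_read_failures(sys.stdin).items():
        print(f"{dev}: first failure at {ts.isoformat()} ({msg})")
```

Any CSV row for the same device with a timestamp later than the reported one was written while smartd could no longer read the drive.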
comment:3 by , 4 years ago
Milestone: | undecided |
---|---|
Resolution: | → worksforme |
Status: | new → closed |
Root of the problem is unknown. Problem could not be reproduced.
comment:4 by , 22 months ago
Problem happened again: One of two SSDs in Megaraid had failed after a successful short selftest.
Messages in syslog say "failed to read SMART Attribute Data" and "Read SMART Selftest Log Failed" about every 30 minutes.
Still the attrlog*ata.csv is written with new data lines. From that data everything looks OK.
OS is SLES12 x86_64 SP5 using smartmontools-6.6-6.6.3.x86_64
The SSD had failed before 2023-01-20T3:49:41, but the last entry in the corresponding attrlog file is 2023-01-26T12:49:41 at the time of writing this, and no value indicates that there might be any problem. Likewise, the ata.state file looks sound, too.
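One way to spot this situation from the outside is to check whether the newest attrlog rows differ only in their timestamp. The following is a rough Python sketch of that heuristic. It assumes each attrlog line starts with a timestamp followed by `;` and then the attribute values; that layout is an assumption here and should be checked against the actual file. The function name `trailing_identical_rows` is hypothetical.

```python
def trailing_identical_rows(lines):
    """Count how many of the newest rows carry exactly the same values.

    Assumed row layout: "<timestamp>;<attribute data...>". Only the part
    after the first ';' is compared, so rows that differ just in their
    timestamp count as identical.
    """
    payloads = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _timestamp, _sep, payload = line.partition(";")
        payloads.append(payload.strip())
    if not payloads:
        return 0
    count = 0
    for payload in reversed(payloads):
        if payload != payloads[-1]:
            break
        count += 1
    return count


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < path/to/attrlog-file.csv
    print(trailing_identical_rows(sys.stdin))
```

A long run of identical rows is only a hint, not proof: a healthy drive can report unchanged values for a while. It becomes suspicious when the run starts at about the time syslog first reports read failures.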
comment:5 by , 22 months ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
comment:6 by , 22 months ago
A side-note: "smartd -r ioctl,2 -q onecheck" complains in the output:
..., please try adding '-d megaraid,N'
But according to the manual page of smartd, "-d" turns on debugging and takes no arguments.
Also when trying that, I get a syntax error.
comment:7 by , 22 months ago
Milestone: | → undecided |
---|
Device types like -d megaraid,N need to be specified in smartd.conf, see man smartd.conf.
comment:8 by , 22 months ago
/etc/smartd.conf contains basically only two lines, like this:
DEFAULT -d removable -s (...some complex regex...)
DEVICESCAN
So I'm not sure how to apply "-d megaraid,N" as only two SSDs are attached to a "megaraid"; the other disks are SAN SCSI disks. I'm mentioning this, because I can imagine that "-d removable" might have to do with the effect I see.
comment:9 by , 22 months ago
smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-122.144-default] (SUSE RPM)
This version is 5+ years old. If possible, please retry with a newer version.
'Device: /dev/bus/0 [megaraid_disk_00] [SAT], SSDSC2KG240G7R, S/N:PHYM8183018R240AGN, WWN:5-5cd2e4-14f3ad4b0, FW:SCV1DL5C, 240 GB'
-d megaraid,0 is automatically set and therefore not needed in smartd.conf.
Please provide the related syslog and CSV outputs of smartd from around the time of the failure. If the drive is still accessible, please also provide the output of smartd -r ioctl,2 -q onecheck.