#1227 closed defect (fixed)
Crucial/Micron client SSDs: Don't interpret attribute 197 as Current_Pending_Sector
Reported by: | hse | Owned by: | Christian Franke |
---|---|---|---|
Priority: | minor | Milestone: | Release 7.2 |
Component: | drivedb | Version: | 7.0 |
Keywords: | ssd | Cc: | sersorrel |
Description
Hello.
On some servers we use CRUCIAL CT2000MX500SSD1 SSDs. We test them regularly every day. The tests finish without problems most of the time, but sometimes we receive the following notification:
    This message was generated by the smartd daemon running on:
       host name:  example
       DNS domain: example.com
    The following warning/error was logged by the smartd daemon:
    Device: /dev/sdx [SAT], 1 Currently unreadable (pending) sectors
    Device info:
    CT2000MX500SSD1, S/N:00000000000, WWN:000000000000000, FW:M3CR023, 2.00 TB
    For details see host's SYSLOG.
    You can also use the smartctl utility for further investigation.
    No additional messages about this problem will be sent.
But when I check the SMART self-test log, no issue is reported:
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%       199         -
    # 2  Short offline       Completed without error       00%       191         -
    # 3  Short offline       Completed without error       00%       182         -
    # 4  Short offline       Completed without error       00%       174         -
    # 5  Extended offline    Completed without error       00%       167         -
    # 6  Short offline       Completed without error       00%       166         -
    # 7  Short offline       Completed without error       00%       159         -
    # 8  Short offline       Completed without error       00%       152         -
    # 9  Short offline       Completed without error       00%       145         -
    #10  Short offline       Completed without error       00%       139         -
    #11  Short offline       Completed without error       00%       132         -
    #12  Short offline       Completed without error       00%       125         -
    #13  Extended offline    Completed without error       00%       119         -
    #14  Short offline       Completed without error       00%       118         -
    #15  Short offline       Completed without error       00%       111         -
    #16  Short offline       Completed without error       00%       104         -
    #17  Short offline       Completed without error       00%        96         -
    #18  Short offline       Completed without error       00%        89         -
    #19  Short offline       Completed without error       00%        82         -
    #20  Short offline       Completed without error       00%        75         -
    #21  Extended offline    Completed without error       00%        68         -
The SMART attributes give no indication that anything is wrong with the device. No pending sectors have been logged and the drive is completely new:
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
      5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       203
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       4
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
    180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       83
    183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
    184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   059   046   000    Old_age   Always       -       41 (Min/Max 0/54)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
    202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
    206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
    210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
    246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       2966529424
    247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       56912802
    248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       149379986
smartctl -i /dev/sdx:
    # smartctl -i /dev/sda
    smartctl 7.0 2018-12-30 r4883 [x86_64-linux] (local build)
    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
    Device Model:     CT2000MX500SSD1
    Serial Number:    000000000000
    LU WWN Device Id: 0 000000 000000000
    Firmware Version: M3CR023
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Wed Aug 7 13:09:20 2019 -00
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
So I am filing this as a bug: a false-positive defect.
Thank you and have a nice day.
Attachments (1)
Change History (31)
comment:1 by , 5 years ago
Component: | smartctl → smartd |
---|---|
Milestone: | → undecided |
comment:2 by , 5 years ago
Chiming in on this ticket. I have 2 seemingly bogus reports about a CurrentPendingSector error. I do not run SMART tests regularly; I only ran a short one after I got the first report. I have no clue what it all means, but it seems a bit contradictory (output at the end of this comment).
System: Debian 10
Package: smartmontools 6.6-1
1st email was on Fri, 4 Oct 2019 23:32:47 +0200
2nd email was on Mon, 7 Oct 2019 08:02:47 +0200
System startup messages concerning harddisk:
    Oct  3 19:02:47 kirika kernel: [    2.567604] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
    Oct  3 19:02:47 kirika kernel: [    2.567606] sd 0:0:0:0: [sda] 4096-byte physical blocks
    Oct  3 19:02:47 kirika kernel: [    2.567613] sd 0:0:0:0: [sda] Write Protect is off
    Oct  3 19:02:47 kirika kernel: [    2.567614] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
    Oct  3 19:02:47 kirika kernel: [    2.567624] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Oct  3 19:02:47 kirika kernel: [    2.568219]  sda: sda1
    Oct  3 19:02:47 kirika kernel: [    2.568941] sd 0:0:0:0: [sda] supports TCG Opal
    Oct  3 19:02:47 kirika kernel: [    2.568944] sd 0:0:0:0: [sda] Attached SCSI disk
    Oct  3 19:02:47 kirika kernel: [    3.734781] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
    Oct  3 19:02:47 kirika kernel: [    4.070924] EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], CT500MX500SSD1, S/N:1927E2119CC7, WWN:5-00a075-1e2119cc7, FW:M3CR023, 500 GB
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], not found in smartd database.
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], opened
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state
    Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Syslogs without temperature lines (temp changes between 65 and 67 values)
for 1st email:
    Oct  4 23:32:47 kirika smartd[465]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
    Oct  5 00:02:47 kirika smartd[465]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email
for 2nd email:
    Oct  7 08:02:47 kirika smartd[465]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
    Oct  7 08:32:47 kirika smartd[465]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email
I upgraded /var/lib/smartmontools/drivedb/drivedb.h in between the reports hoping it would fix something. I see more names in smartctl since but the second email still arrived.
    $ ls -l /var/lib/smartmontools/drivedb
    total 380
    -rw-r--r-- 1 root root 207995 okt  5 09:41 drivedb.h
    -rw-r--r-- 1 root root 179580 okt 15  2018 drivedb.h.old
Here, the .old file is the one shipped by Debian. I did not restart the smartd daemon, though; I only did that just now. Maybe that would help?
The /etc/smartd.conf is the original Debian one, a single line when the comments are stripped:
DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
Current output of smartctl -a /dev/sda:
    smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.2.0-0.bpo.2-amd64] (local build)
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
    Device Model:     CT500MX500SSD1
    Serial Number:    1927E2119CC7
    LU WWN Device Id: 5 00a075 1e2119cc7
    Firmware Version: M3CR023
    User Capacity:    500,107,862,016 bytes [500 GB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Oct  7 23:10:39 2019 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (    0) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (  30) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x0031) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
      5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       206
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
    180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       42
    183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
    184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   067   049   000    Old_age   Always       -       33 (Min/Max 0/51)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
    202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
    206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
    210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
    246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       501960200
    247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       9157853
    248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       25021935
    SMART Error Log Version: 1
    Warning: ATA error count 0 inconsistent with error log pointer 1
    ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
      When the command that caused the error occurred, the device was in an unknown state.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      00 ec 00 00 00 00 00
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
      ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
      ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
      ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
      c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%       160         -
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
The drive is barely two weeks old, yet the error report talks about 0 days/hours. Also, the "device was in an unknown state" message suggests that there is a compatibility issue.
comment:3 by , 5 years ago
Can we do something about the CurrentPendingSector errors? I used to get one per week and now it is 1-2 per day. It was never more than one sector.
Clearly Crucial is never going to fix that issue with a new firmware, but can we at least silence this one in particular? How?
Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
comment:5 by , 5 years ago
This appears to be an issue with the Crucial MX500 firmware. I just got one and it's generating 5-10 of these errors daily, but there's never a corresponding increase in Reallocated_Event_Count, nor can I find the "bad" sector in question, as it goes away without any other errors. The one fix I can think of, other than totally disabling the current pending sectors check, is to allow it to complain only if the value is greater than 1, rather than just non-zero. Obviously this should be configurable, as for drives without this bug a value of 1 may indicate a real problem, but a configurable threshold would solve this problem and make it easier to handle future buggy drives as well...
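The threshold idea in this comment can be sketched as follows. This is only an illustration in Python, not smartd code: the function should_warn and its threshold parameter are hypothetical names introduced here, and no such option is claimed to exist in smartd.

```python
def should_warn(raw_pending: int, threshold: int = 0) -> bool:
    """Decide whether a pending-sector warning should be raised.

    Hypothetical helper, not part of smartmontools.
    threshold=0 warns on any non-zero raw value of attribute 197;
    threshold=1 ignores a raw value of 1 and warns only above it.
    """
    return raw_pending > threshold


if __name__ == "__main__":
    # Compare the two policies for a few raw values.
    for raw in (0, 1, 2):
        print(raw, should_warn(raw), should_warn(raw, threshold=1))
```

With the default threshold a raw value of 1 still triggers a warning; with threshold=1 it does not, which is the behavior the comment asks for on the affected drives.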
by , 5 years ago
Attachment: | smart_drivedb.h added |
---|
Local drive database entry for MX500 with firmware M3CR023
comment:6 by , 5 years ago
Component: | smartd → drivedb |
---|---|
Keywords: | ssd added |
Summary: | SMART test generates possible false positives → Crucial MX500 firmware M3CR023 returns bogus attribute 197 |
Type: | defect → enhancement |
The attached local drive database entry should suppress Currently unreadable (pending) sectors reports from smartd for this specific drive and firmware only.
Copy the file to the default location of the local(!) drive database or append it to the existing file. See the -B option on the smartctl man page or the smartctl -h output for the configured default location (usually /etc/smart_drivedb.h).
Please report the test result.
comment:7 by , 5 years ago
startup shows the warning
    Dec  9 18:51:18 kirika smartd[29114]: smartd 6.6 2017-11-05 r4594 [x86_64-linux-5.2.0-0.bpo.2-amd64] (local build)
    Dec  9 18:51:18 kirika smartd[29114]: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
    Dec  9 18:51:18 kirika smartd[29114]: Opened configuration file /etc/smartd.conf
    Dec  9 18:51:18 kirika smartd[29114]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
    Dec  9 18:51:18 kirika smartd[29114]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda, type changed from 'scsi' to 'sat'
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], opened
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], CT500MX500SSD1, S/N:1927E2119CC7, WWN:5-00a075-1e2119cc7, FW:M3CR023, 500 GB
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], found in smartd database: Crucial/Micron MX500 SSDs
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], WARNING: This firmware returns bogus raw values in attribute 197
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
    Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state
smartctl -a /dev/sda shows the bogus line.
the comment in the file mentions ticket #1237, but that should be ticket #1227
comment:8 by , 5 years ago
the comment in the file mentions ticket #1237, but that should be ticket #1227
Thanks for catching.
Please check whether the new entry actually suppresses the Currently unreadable (pending) sectors messages and emails.
If -R 197 (or -R 197!, see man page) is temporarily added to smartd.conf, changes of this attribute are logged, for example ... Attribute: 197 ... changed from 100 [Raw 0] to 100 [Raw 1].
comment:9 by , 5 years ago
This seems to be happening in earlier firmwares too, I can confirm it for "M3CR020" and "M3CR022", which your current update doesn't address. I'm testing with:
"M3CR02[0-3]", Firmware with bogus attribute 197 (see ticket #1237)
which detects my drives with the other firmware versions. I will report on proper message suppression in a few days.
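A quick way to check which firmware strings the pattern above covers is sketched below in Python. The sketch assumes the pattern has to match the whole firmware string, which is why it uses fullmatch; the helper name firmware_matches is hypothetical, and the string M3CR024 is used purely as a non-matching test value.

```python
import re

# Firmware pattern quoted in the comment above.
FIRMWARE_PATTERN = re.compile(r"M3CR02[0-3]")


def firmware_matches(firmware: str) -> bool:
    """Return True if the firmware string is covered by the pattern.

    fullmatch() requires the whole string to match; whole-string
    matching is an assumption of this sketch.
    """
    return FIRMWARE_PATTERN.fullmatch(firmware) is not None


if __name__ == "__main__":
    for fw in ("M3CR020", "M3CR022", "M3CR023", "M3CR024"):
        print(fw, firmware_matches(fw))
```

The character class [0-3] covers the last digit only, so the three firmware versions named in this ticket match while anything outside M3CR020 to M3CR023 does not.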
comment:10 by , 5 years ago
Can confirm it suppresses the messages, saw these in logs (with -R 197) and was not notified of it via email:
    Dec 27 11:31:51 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 0] to 100 [Raw 1]
    Dec 27 12:01:52 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 1] to 100 [Raw 0]
sdi is a MX500.
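For anyone who wants to track these transitions over time, here is a minimal Python sketch that extracts the attribute ID, name, and raw values from log lines of the kind quoted in this comment. The regular expression is written against those two sample lines only; other smartd versions or attributes may log a different format, and parse_change is a hypothetical helper name.

```python
import re

# Pattern written against the two sample log lines from this comment.
CHANGE_RE = re.compile(
    r"Attribute: (?P<id>\d+) (?P<name>\S+) changed from "
    r"(?P<old_value>\d+) \[Raw (?P<old_raw>\d+)\] to "
    r"(?P<new_value>\d+) \[Raw (?P<new_raw>\d+)\]"
)


def parse_change(line):
    """Return (attribute id, name, old raw, new raw) or None if no match."""
    m = CHANGE_RE.search(line)
    if m is None:
        return None
    return (int(m["id"]), m["name"], int(m["old_raw"]), int(m["new_raw"]))


LOG_LINES = [
    "Dec 27 11:31:51 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage "
    "Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 0] to 100 [Raw 1]",
    "Dec 27 12:01:52 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage "
    "Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 1] to 100 [Raw 0]",
]

if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_change(line))
```

Feeding a syslog file through parse_change line by line would give a list of 0 to 1 and 1 to 0 transitions that can be counted per day.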
comment:11 by , 5 years ago
Milestone: | undecided → Release 7.1 |
---|---|
Owner: | set to |
Status: | new → accepted |
Thanks for testing. Then we could add the (enhanced) entry to the drive database.
comment:14 by , 4 years ago
FYI, the attribute returned by this SSD is not bogus, but it shouldn't be used for monitoring (except perhaps if it stays steadily in this state while no writes are happening...).
I have obtained the controller's technical documentation from Crucial directly - according to them it applies to all their consumer drives, even my BX300 that I was specifically inquiring about too even if it's not listed. You should be able to find it online too, search for tnfd22_client_ssd_smart_attributes.pdf
According to this document, 197 "Current pending ECC count" is:
"This value represents the total number of ECC events found as a result of host commands (for example, READ commands) or during background operations."
An older version of the M500 firmware has its own doc with a slightly different description for that attribute, yet that could still fall within the newer firmware's description:
"This value gives the number of blocks waiting to be remapped"
This really is a number of current, in-flight events on the disk. I would expect uncorrectable errors to be counted in "187 Reported Uncorrectable Errors", and since such an error is unlikely to be recoverable upon later reads, as it can be on magnetic media, I would assume these are remapped almost immediately. In fact, testing has shown I will occasionally get a 1 during heavy writes, but it never lasts.
This is clearly different from the traditional spinning disk's meaning: the number of unreadable sectors that haven't yet been remapped (which can be remapped by writing over full sectors, i.e. 4k; smaller writes require sectors to be readable, as disks have been using 4k internally since long before AF). The disk will let you retry reads as many times as you want, then it will remap or fix these upon a successful read or full write, but it will not fix them otherwise.
comment:15 by , 4 years ago
Milestone: | Release 7.1 → Release 7.2 |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
Type: | enhancement → defect |
Reopen because new info is available, see above.
comment:16 by , 4 years ago
Thanks for the info. I found revision E of tnfd22_client_ssd_smart_attributes.pdf. The same search also found revision D in ticket #812 :-). The documentation of attributes 197/198 has been changed (fixed?):
Revision C (2014-12-19: M500 (FW>=MU03), M510, M550, MX100, M600, MX200):
197: Current Pending Sector Count - This value gives the number of blocks waiting to be remapped.
198: SMART offline scan uncorrectable sector count - This value is the cumulative number of unrecoverable read errors found in a background media scan. If no background media scan has been run, a value of 0 will be returned.
Revisions D (2016-09-23: +1100, +MX300) and E (2018-09-28: +1300):
197: Current pending ECC count - This value represents the total number of ECC events found as a result of host commands (for example, READ commands) or during background operations.
198: SMART offline scan uncorrectable error count - This value is the cumulative number of unrecoverable read errors (UECC) found in the most recent media scan triggered by a SMART EXECUTE OFF-LINE IMMEDIATE command. At the beginning of each media scan, this value shall reset to zero. If no media scan has been previously run, this field will be zero.
Conclusion: Attribute 197 is not bogus but has a different meaning and should not be monitored as Current_Pending_Sector. Attribute 198 could be left as Offline_Uncorrectable.
comment:17 by , 4 years ago
Summary: | Crucial MX500 firmware M3CR023 returns bogus attribute 197 → Crucial/Micron client SSDs: Don't interpret attribute 197 as Current_Pending_Sector |
---|
Change summary accordingly.
comment:19 by , 4 years ago
Cc: | added |
---|
comment:21 by , 4 years ago
I don't know whether it's been fixed in the doc or not, but I seem to recall it missing on some of my Crucial drives, or changing the definition to a different one caused it to start throwing alerts (there seem to be multiple definitions for drives using the same chipset/SMART specs, at least in the version of drivedb I have; some cleanup would be needed...)
I'll have to look at what causes this error. I get it on my system, but I just realized that, for some reason I don't recall, I commented out my own drive's entry, so it used the other definition, which should have the same SMART specs regardless (both regexes match; I had commented out the one that comes first).
    This message was generated by the smartd daemon running on:
       host name:  debian
       DNS domain: local
    The following warning/error was logged by the smartd daemon:
    Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
    Device info:
    CT250MX500SSD1, S/N:XXXXXXXXXXXX, WWN:5-00a075-XXXXXXXXX, FW:M3CR023, 250 GB
    For details see host's SYSLOG.
    You can also use the smartctl utility for further investigation.
    Another message will be sent in 24 hours if the problem persists.
If this message comes from smartmontools (and not some Debian log monitoring add-on), then besides correcting drivedb.h we should make sure it doesn't alert if that attribute is the ECC one or the drive isn't in drivedb.h (is that already the case?)
Here's the diff between the two definitions (same drive, I just commented out one regex to match the other definition instead):
    diff -u1 1 2
    --- 1   2020-10-03 19:27:52.922133699 -0400
    +++ 2   2020-10-03 19:26:30.927444584 -0400
    @@ -4,3 +4,3 @@
     === START OF INFORMATION SECTION ===
    -Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
    +Model Family:     Crucial/Micron MX500 SSDs
     Device Model:     CT250MX500SSD1
    @@ -17,2 +17,5 @@
     Local Time is:    Sat Oct  3 19:26:30 2020 EDT
    +
    +==> WARNING: This firmware returns bogus raw values in attribute 197
    +
     SMART support is: Available - device has SMART capability.
    @@ -73,3 +76,3 @@
     196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    -197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
    +197 Current_Pend_ECC_Ct     0x0032   100   100   000    Old_age   Always       -       0
     198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
Oddly the warning is on the definition that also has the proper name!
comment:22 by , 4 years ago
This should be only logged if the drive database entry does NOT override the name of attribute 197:
Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
This should be only printed if the drive database entry contains this warning string:
==> WARNING: This firmware returns bogus raw values in attribute 197
comment:23 by , 4 years ago
To be clear, my drive can really match two drive definitions, both describing attributes for the same chipset / based on the same technical document - if I "comment out" the first regex the 2nd one matches, but I also had an older drivedb.h...
So the first definition (Model Family: Crucial/Micron MX500 SSDs):
- had Current_Pend_ECC_Ct, now renamed to Bogus_Current_Pend_Sect (I think this is wrong)
- shows the warning text on attr 197
The 2nd definition (Model Family: Crucial/Micron BX/MX1/2/3/500, M5/600, 11/1300 SSDs):
- has Current_Pending_Sector
- doesn't show the warning about attr 197
What I need to test now:
I realize I may not have restarted smartctl after changing the drivedb.h, so if it's loaded only once my test didn't work. I will check if the alert stops now that the name is Bogus_Current_Pend_Sect, and will try again with the older name, which I believe should be used.
Also, unrelated to 197, the attr 202 name is wrong for both: it reports lifetime used, not lifetime remaining.
Once we get it right both entries should be merged - to make it clean it will require an in-depth analysis of the regexes for both and merging them - I'm fluent with regex so I could submit a PR with clear comments of what is being changed/merged and how for review.
comment:24 by , 4 years ago
The MX500 entry intentionally matches a proper subset of the ...BX/MX1/2/3/500... entry. I will simply remove the MX500 entry (this will remove the WARNING: This firmware...) and change attribute 197 in the remaining entry accordingly (this will suppress the smartd Currently unreadable ... logs).
Thanks for the hint about attribute 202.
comment:25 by , 4 years ago
You're right - I overlooked it: only one model matches in the MX500 entry, and the 2nd regex is the firmware match? So it only applies to specific firmware versions on top of that.
I'm curious about the origin of that warning - It appears these drives (mine included) are plagued with a pretty insane write amplification "issue" (Crucial wouldn't qualify it that way...), maybe it's better in later firmwares, but one thing is for sure: when the drive writes 10 times as many cells as are written from the host, then it's 10 times more likely to be caught with a pending ECC operation when smartd looks it up (I see one every week or so, on average; it also varies based on write load on my desktop). My SSD will last just about 5 years based on current estimations, which I believe is the warranty period of that drive, that is if write amplification doesn't get any worse... so I might be on the higher end of desktop write loads too.
I have a bx300 I could monitor too if smartd runs on Windows - I snuck it into a pseudo-desktop TV appliance that I pretty much use only as a TV... and Windows can run sshd nowadays :)
comment:26 by , 4 years ago
Status: | reopened → accepted |
---|
comment:28 by , 4 years ago
Dear developers,
I see these errors are deemed bogus and will be ignored, but they may actually be a valid warning rather than a bogus one.
I'm quoting Lucretia19 from tomshardware forum below. He was having issues with getting confirmation email and kindly asked me to post this:
"By logging SMART data at a high rate (every second) using smartctl.exe, I established that the Bogus_Current_Pending_Sectors bug correlates perfectly with the Crucial MX500's excessive write amplification bug. Specifically, Current_Pending_Sectors changes to 1 when the ssd's FTL controller begins writing a multiple of about 37000 NAND pages (37000 NAND pages is approximately 1 GByte) and changes back to 0 when the FTL write burst ends. Although the correlation is perfect, it's unknown which is more closely related to the cause and which is more closely related to the effect. (Crucial presumably knows.) Fortunately, the excessive write amplification can be largely tamed by running ssd selftests nearly nonstop. (I insert a 30 seconds pause between 19.5 minutes selftests as a precaution, just in case the ssd's health depends on occasional FTL write bursts.) My logs show that the FTL write bursts occur only during the pauses between selftests, presumably because an FTL write burst is a lower priority process than a selftest. Selftests appear not to slow the ssd performance, presumably because a selftest is a lower priority process than host reads and writes. The only known downside is that the ssd appears to consume about 1 watt extra while running a selftest. The selftests raise the ssd temperature by a few degrees Celsius and keep the ssd temperature more stable."
His thread about the write amplification issue is at https://forums.tomshardware.com/threads/crucial-mx500-500gb-sata-ssd-remaining-life-decreasing-fast-despite-few-bytes-being-written.3571220/
A lot of other users on other forums are also facing the WAF issue so Current_Pending_Sector may be a useful warning about it.
comment:29 by , 4 years ago
Thanks for the info. All we could possibly do is to re-add the MX500 specific entry with an updated (which?) warning and attribute 197 unchanged such that the smartd Current_Pending_Sector warning is no longer suppressed. Please create a new ticket and suggest details. Don't reopen this ticket.
comment:30 by , 4 years ago
I have one of these "high WAF" drives. The warnings currently in place are sufficient: the drive's pct used is a clear indication of whether you're going to hit a wall in two years or whether your drive will last for many years to come. Mine - if the WAF doesn't get worse - is going to last for about 5 years total with almost one year in, under no particularly intensive write load. It's disappointing but I can get a better drive by then.
I wrote a lot more to my previous Crucial M4 128G, running write-heavy stuff for well over a year, and yet it seems that under the same load as my MX500 500GB it could outlast it! It went to 35 pct used from Dec 2013 to Dec 2019, with my oldest smartd logs showing it was already at 27 pct in July 2015 due to its early-life hammering. By comparison the MX500 is at 14 pct since Dec 2019!
Calculating the WAF from attr 247-248 is another way to see that something is odd (yet afaik Crucial considers it normal behavior...). Mine is at 7.94 avg since the beginning, going up and down a lot, but maybe more up than down...
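As a rough illustration of that calculation, the Python sketch below computes a write amplification factor from attributes 247 (Host_Program_Page_Count) and 248 (FTL_Program_Page_Count). The formula, total NAND pages programmed divided by host pages programmed, is an assumption of this sketch and is not quoted from this ticket; the input values are the raw values from the CT500MX500SSD1 output in comment 2, and write_amplification is a hypothetical helper name.

```python
def write_amplification(host_pages: int, ftl_pages: int) -> float:
    """Estimate WAF as (host pages + FTL pages) / host pages.

    Assumed formula for this sketch: attribute 247 counts pages
    programmed on behalf of the host, attribute 248 counts pages
    programmed by the FTL in the background.
    """
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return (host_pages + ftl_pages) / host_pages


# Raw values of attributes 247 and 248 from the CT500MX500SSD1 in comment 2.
HOST_PROGRAM_PAGE_COUNT = 9157853
FTL_PROGRAM_PAGE_COUNT = 25021935

if __name__ == "__main__":
    waf = write_amplification(HOST_PROGRAM_PAGE_COUNT, FTL_PROGRAM_PAGE_COUNT)
    print(f"WAF = {waf:.2f}")
```

Under this assumed formula the drive from comment 2 comes out at roughly 3.7; sampling the two attributes periodically and computing the ratio of the differences would show how the factor changes over time rather than only the lifetime average.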
If you wish to give people life expectancy pre-warnings, i.e. "at the current usage rate your ssd will stop working in X years/days..", that might actually help users determine if something is wrong. But sending random warnings about an attribute that goes to 1 as part of normal ssd function (albeit quite a bit more often when said ssd writes GBs in batches in the background) is, I think, not very helpful.
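A life expectancy pre-warning of the kind suggested here could be based on a simple linear extrapolation of the percent-used value. The Python sketch below is only an illustration under that assumption (wear continues at the average rate seen so far); the function names are hypothetical and the example numbers are made up for illustration, not taken from any drive in this ticket.

```python
def estimate_total_years(percent_used: float, years_in_service: float) -> float:
    """Extrapolate total lifetime assuming a constant wear rate."""
    if percent_used <= 0 or years_in_service <= 0:
        raise ValueError("percent_used and years_in_service must be positive")
    return 100.0 * years_in_service / percent_used


def estimate_years_remaining(percent_used: float, years_in_service: float) -> float:
    """Remaining lifetime under the same constant-rate assumption."""
    return estimate_total_years(percent_used, years_in_service) - years_in_service


if __name__ == "__main__":
    # Illustrative values only: 20 percent used after one year in service.
    print(estimate_total_years(20.0, 1.0))
    print(estimate_years_remaining(20.0, 1.0))
```

A real implementation would need at least two readings some time apart, since a drive whose wear rate changes with load (as described in this comment) is poorly described by a single lifetime average.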
Please provide the configuration line for this device from the smartd.conf file.
Please check syslog for smartd ... Currently unreadable (pending) sectors messages.