Opened 22 months ago

Last modified 7 months ago

#1715 new enhancement

Allow to ignore certain bits of NVMe Critical Warning byte

Reported by: BradGeeeeeeeeeeeeeeeeeee
Owned by:
Priority: minor
Milestone: unscheduled
Component: smartd
Version: 7.2
Keywords: nvme
Cc: Felix E.

Description

I have a Samsung SSD 960 EVO 250GB with 424TB written. The drive stores large amounts of RRD files for an SNMP-type monitoring system and gets re-written constantly.

The manufacturer warranty for this drive covers 100TB written, so we are at 424% of the warranted amount, which is why the "Percentage Used" value is so high. However, the drive works fine and shows no other signs of wearing out.
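The arithmetic behind these figures can be checked with a short sketch. It assumes the NVMe convention that one data unit is 1,000 blocks of 512 bytes, takes the "Data Units Written" count from the smartctl output further down, and uses the 100TB warranty figure quoted above. The function names are made up for illustration. Note that the reported "Percentage Used" of 255% is the largest value that field can hold, so it is a lower bound rather than an exact figure.

```python
# One NVMe "data unit" is 1,000 blocks of 512 bytes (assumption: this follows
# the NVMe SMART/Health log definition of the Data Units fields).
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_tb(data_units: int) -> float:
    """Convert an NVMe data-unit count to decimal terabytes (10**12 bytes)."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


def percent_of_warranty(written_tb: float, warranty_tb: float) -> float:
    """Express the written volume as a percentage of the warranted volume."""
    return 100.0 * written_tb / warranty_tb


# "Data Units Written" as reported by smartctl for this drive.
written_tb = data_units_to_tb(830_060_020)
print(f"written: {written_tb:.2f} TB")
print(f"share of 100 TB warranty: {percent_of_warranty(written_tb, 100):.0f}%")
```

The result comes out at just under 425 TB, which smartctl displays as 424 TB.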

I recently did an OS update on this Debian host, from Debian 10 to 11, and along with it came a new version of smartmontools. Unfortunately it now complains every 24 hours with the following error:

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, Critical Warning (0x04): Reliability

I have been forced to add a "/dev/nvme0 -d ignore" to my /etc/smartd.conf file, but this prevents me from being alerted to any other possible problems, including any reduction of the "Available Spare" value, or thermal warnings.

With ATA drives, it's possible to ignore certain specific attributes with the -i or -I arguments, but I'm not aware of any similar feature which might be helpful here.

Would it be possible to ignore such warnings, while still monitoring the device for other problems? What do you advise?

I suspect this problem will occur more regularly as time goes on and more too-reliable-for-their-own-damned-good drives will begin to annoy their administrators.

The only crime this drive has committed is its failure to fail! Please end this unfair persecution of my poor, abused but reliable, NVMe drive!

This request on Serverfault is similar to mine and might be worth a read (I am not the author):

https://serverfault.com/questions/1118718/how-to-add-to-excludes-alerts-on-smartmontool

My searching found that this request is somewhat similar to bug 1434:

https://www.smartmontools.org/ticket/1434

Below is full output from a smartctl -a.

> sudo smartctl -a /dev/disk/by-id/nvme-Samsung_SSD_960_EVO_250GB_xxxxxxxxxxxxxxx

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-21-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      xxxxxxxxxxxxxxx
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            191,818,444,800 [191 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            xxxxxxxxxxxxxxxxx
Local Time is:                      Thu Apr  6 20:23:49 2023 MST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    151,762,677 [77.7 TB]
Data Units Written:                 830,060,020 [424 TB]
Host Read Commands:                 6,877,731,354
Host Write Commands:                51,000,719,462
Controller Busy Time:               79,390
Power Cycles:                       35
Power On Hours:                     31,418
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Change History (5)

comment:1 by Christian Franke, 22 months ago

Keywords: nvme added; ignore Critical Warning removed
Milestone: unscheduled
Summary: changed from "NVME Critical Warning: 0x04 due to excessive Data Units Written" to "Allow to ignore certain bits of NVMe Critical Warning byte"

For NVMe devices, smartd only sends warnings if Critical Warning is non zero (-H directive), the Error Information Log Entries count has changed (-l error directive) or a Temperature [Sensor N] reaches the critical limit (-W ... directive).

If no directive or -a is specified, the default for NVMe is -H -l error.

I agree that it would be useful to optionally ignore certain Critical Warning bits. For example by adding an optional argument to the -H directive: -H 0xfb should ignore 0x04. Changing ticket summary accordingly.
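To make the proposed mask concrete, here is a minimal sketch of how a value such as 0xfb would act on the Critical Warning byte. The optional -H argument does not exist yet; `apply_mask` and `describe` are hypothetical helpers, and the bit descriptions paraphrase the NVMe specification (bit 0x20 was only added in NVMe 1.4).

```python
# Bit meanings of the NVMe Critical Warning byte (SMART/Health log, byte 0).
# Descriptions are paraphrased from the NVMe specification.
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
    0x20: "persistent memory region has become read-only or unreliable",
}


def apply_mask(critical_warning: int, mask: int = 0xFF) -> int:
    """Keep only the bits selected by `mask` (the hypothetical -H argument)."""
    return critical_warning & mask


def describe(critical_warning: int) -> list[str]:
    """List the descriptions of all bits set in the byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items()
            if critical_warning & bit]


raw = 0x04                      # value reported by the drive in this ticket
print(describe(raw))
print(apply_mask(raw, 0xFB))    # 0xfb clears bit 0x04, so nothing remains
print(apply_mask(0x06, 0xFB))   # a temperature warning (0x02) still gets through
```

With a mask like this, the wear-related bit would stop triggering mails while all other bits would still raise a warning.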

... including any reduction of the "Available Spare" value, or thermal warnings.

Thermal warnings require -W ... directive, see man smartd.conf.

Available Spare is not monitored, but should be, because Available Spare Threshold is provided. Please create a separate ticket for this feature.
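A check of Available Spare against its threshold could look roughly like the sketch below. This is not something smartd does today, as noted above. The sketch works on a dictionary shaped like the `nvme_smart_health_information_log` object of `smartctl --json` output. The key names are quoted from memory and should be verified against real output, and the function name is hypothetical.

```python
def spare_below_threshold(health_log: dict) -> bool:
    """Return True if Available Spare has dropped below the drive's threshold."""
    spare = health_log["available_spare"]                # percent
    threshold = health_log["available_spare_threshold"]  # percent
    return spare < threshold


# Values taken from the smartctl output in the ticket description.
sample = {"available_spare": 100, "available_spare_threshold": 10}
print(spare_below_threshold(sample))
```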

comment:2 by mhaamann, 15 months ago

We have the same issue now on most of our production servers. Posting comment here to subscribe for updates

in reply to:  2 comment:3 by BradGeeeeeeeeeeeeeeeeeee, 15 months ago

Replying to mhaamann:

We have the same issue now on most of our production servers. Posting comment here to subscribe for updates

See https://www.smartmontools.org/ticket/1716

comment:4 by Felix E., 7 months ago

Cc: Felix E. added

Hi,

I was trying my hand at this and came to more "architectural" questions:

  1. Do we want to ignore the selected critical_warning bit(s) just for the result of the pass/fail check (i.e. exit code, "self-assessment test result" & json smart_status.passed), or do we want the bit to seemingly vanish from all related values (long-form output of set bits, related json smart_status.nvme subkey(s) and smart_status.value)?

I think this heavily depends on what the use case is:

  • If it's to work around a device bug, we probably do not want dependent applications to know that the bit was set in the first place. In that case, the second group of values named above would need to be redacted.
  • If it's just a case of "I don't want a full smartd alarm if the TBW value is over 100%" (critical_warning bit 0x04), it might still be relevant for monitoring purposes, so those downstream tools should still be able to get the actual value (but not the FAILED state).
  2. Are there other fields/values except for critical_warning with NVMes that we might want to ignore in the future (in --health)? If so, maybe a bitmask directly as -H param isn't the best idea?
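The two options from the first question can be sketched side by side. Everything here is hypothetical: `HealthResult`, `evaluate` and the `redact` switch are illustrative names, not anything present in smartmontools.

```python
from dataclasses import dataclass


@dataclass
class HealthResult:
    passed: bool          # would drive the exit code and the PASSED/FAILED line
    reported_value: int   # what downstream tools would see as critical_warning
    ignored_bits: int     # bits that were set but covered by the ignore mask


def evaluate(critical_warning: int, ignore: int, redact: bool) -> HealthResult:
    """Apply a hypothetical ignore mask under one of the two policies.

    redact=False: only the pass/fail verdict ignores the bits.
    redact=True:  the bits also vanish from the reported value.
    """
    ignored = critical_warning & ignore
    effective = critical_warning & ~ignore & 0xFF
    reported = effective if redact else critical_warning
    return HealthResult(passed=(effective == 0),
                        reported_value=reported,
                        ignored_bits=ignored)


print(evaluate(0x04, ignore=0x04, redact=False))  # verdict passes, value kept
print(evaluate(0x04, ignore=0x04, redact=True))   # verdict passes, value cleared
```

Keeping `ignored_bits` separate would let either policy still print a note about which bits were suppressed.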

Also partially related: is the output of smartctl supposed to be machine-readable?
I was thinking of explicitly mentioning set critical_warning bits when they are ignored, but with a special message behind them, maybe like this:

smartctl pre-7.5 2024-05-08 r5613 [x86_64-linux-6.8.8-2-pve] (local build)
Copyright (C) 2002-24, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
- NVM subsystem reliability has been degraded (ignored by -H bitmask)

And one last (meta-) question: What exactly is the preferred way for contribution here? A patch/diff and a maintainer applies that? Mailinglist? I really am not familiar with SVN (sorry) :)

Really sorry for all the questions. I am a beginner with C/C++ and this project.

in reply to:  4 comment:5 by Christian Franke, 7 months ago

Replying to Felix E.:

...

  1. Are there other fields/values except for critical_warning with NVMes that we might want to ignore in the future (in --health)? If so, maybe a bitmask directly as -H param isn't the best idea?

This ticket is for smartd and its evaluation of Critical Warning byte only. If you have any suggestion for more complex functionality, please create a new ticket and provide detailed suggestions there.

Also partially related: is the output of smartctl supposed to be machine-readable?

Yes, with --json[=cgiosuvy] option, available since smartmontools 7.0 (Dec 2018).
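As a rough illustration of consuming that output, the sketch below parses a hand-written JSON fragment rather than calling smartctl. The `smart_status.passed` key and the `smart_status.nvme` subkey are the ones named in comment 4. The exact nesting of the `value` field is an assumption, and `read_status` is a made-up helper.

```python
import json

# Hand-written fragment for illustration, not captured from a real run.
SAMPLE = '{"smart_status": {"passed": false, "nvme": {"value": 4}}}'


def read_status(json_text: str) -> tuple[bool, int]:
    """Return (passed, critical warning value) from smartctl-style JSON."""
    status = json.loads(json_text)["smart_status"]
    # Fall back to 0 if the NVMe-specific subkey is absent.
    value = status.get("nvme", {}).get("value", 0)
    return status["passed"], value


passed, value = read_status(SAMPLE)
print(f"passed={passed} critical_warning={value:#04x}")
```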

I was thinking of explicitly mentioning set critical_warning bits when they are ignored, but with a special message behind them, maybe like this:

If you want this functionality also for smartctl, please create a new ticket.

And one last (meta-) question: What exactly is the preferred way for contribution here? A patch/diff and a maintainer applies that? Mailinglist? I really am not familiar with SVN (sorry) :)

Attach a patch file to a new ticket or create a PR in our github R/O mirror. Or wait until we have moved the project from SF to github (which will possibly happen by the end of this year) and create a PR then.
