Opened 4 months ago

Last modified 7 weeks ago

#1835 accepted enhancement

Smartd should also ignore 'Set Feature' related errors from NVMe Error Information log

Reported by: Christian Franke
Owned by: Christian Franke
Priority: minor
Milestone: Release 7.5
Component: smartd
Version: 7.4
Keywords: nvme
Cc: BrianG

Description

Recent comments from ticket #1222 show that the NVMe error "Feature Identifier Not Saveable" (SCT=0x1, SC=0x0d) may also appear in the error log after each reboot:

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       4813     0  0x2010  0x4004      -            0     1     -  Invalid Field in Command
  1       4812     0  0x0010  0x4004      -            0     0     -  Invalid Field in Command
  2       4811     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  3       4810     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
  4       4809     0  0x3007  0x4004      -            0     1     -  Invalid Field in Command
  5       4808     0  0x1003  0x4004      -            0     0     -  Invalid Field in Command
  6       4807     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
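The SCT/SC values quoted above can be recovered from the Status column. A minimal sketch, assuming the column shows the raw 16-bit Status Field of the error log entry with the Phase Tag in bit 0, and that PELoc holds the byte offset in bits 7:0 and the bit offset in bits 10:8. The function names are illustrative only and are not taken from smartmontools:

```python
def decode_status(status: int) -> dict:
    """Split a 16-bit NVMe Status Field (Phase Tag in bit 0) into its parts."""
    return {
        "sc": (status >> 1) & 0xFF,    # Status Code, bits 8:1
        "sct": (status >> 9) & 0x7,    # Status Code Type, bits 11:9
        "more": (status >> 14) & 0x1,  # More bit
        "dnr": (status >> 15) & 0x1,   # Do Not Retry bit
    }


def decode_peloc(peloc: int) -> dict:
    """Split a Parameter Error Location value into byte, bit and command dword."""
    byte = peloc & 0xFF
    return {"byte": byte, "bit": (peloc >> 8) & 0x7, "dword": byte // 4}


if __name__ == "__main__":
    print(decode_status(0x421A))  # sct=1, sc=0x0d: Feature Identifier Not Saveable
    print(decode_status(0x4004))  # sct=0, sc=0x02: Invalid Field in Command
    print(decode_peloc(0x028))    # byte 40, i.e. command dword 10
```

Under these assumptions, PELoc 0x028 points into command dword 10, which for Set Features carries the Feature Identifier and the SV bit.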

smartd syslog:

2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, Samsung SSD 960 PRO 512GB, S/N:***************, FW:2B6QCXP7, 512 GB
...
2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error [1], count 4811, status 0x421a: Feature Identifier Not Saveable
2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error count increased from 4808 to 4812 (1 new, 3 ignored, 0 unknown)

This suggests that the kernel (or another component run during boot) issues a Set Features NVMe command with SV (Save) bit set without a prior check whether this bit is supported. If the kernel does it, this is IMO a kernel bug.

Smartd should also ignore this error.
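What "ignore" means for the counters in the syslog line above can be sketched as follows. This is not smartd's actual code: the function name, the data layout, and the reading of "unknown" as "counted by the device but no longer visible in the log" are assumptions made for illustration. The entries are copied from the error log table above.

```python
# (ErrCount, Status) pairs copied from the error log table in the description.
LOG = [
    (4813, 0x4004), (4812, 0x4004), (4811, 0x421A), (4810, 0x4004),
    (4809, 0x4004), (4808, 0x4004), (4807, 0x421A),
]

INVALID_FIELD_IN_COMMAND = 0x002         # SCT=0x0, SC=0x02
FEATURE_ID_NOT_SAVEABLE = 0x10D          # SCT=0x1, SC=0x0d


def sct_sc(status: int) -> int:
    """Drop Phase Tag, More and DNR; keep SCT (bits 10:8) and SC (bits 7:0)."""
    return (status >> 1) & 0x7FF


def classify(entries, old_count, new_count, ignored):
    """Return (new, ignored, unknown) for errors counted since old_count."""
    fresh = [s for c, s in entries if old_count < c <= new_count]
    n_ignored = sum(1 for s in fresh if sct_sc(s) in ignored)
    n_new = len(fresh) - n_ignored
    n_unknown = (new_count - old_count) - len(fresh)  # assumption, see lead-in
    return n_new, n_ignored, n_unknown


if __name__ == "__main__":
    current = {INVALID_FIELD_IN_COMMAND}
    proposed = current | {FEATURE_ID_NOT_SAVEABLE}
    print(classify(LOG, 4808, 4812, current))   # (1, 3, 0), as in the syslog
    print(classify(LOG, 4808, 4812, proposed))  # (0, 4, 0)
```

With the additional status on the ignore list, the same reboot would no longer produce a "new" error.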

Change History (7)

comment:1 by Christian Franke, 4 months ago

Owner: set to Christian Franke
Status: new → accepted

comment:2 by Christian Franke, 4 months ago (in reply to: description)

> This suggests that the kernel (or another component run during boot) issues a Set Features NVMe command with SV (Save) bit set without a prior check whether this bit is supported. If the kernel does it, this is IMO a kernel bug.

I take that back because this device indicates SV bit (Sav/Sel_Feat) support:

Model Number:                       Samsung SSD 960 PRO 512GB
...
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

The individual Feature ID used may not support SV.
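For reference, the Optional NVM Commands value shown above decodes as in the sketch below. The bit positions are those of the ONCS field in the Identify Controller data; the names are the abbreviations from the smartctl output above, and only the five bits that appear there are listed. The function names are illustrative only.

```python
# ONCS bit positions, labeled with the abbreviations from the smartctl output.
ONCS_BITS = {
    0: "Comp",          # Compare
    1: "Wr_Unc",        # Write Uncorrectable
    2: "DS_Mngmt",      # Dataset Management
    3: "Wr_Zero",       # Write Zeroes
    4: "Sav/Sel_Feat",  # Save field in Set Features / Select field in Get Features
}


def decode_oncs(oncs: int) -> list:
    """Return the names of all listed ONCS bits that are set."""
    return [name for bit, name in sorted(ONCS_BITS.items()) if oncs & (1 << bit)]


def supports_save(oncs: int) -> bool:
    """True if the controller reports support for the Save (SV) field."""
    return bool(oncs & (1 << 4))


if __name__ == "__main__":
    print(decode_oncs(0x001F))    # all five names
    print(supports_save(0x001F))  # True
```

This bit only reports controller-wide support for the Save field; whether a given Feature ID is saveable is a separate, per-feature property.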

comment:3 by BrianG, 4 months ago

Christian, thanks a lot for the prompt reply!

Is it possible to identify the offending component? Let me know if I can help with any additional information.

comment:4 by Christian Franke, 3 months ago

If restarting services without a reboot (e.g. systemctl rescue and then ^D, or systemctl soft-reboot) results in new Feature Identifier Not Saveable log entries, one of the restarted services might be the root of the problem.

I could not reproduce the behavior on a system with Debian 12 (Console/SSH only, no GUI) using a Samsung SSD 970 EVO Plus 500GB. Only one Invalid Field in Command appears after each reboot.

AFAICS from the Linux kernel sources, there is support for NVMe Set Features commands, see core.c. This is (only?) used by pci.c to change the power state and by hwmon.c to change the temperature thresholds.
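To make the SV discussion concrete, here is a minimal sketch of how command dword 10 of a Set Features command is laid out: Feature Identifier in bits 7:0 and SV in bit 31. It does not reproduce kernel code and makes no claim about which caller sets SV. The function names are illustrative; the two Feature IDs are the ones for Power Management and Temperature Threshold.

```python
FID_POWER_MANAGEMENT = 0x02
FID_TEMPERATURE_THRESHOLD = 0x04
SV_BIT = 1 << 31  # Save bit in Set Features command dword 10


def set_features_cdw10(fid: int, save: bool = False) -> int:
    """Build command dword 10: FID in bits 7:0, SV in bit 31."""
    if not 0 <= fid <= 0xFF:
        raise ValueError("Feature Identifier must fit in 8 bits")
    return fid | (SV_BIT if save else 0)


def has_save_bit(cdw10: int) -> bool:
    """True if a Set Features command dword 10 requests a persistent save."""
    return bool(cdw10 & SV_BIT)


if __name__ == "__main__":
    print(hex(set_features_cdw10(FID_POWER_MANAGEMENT)))             # 0x2
    print(hex(set_features_cdw10(FID_TEMPERATURE_THRESHOLD, True)))  # 0x80000004
```

A command with SV set for a Feature ID that is not saveable is what would be expected to produce the Feature Identifier Not Saveable status.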

comment:5 by ThoughtPolice84, 3 months ago

I am still following this ticket and #1222 because I still receive NVMe errors.
Now with another Linux build and a newer version of smartmontools:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 1098 to 1099

Device info:
SAMSUNG MZVLB256HAHQ-000H1, S/N:S425NX1M782076, FW:EXD70H1Q, 256 GB

comment:6 by BrianG, 3 months ago

@ThoughtPolice84, the new behavior was added on build [5472]. You need to update smartmontools to version 7.4.

Christian, let me know if I can help with any additional information.

comment:7 by BrianG, 7 weeks ago

FWIW, I moved my drive to a different machine, and merely powering the system on registers the new error entries, even without mounting any partition of the disk.

6.9.7-1~bpo12+1 (2024-07-03) x86_64 GNU/Linux

smartctl 7.4 2023-08-01 r5530
