Opened 8 months ago

Closed 2 months ago

#1835 closed enhancement (fixed)

Smartd should also ignore 'Set Feature' related errors from NVMe Error Information log

Reported by: Christian Franke Owned by: Christian Franke
Priority: minor Milestone: Release 7.5
Component: smartd Version: 7.4
Keywords: nvme Cc: BrianG

Description

Recent comments from ticket #1222 show that the NVMe error "Feature Identifier Not Saveable" (SCT=0x1, SC=0x0d) may also appear in the error log after each reboot:

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       4813     0  0x2010  0x4004      -            0     1     -  Invalid Field in Command
  1       4812     0  0x0010  0x4004      -            0     0     -  Invalid Field in Command
  2       4811     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  3       4810     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
  4       4809     0  0x3007  0x4004      -            0     1     -  Invalid Field in Command
  5       4808     0  0x1003  0x4004      -            0     0     -  Invalid Field in Command
  6       4807     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable

smartd syslog:

2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, Samsung SSD 960 PRO 512GB, S/N:***************, FW:2B6QCXP7, 512 GB
...
2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error [1], count 4811, status 0x421a: Feature Identifier Not Saveable
2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error count increased from 4808 to 4812 (1 new, 3 ignored, 0 unknown)

This suggests that the kernel (or another component run during boot) issues a Set Features NVMe command with SV (Save) bit set without a prior check whether this bit is supported. If the kernel does it, this is IMO a kernel bug.

Smartd should also ignore this error.

Attachments (3)

smartctl-r5620.txt (1.8 KB ) - added by BrianG 3 months ago.
Output of smartctl -l error /dev/nvme0n1 -- r5620
smartctl-2024-11-22T21:35Z.txt (1.8 KB ) - added by BrianG 2 months ago.
smartctl -l error /dev/nvme0 - 2024-11-22T21:35Z
journalctl-83e61d4458eb41488649ccd34a87fc3b-smartmontools-Nov_22.log (4.8 KB ) - added by BrianG 2 months ago.
journalctl --utc -xe --no-hostname --priority 0..7 --boot 83e61d4458eb41488649ccd34a87fc3b -u smartmontools.service - 2024-11-22

Download all attachments as: .zip

Change History (24)

comment:1 by Christian Franke, 8 months ago

Owner: set to Christian Franke
Status: newaccepted

in reply to:  description comment:2 by Christian Franke, 8 months ago

This suggests that the kernel (or another component run during boot) issues a Set Features NVMe command with SV (Save) bit set without a prior check whether this bit is supported. If the kernel does it, this is IMO a kernel bug.

I take that back because this device indicates SV bit (Sav/Sel_Feat) support:

Model Number:                       Samsung SSD 960 PRO 512GB
...
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

The individual Feature ID used may not support SV.

comment:3 by BrianG, 8 months ago

Christian, thanks a lot for the prompt reply!

Is it possible to identify the offending component? Let me know if I can help with any additional information.

comment:4 by Christian Franke, 8 months ago

If a restart of services without reboot (e.g. systemctl rescue and then ^D, or systemctl soft-reboot) results in new Feature Identifier Not Saveable log entries, some of the restarted services might be the root of the problem.

I could not reproduce the behavior on a system with Debian 12 (Console/SSH only, no GUI) using a Samsung SSD 970 EVO Plus 500GB. Only one Invalid Field in Command appears after each reboot.

AFAICS from the Linux kernel sources, there is support for NVMe Set Features commands, see core.c. This is (only?) used by pci.c to change the power state and by hwmon.c to change the temperature thresholds.

comment:5 by ThoughtPolice84, 8 months ago

I am still following this ticket and #1222 because I still receive NVMe errors.
Now with another Linux build and a newer version of smartmontools:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 1098 to 1099

Device info:
SAMSUNG MZVLB256HAHQ-000H1, S/N:S425NX1M782076, FW:EXD70H1Q, 256 GB

comment:6 by BrianG, 8 months ago

@ThoughtPolice84, the new behavior was added on build [5472]. You need to update smartmontools to version 7.4.

Christian, let me know if I can help with any additional information.

comment:7 by BrianG, 6 months ago

FWIW, I moved my hard drive to a different machine and just turning the system on registers the new error entries, not even mounting any partition of the disk.

6.9.7-1~bpo12+1 (2024-07-03) x86_64 GNU/Linux

smartctl 7.4 2023-08-01 r5530

comment:8 by Christian Franke, 3 months ago

Resolution: fixed
Status: acceptedclosed

comment:9 by BrianG, 3 months ago

Christian, is it enough just copy the binaries smartctl and smartd replacing the original ones from Debian?

If so, I am still getting an increasing error count using the build r5620.

comment:10 by Christian Franke, 3 months ago

Overwriting binaries from packages may work, but also break package integrity checks.

Please provide a smartctl -l error output for further diagnostics.

by BrianG, 3 months ago

Attachment: smartctl-r5620.txt added

Output of smartctl -l error /dev/nvme0n1 -- r5620

comment:11 by BrianG, 3 months ago

I attached file smartctl-r5620.txt.

There is just one type of error recorded, Invalid Field in Command. Probably just the first 4 entries were added after I had updated to r5620.

As I've moved this disk to a different machine than the original bug report I've not been using it as much as before. Then, the other error Feature Identifier Not Saveable may still be present too, but not triggered recently with the different usage.

comment:12 by Christian Franke, 3 months ago

Please provide the related lines in the syslog output of smartd:

... Device: /dev/nvme..., NVMe error [...], count ... , status 0x...: ...
... Device: /dev/nvme..., NVMe error count increased from ... to ... (...)

The first line is only logged (as LOG_INFO) for errors interpreted as device related. In this case, the last line is logged as LOG_CRIT, otherwise as LOG_INFO.

comment:13 by BrianG, 3 months ago

I don't see the mentioned lines in the logs, probably because smartd is not logging anything other than:

smartd[1112]: This version of smartd was build without 'Type=notify' support.

Is there any binary that is compiled with notifications support that I can download for testing?

comment:14 by Christian Franke, 3 months ago

Smartd exits with status 1 after this message. Remove Type=notify from smartd.service (on some systems smartmontools.service) file and retry.

The binaries at https://builds.smartmontools.org/ do not use libsystemd because otherwise these won't run on systems not using systemd.

comment:15 by BrianG, 3 months ago

I enabled the notifications and I started observing some messages logged, but no errors at all.

After booting, the latest messages logged are:

journalctl --utc -xe --priority 0..7 --boot 0 -u smartmontools.service

[...]
Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, opened
Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, Samsung SSD 960 PRO 512GB, S/N:*****, FW:2B6QCXP7, 512 GB
Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
Oct 30 21:55:16 smartd[1127]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 1 NVMe devices
Oct 30 21:55:16 smartd[1127]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62

comment:16 by BrianG, 2 months ago

Resolution: fixed
Status: closedreopened

I'm reopening this issue, so it won't get forgotten.
Feel free to close again if more appropriate.

comment:17 by Christian Franke, 2 months ago

Milestone: Release 7.5undecided

If state persistence is not enabled (by default or by smartd -s ...), no NVMe error log related messages would occur after boot because the previous state is unknown.

Please provide outputs of smartctl -l error /dev/nvme0 before and after boot.

by BrianG, 2 months ago

smartctl -l error /dev/nvme0 - 2024-11-22T21:35Z

by BrianG, 2 months ago

journalctl --utc -xe --no-hostname --priority 0..7 --boot 83e61d4458eb41488649ccd34a87fc3b -u smartmontools.service - 2024-11-22

comment:18 by BrianG, 2 months ago

OK, I'm getting one of the error messages you had pointed (increasing the counter), but the other one (specifying the error status) is not being logged.
Moreover, one of the errors that happens during boot is taking exactly 30 minutes to get into the logs.

I'm not sure if I'm doing something wrong, as I never tweaked any smartd options before, but I modified the smartmontools.service as:

ExecStart=/usr/sbin/smartd -n -s /var/log/

Then, it started logging and the state files were created.

I noted that the NVMe drive always logs only two lines, different from the other SSDs that record many values into their state files:
cat /var/log/Samsung_SSD_960_PRO_512GB-XXXXXXXXX.nvme.state}

# smartd state file
nvme-err-log-entries = 5432

Then, running the system reboot tests I realize the counter on the state-file gets increased by two.
In journalctl the logs confirm:

Nov 22 16:30:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5432 to 5434 (0 new, 2 ignored, 0 unknown)

However, running smartctl -l error /dev/nvme0 I realize the counter actually increased by three, reaching 5435.
I kept the machine on, and this error "5435" finally logged the counter increment in journalctl after exactly 30 minutes:

Nov 22 17:00:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5434 to 5435 (0 new, 1 ignored, 0 unknown)

The same behavior could be observed after other consecutive tests. The system boots, the error counter increases, but one error is not logged yet it can be observed running smartctl -l error /dev/nvme0. Finally, after keeping the system on for 30 minutes, then this counter increase is logged. If the system is turned off before 30 minutes the message regarding the counter increase is never logged.

Therefore, one of those error counter increments is being logged only after exactly 30 minutes.

I also tested keeping the system turned on for two hours - not using the disk - to observe if any new errors would be logged, but I got no new error messages.

The journalctl output is identical after every boot, so I'm attaching only a single boot record.
Attached the journalctl and the smartctl output:

comment:19 by BrianG, 2 months ago

One additional piece of information I got from nvme error-log /dev/nvme0:

 Entry[ 6]   
.................
error_count	: 5435
sqid		: 0
cmdid		: 0x19
status_field	: 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag	: 0
parm_err_loc	: 0xffff
lba		: 0
nsid		: 0x1
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

in reply to:  18 comment:20 by Christian Franke, 2 months ago

I'm not sure if I'm doing something wrong, as I never tweaked any smartd options before, but I modified the smartmontools.service as:

ExecStart=/usr/sbin/smartd -n -s /var/log/

Using /var/log as the directory for *.state files may work, but is not recommended. A usual configuration is -s /var/lib/smartmontools/smartd- which may be the configured default, see smartd -h output.

Then, running the system reboot tests I realize the counter on the state-file gets increased by two.
In journalctl the logs confirm:

Nov 22 16:30:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5432 to 5434 (0 new, 2 ignored, 0 unknown)

smartd still logs changes of the error count, but only with LOG_INFO level. It it does not log (and send a warning mail if configured) these non-device errors with LOG_CRIT level (see 0 new, 2 ignored).

Conclusion: It works as expected. BTW, your report is unrelated to this specific ticket because Set Feature related errors are not listed in the provided logs.

Please create a new ticket if not convinced or further issues are found.

comment:21 by Christian Franke, 2 months ago

Milestone: undecidedRelease 7.5
Resolution: fixed
Status: reopenedclosed

r5620 (restoring original resolution).

Note: See TracTickets for help on using tickets.