Opened 8 months ago
Closed 2 months ago
#1835 closed enhancement (fixed)
Smartd should also ignore 'Set Feature' related errors from NVMe Error Information log
Reported by: | Christian Franke | Owned by: | Christian Franke |
---|---|---|---|
Priority: | minor | Milestone: | Release 7.5 |
Component: | smartd | Version: | 7.4 |
Keywords: | nvme | Cc: | BrianG |
Description
Recent comments from ticket #1222 show that the NVMe error "Feature Identifier Not Saveable" (SCT=0x1, SC=0x0d) may also appear in the error log after each reboot:
Num ErrCount SQId CmdId Status PELoc LBA NSID VS Message 0 4813 0 0x2010 0x4004 - 0 1 - Invalid Field in Command 1 4812 0 0x0010 0x4004 - 0 0 - Invalid Field in Command 2 4811 0 0x001b 0x421a 0x028 0 0 - Feature Identifier Not Saveable 3 4810 0 0x0012 0x4004 - 0 0 - Invalid Field in Command 4 4809 0 0x3007 0x4004 - 0 1 - Invalid Field in Command 5 4808 0 0x1003 0x4004 - 0 0 - Invalid Field in Command 6 4807 0 0x001b 0x421a 0x028 0 0 - Feature Identifier Not Saveable
smartd syslog:
2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, Samsung SSD 960 PRO 512GB, S/N:***************, FW:2B6QCXP7, 512 GB ... 2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error [1], count 4811, status 0x421a: Feature Identifier Not Saveable 2024-05-25T14:44:49+0000 smartd[834]: Device: /dev/nvme0, NVMe error count increased from 4808 to 4812 (1 new, 3 ignored, 0 unknown)
This suggests that the kernel (or another component run during boot) issues a Set Features NVMe command with SV (Save) bit set without a prior check whether this bit is supported. If the kernel does it, this is IMO a kernel bug.
Smartd should also ignore this error.
Attachments (3)
Change History (24)
comment:1 by , 8 months ago
Owner: | set to |
---|---|
Status: | new → accepted |
comment:2 by , 8 months ago
comment:3 by , 8 months ago
Christian, thanks a lot for the prompt reply!
Is it possible to identify the offending component? Let me know if I can help with any additional information.
comment:4 by , 8 months ago
If a restart of services without reboot (e.g. systemctl rescue
and then ^D
, or systemctl soft-reboot
) results in new Feature Identifier Not Saveable
log entries, some of the restarted services might be the root of the problem.
I could not reproduce the behavior on a system with Debian 12 (Console/SSH only, no GUI) using a Samsung SSD 970 EVO Plus 500GB
. Only one Invalid Field in Command
appears after each reboot.
AFAICS from the Linux kernel sources, there is support for NVMe Set Features commands, see core.c. This is (only?) used by pci.c to change the power state and by hwmon.c to change the temperature thresholds.
comment:5 by , 8 months ago
I am still following this ticket and #1222 because I still receive NVMe errors.
Now with another Linux build and a newer version of smartmontools:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-21-amd64] (local build) Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
The following warning/error was logged by the smartd daemon: Device: /dev/nvme0, number of Error Log entries increased from 1098 to 1099 Device info: SAMSUNG MZVLB256HAHQ-000H1, S/N:S425NX1M782076, FW:EXD70H1Q, 256 GB
comment:6 by , 8 months ago
@ThoughtPolice84, the new behavior was added on build [5472]. You need to update smartmontools to version 7.4.
Christian, let me know if I can help with any additional information.
comment:7 by , 6 months ago
FWIW, I moved my hard drive to a different machine and just turning the system on registers the new error entries, not even mounting any partition of the disk.
6.9.7-1~bpo12+1 (2024-07-03) x86_64 GNU/Linux
smartctl 7.4 2023-08-01 r5530
comment:9 by , 3 months ago
Christian, is it enough just copy the binaries smartctl
and smartd
replacing the original ones from Debian?
If so, I am still getting an increasing error count using the build r5620
.
comment:10 by , 3 months ago
Overwriting binaries from packages may work, but also break package integrity checks.
Please provide a smartctl -l error
output for further diagnostics.
by , 3 months ago
Attachment: | smartctl-r5620.txt added |
---|
Output of smartctl -l error /dev/nvme0n1
-- r5620
comment:11 by , 3 months ago
I attached file smartctl-r5620.txt
.
There is just one type of error recorded, Invalid Field in Command
. Probably just the first 4 entries were added after I had updated to r5620
.
As I've moved this disk to a different machine than the original bug report I've not been using it as much as before. Then, the other error Feature Identifier Not Saveable
may still be present too, but not triggered recently with the different usage.
comment:12 by , 3 months ago
Please provide the related lines in the syslog output of smartd:
... Device: /dev/nvme..., NVMe error [...], count ... , status 0x...: ... ... Device: /dev/nvme..., NVMe error count increased from ... to ... (...)
The first line is only logged (as LOG_INFO
) for errors interpreted as device related. In this case, the last line is logged as LOG_CRIT
, otherwise as LOG_INFO
.
comment:13 by , 3 months ago
I don't see the mentioned lines in the logs, probably because smartd
is not logging anything other than:
smartd[1112]: This version of smartd was build without 'Type=notify' support.
Is there any binary that is compiled with notifications support that I can download for testing?
comment:14 by , 3 months ago
Smartd exits with status 1 after this message. Remove Type=notify
from smartd.service
(on some systems smartmontools.service
) file and retry.
The binaries at https://builds.smartmontools.org/ do not use libsystemd because otherwise these won't run on systems not using systemd.
comment:15 by , 3 months ago
I enabled the notifications and I started observing some messages logged, but no errors at all.
After booting, the latest messages logged are:
journalctl --utc -xe --priority 0..7 --boot 0 -u smartmontools.service
[...] Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, opened Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, Samsung SSD 960 PRO 512GB, S/N:*****, FW:2B6QCXP7, 512 GB Oct 30 21:55:16 smartd[1127]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list. Oct 30 21:55:16 smartd[1127]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 1 NVMe devices Oct 30 21:55:16 smartd[1127]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62
comment:16 by , 2 months ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I'm reopening this issue, so it won't get forgotten.
Feel free to close again if more appropriate.
comment:17 by , 2 months ago
Milestone: | Release 7.5 → undecided |
---|
If state persistence is not enabled (by default or by smartd -s ...
), no NVMe error log related messages would occur after boot because the previous state is unknown.
Please provide outputs of smartctl -l error /dev/nvme0
before and after boot.
by , 2 months ago
Attachment: | smartctl-2024-11-22T21:35Z.txt added |
---|
smartctl -l error /dev/nvme0 - 2024-11-22T21:35Z
by , 2 months ago
Attachment: | journalctl-83e61d4458eb41488649ccd34a87fc3b-smartmontools-Nov_22.log added |
---|
journalctl --utc -xe --no-hostname --priority 0..7 --boot 83e61d4458eb41488649ccd34a87fc3b -u smartmontools.service
- 2024-11-22
follow-up: 20 comment:18 by , 2 months ago
OK, I'm getting one of the error messages you had pointed (increasing the counter), but the other one (specifying the error status) is not being logged.
Moreover, one of the errors that happens during boot is taking exactly 30 minutes to get into the logs.
I'm not sure if I'm doing something wrong, as I never tweaked any smartd options before, but I modified the smartmontools.service
as:
ExecStart=/usr/sbin/smartd -n -s /var/log/
Then, it started logging and the state files were created.
I noted that the NVMe drive always logs only two lines, different from the other SSDs that record many values into their state files:
cat /var/log/Samsung_SSD_960_PRO_512GB-XXXXXXXXX.nvme.state
}
# smartd state file nvme-err-log-entries = 5432
Then, running the system reboot tests I realize the counter on the state-file gets increased by two.
In journalctl
the logs confirm:
Nov 22 16:30:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5432 to 5434 (0 new, 2 ignored, 0 unknown)
However, running smartctl -l error /dev/nvme0
I realize the counter actually increased by three, reaching 5435.
I kept the machine on, and this error "5435" finally logged the counter increment in journalctl
after exactly 30 minutes:
Nov 22 17:00:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5434 to 5435 (0 new, 1 ignored, 0 unknown)
The same behavior could be observed after other consecutive tests. The system boots, the error counter increases, but one error is not logged yet it can be observed running smartctl -l error /dev/nvme0
. Finally, after keeping the system on for 30 minutes, then this counter increase is logged. If the system is turned off before 30 minutes the message regarding the counter increase is never logged.
Therefore, one of those error counter increments is being logged only after exactly 30 minutes.
I also tested keeping the system turned on for two hours - not using the disk - to observe if any new errors would be logged, but I got no new error messages.
The journalctl
output is identical after every boot, so I'm attaching only a single boot record.
Attached the journalctl
and the smartctl
output:
comment:19 by , 2 months ago
One additional piece of information I got from nvme error-log /dev/nvme0
:
Entry[ 6] ................. error_count : 5435 sqid : 0 cmdid : 0x19 status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field) phase_tag : 0 parm_err_loc : 0xffff lba : 0 nsid : 0x1 vs : 0 trtype : The transport type is not indicated or the error is not transport related. cs : 0 trtype_spec_info: 0 .................
comment:20 by , 2 months ago
I'm not sure if I'm doing something wrong, as I never tweaked any smartd options before, but I modified the
smartmontools.service
as:
ExecStart=/usr/sbin/smartd -n -s /var/log/
Using /var/log
as the directory for *.state
files may work, but is not recommended. A usual configuration is -s /var/lib/smartmontools/smartd-
which may be the configured default, see smartd -h
output.
Then, running the system reboot tests I realize the counter on the state-file gets increased by two.
Injournalctl
the logs confirm:
Nov 22 16:30:58 smartd[1084]: Device: /dev/nvme0, NVMe error count increased from 5432 to 5434 (0 new, 2 ignored, 0 unknown)
smartd
still logs changes of the error count, but only with LOG_INFO
level. It it does not log (and send a warning mail if configured) these non-device errors with LOG_CRIT level (see 0 new, 2 ignored
).
Conclusion: It works as expected. BTW, your report is unrelated to this specific ticket because Set Feature
related errors are not listed in the provided logs.
Please create a new ticket if not convinced or further issues are found.
comment:21 by , 2 months ago
Milestone: | undecided → Release 7.5 |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
r5620 (restoring original resolution).
I take that back because this device indicates SV bit (
Sav/Sel_Feat
) support:The individual Feature ID used may not support SV.