Opened 5 years ago

Closed 7 months ago

#1222 closed enhancement (fixed)

Smartd should ignore non-error entries from NVMe Error Information log

Reported by: Christian Franke
Owned by: Christian Franke
Priority: minor
Milestone: Release 7.4
Component: smartd
Version: 7.0
Keywords: nvme
Cc: Gerald Turner, Adam Piggott, Patrick Decat, Peter Nowee

Description

Some drives frequently add entries to the NVMe Error Information log which do not reflect an actual error.

Smartd issues a LOG_CRIT message if the Number of Error Information Log Entries from the SMART/Health Information log has increased since the last check. This is misleading for such drives.

Smartd should check whether the number of actual errors has increased since the last check.

For original report and sample outputs see Debian Bug 900244.

Attachments (2)

comment_21-journalctl_smartmontools.txt (2.2 KB) - added by BrianG 7 months ago.
Comment 21 - journalctl smartmontools
comment_21-smartmontools.txt (4.4 KB) - added by BrianG 7 months ago.
Comment 21 - smartmontools


Change History (25)

comment:1 by Gerald Turner, 5 years ago

Cc: Gerald Turner added

comment:2 by Christian Franke, 5 years ago

NVMe 1.4a section 5.14.1.1 - Error Information: The controller should clear this log page by removing all entries on power cycle and Controller Level Reset.

Therefore the Error Information log may be empty after an increase of Number of Error Information Log Entries has been detected. Smartd is unable to check for new actual errors then.

See Ubuntu Bug 1878264 for an example.

comment:3 by Adam Piggott, 4 years ago

Cc: Adam Piggott added

comment:4 by Marcel Partap, 2 years ago

As more and more computers are equipped with NVMe SSDs, and as we would like people to run quality operating systems with proper disk surveillance, these spurious errors (mostly upon resuming from standby) are a growing UX defect. Here's a sample specimen:

.................
 Entry[42]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(Successful Completion: The command completed without error)
phase_tag       : 0
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

Thus, I'd propose this simple solution: do not raise a critical alert for NVMe storage devices if both error_count == 0 and status_field == 0. Does anyone see a potential downside to this?

comment:5 by Christian Franke, 20 months ago

Results from various sources suggest that smartd should ignore error information log entries with ((status_field >> 1) & 0xfff) <= 0x002:
SCT/SC 0x0/0x00: Generic Command Status / Successful Completion,
SCT/SC 0x0/0x01: Generic Command Status / Invalid Command Opcode,
SCT/SC 0x0/0x02: Generic Command Status / Invalid Field in Command.
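
The rule in this comment can be illustrated with a short sketch. It is written in Python for readability and is not smartd's actual code; the function names `is_ignorable_status` and `count_actual_errors` are hypothetical. It assumes that `status_field` is the raw 16-bit value with the phase tag in bit 0, which is what the shift by one in the expression above implies.

```python
def is_ignorable_status(status_field: int) -> bool:
    """Return True if an entry matches the rule proposed in this comment.

    Assumption: status_field is the raw 16-bit Status Field with the phase
    tag in bit 0. After shifting right by one, bits 7:0 hold the Status
    Code (SC) and bits 10:8 the Status Code Type (SCT).
    """
    sct_sc = (status_field >> 1) & 0xFFF
    return sct_sc <= 0x002


def count_actual_errors(status_fields) -> int:
    """Count the entries that would still be treated as real errors."""
    return sum(1 for s in status_fields if not is_ignorable_status(s))


if __name__ == "__main__":
    # Hypothetical raw values: SC 0x00, 0x01, 0x02 and 0x03, all with SCT 0.
    samples = [0x0000, 0x0002, 0x0004, 0x0006]
    for s in samples:
        print(f"0x{s:04x} ignorable: {is_ignorable_status(s)}")
    print("actual errors:", count_actual_errors(samples))
```

With this predicate only the last sample (SC 0x03) would count as an actual error. Whether an alert is raised would then depend on that filtered count rather than on the raw number of log entries.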

comment:6 by Christian Franke, 20 months ago

Related: ticket #1663.

comment:7 by Christian Franke, 20 months ago

Ticket #1722 has been marked as a duplicate of this ticket.

comment:8 by kolAflash, 19 months ago

@Christian Franke:
I checked my nvme error-log output.
Excerpt:
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
And error_count is increasing with each error log entry.
See here for full details: ticket:1722#comment:5

Also I created a Linux kernel ticket.
https://bugzilla.kernel.org/show_bug.cgi?id=217445

And I guess these can be related too:

comment:9 by Christian Franke, 19 months ago

Milestone: unscheduled → Release 7.4
Owner: set to Christian Franke
Status: new → accepted

comment:10 by Christian Franke, 19 months ago

Resolution: fixed
Status: accepted → closed

comment:11 by Patrick Decat, 19 months ago

Cc: Patrick Decat added

comment:12 by Christian Franke, 16 months ago

GH issues/208 has been marked as a duplicate of this ticket.

comment:13 by Peter Nowee, 15 months ago

Cc: Peter Nowee added

comment:14 by ThoughtPolice84, 13 months ago

I have this error on a Samsung NVMe SSD drive in my openmediavault NAS system.
Smartd daemon error output:

The following warning/error was logged by the smartd daemon:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, number of Error Log entries increased from 348 to 349

Device info:
SAMSUNG MZVLB256HAHQ-000H1, S/N:*, FW:EXD70H1Q, NSID:1, 256 GB

I see this issue has been marked as resolved (r5472). Is there anything I should do in the configuration of smartmontools?

in reply to:  14 comment:15 by Christian Franke, 13 months ago

Replying to ThoughtPolice84:

I have this error on a Samsung NVMe SSD drive in my openmediavault NAS system.
Smartd daemon error output: ...

Please check the syslog for a LOG_INFO message like

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, NVMe error [INDEX], count 349, status 0xSTATUS: MESSAGE

and report it here including the version of smartd.

The message should precede this LOG_CRIT message:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, number of Error Log entries increased from 348 to 349

comment:16 by ThoughtPolice84, 13 months ago

Version:

smartd 7.2 2020-12-30 r5155 [x86_64-linux-6.1.0-0.deb11.11-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

The error:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1_S425NX1M782076, number of Error Log entries increased from 348 to 349

in reply to:  16 comment:17 by Christian Franke, 13 months ago

smartd 7.2 2020-12-30 r5155 [x86_64-linux-6.1.0-0.deb11.11-amd64] (local build)

This version is far too old. The new behavior added in r5472 requires smartmontools release 7.4.

comment:18 by BrianG, 7 months ago

I am observing this issue in the Debian package version 7.4-2~bpo12+1. I am using Samsung 960 Pro, and the error count keeps increasing on every reboot. Let me know if you need additional information.

$ smartctl --version
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.4-3-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See https://www.gnu.org for further details.

smartmontools release 7.4 dated 2023-08-01 at 10:59:45 UTC
smartmontools SVN rev 5530 dated 2023-08-01 at 11:00:21
smartmontools build host: x86_64-pc-linux-gnu
smartmontools build with: C++11, GCC 12.2.0
smartmontools configure arguments: [hidden in reproducible builds]
reproducible build SOURCE_DATE_EPOCH: 1701758292 (2023-12-05 03:38:12)
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       4649     0  0x0015  0x4004      -            0     1     -  Invalid Field in Command
  1       4648     0  0x0018  0x4004      -            0     0     -  Invalid Field in Command
  2       4647     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  3       4646     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
  4       4645     0  0x400c  0x4004      -            0     1     -  Invalid Field in Command
  5       4644     0  0x200e  0x4004      -            0     0     -  Invalid Field in Command
  6       4643     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  7       4642     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
  8       4641     0  0x901d  0x4004      -            0     1     -  Invalid Field in Command
  9       4640     0  0x0008  0x4004      -            0     0     -  Invalid Field in Command
 10       4639     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
 11       4638     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
 12       4637     0  0xe01c  0x4004      -            0     1     -  Invalid Field in Command
 13       4636     0  0x601d  0x4004      -            0     0     -  Invalid Field in Command
 14       4635     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
 15       4634     0  0x0012  0x4004      -            0     0     -  Invalid Field in Command
... (48 entries not read)
 Entry[61]   
.................
error_count	: 4588
sqid		: 0
cmdid		: 0x1b
status_field	: 0x210d(Feature Identifier Not Saveable: The Feature Identifier specified does not support a saveable value)
phase_tag	: 0
parm_err_loc	: 0x28
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................
 Entry[62]   
.................
error_count	: 4587
sqid		: 0
cmdid		: 0x12
status_field	: 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag	: 0
parm_err_loc	: 0xffff
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

comment:19 by BrianG, 7 months ago

Resolution: fixed → (none)
Status: closed → reopened

I think I had forgotten to reopen the issue.

comment:20 by Christian Franke, 7 months ago

Milestone: Release 7.4 → undecided

I am using Samsung 960 Pro, and the error count keeps increasing on every reboot.

This is because the Linux kernel issues some unsupported commands during boot.

 0       4649     0  0x0015  0x4004      -            0     1     -  Invalid Field in Command
 1       4648     0  0x0018  0x4004      -            0     0     -  Invalid Field in Command
 2       4647     0  0x001b  0x421a  0x028            0     0     -  Feature Identifier Not Saveable

Please provide the resulting syslog output of smartd. The Invalid Field in Command should not trigger a warning from smartd but the Feature Identifier Not Saveable may do.
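
To see why the two status values above are treated differently, they can be split into Status Code Type (SCT) and Status Code (SC). The sketch below is only an illustration: `decode_status` is a hypothetical helper, it assumes the Status column printed by smartctl is the raw 16-bit field with the phase tag in bit 0, and it applies the `<= 0x002` rule from comment:5 without having been checked against the code in r5472.

```python
def decode_status(raw_status: int):
    """Split a raw 16-bit Status Field (phase tag in bit 0) into (SCT, SC)."""
    sct = (raw_status >> 9) & 0x7   # Status Code Type
    sc = (raw_status >> 1) & 0xFF   # Status Code
    return sct, sc


if __name__ == "__main__":
    # Status values taken from the smartctl output quoted in this ticket.
    for raw in (0x4004, 0x421A):
        sct, sc = decode_status(raw)
        ignored = ((raw >> 1) & 0xFFF) <= 0x002  # rule from comment:5
        print(f"0x{raw:04x}: SCT/SC 0x{sct:x}/0x{sc:02x}, ignored={ignored}")
```

Under these assumptions 0x4004 decodes to SCT/SC 0x0/0x02 (Invalid Field in Command) and falls under the rule, while 0x421a decodes to SCT/SC 0x1/0x0d (Feature Identifier Not Saveable) and does not.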

by BrianG, 7 months ago

Comment 21 - journalctl smartmontools

by BrianG, 7 months ago

Comment 21 - smartmontools

comment:21 by BrianG, 7 months ago

As you had mentioned, I don't see Invalid Field in Command in journalctl nor in dmesg. However, all error counters in smartctl are increased from boot to boot.

The Feature Identifier Not Saveable is observed in the attached journalctl output.

The counters increase by (at least) 3 from boot to boot. In addition, I observe that the counters sometimes increase by one more error, probably while the system reboots, and this extra error is not logged in journalctl.

I'm attaching the output of the following commands:

  • File comment_21-journalctl_smartmontools.txt:

$ journalctl --no-hostname --utc -xo short-iso --boot 0 --unit=smartmontools

  • File comment_21-smartmontools.txt:

$ smartctl -a /dev/nvme0n1

Let me know if you need any additional information.

in reply to:  21 comment:22 by Christian Franke, 7 months ago

Thanks for the detailed info. It shows that smartd works as expected but should ignore more error codes; see the new ticket #1835.

comment:23 by Christian Franke, 7 months ago

Milestone: undecided → Release 7.4
Resolution: fixed
Status: reopened → closed

r5472 (restoring original resolution, for new errors see ticket #1835).
