Opened 9 months ago

Closed 9 months ago

Last modified 3 months ago

#1828 closed defect (duplicate)

smartctl -x creates new NVMe errors

Reported by: rzsn Owned by:
Priority: major Milestone: Release 7.5
Component: smartctl Version: 7.4
Keywords: nvme Cc: rzsn

Description

I was given a drive to check its condition, but the error count kept increasing. After a while I noticed that it increased by 1 with each smartctl -x call (I was repeatedly checking the state of the SSD while ext4 lazy inode init was doing its job after formatting the partition).

Since the errors do not indicate a failure but rather a malformed command, could something be done in smartctl to prevent this from happening?

My expectation: an error checking tool shall not generate errors while doing so.

I see multiple reasons to fix this. If somebody calls smartctl repeatedly, the error count increases although it does not represent any actual error condition of the drive.

If somebody calls smartctl too aggressively, actual errors might get lost
(if the NVMe error log is a circular buffer, which afaik it is; 64 entries on mine).

Ticket #1222 seems to be about something similar:
Smartd should ignore non-error entries from NVMe Error Information log

But hiding such entries is one thing, while creating actual error entries is NOT desirable.

So the drive in question is a WD SN640 nvme:

=== START OF INFORMATION SECTION ===
Model Number:                       WUS4BB076D7P3E3
Serial Number:                      ********
Firmware Version:                   R111000L
PCI Vendor/Subsystem ID:            0x1b96
IEEE OUI Identifier:                0x0014ee
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            0014ee 83066cbb80
Local Time is:                      Tue Apr 30 20:30:15 2024 CEST
Firmware Updates (0x19):            4 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f):   Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius
Namespace 1 Features (0x02):        NA_Fields

Current state at end of smartctl -x call:

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         12     0  0xd009  0xc004      -            0     1     -  Invalid Field in Command
  1         11     0  0xc008  0xc004      -            0     1     -  Invalid Field in Command
  2         10     0  0xa00b  0xc004      -            0     1     -  Invalid Field in Command
  3          9     0  0x900a  0xc004      -            0     1     -  Invalid Field in Command
  4          8     0  0x8009  0xc004      -            0     1     -  Invalid Field in Command
  5          7     0  0xa00e  0xc004      -            0     1     -  Invalid Field in Command
  6          6     0  0x900d  0xc004      -            0     1     -  Invalid Field in Command
  7          5     0  0x7008  0xc004      -            0     1     -  Invalid Field in Command
  8          4     0  0x800c  0xc004      -            0     1     -  Invalid Field in Command
  9          3     0  0x600c  0xc004      -            0     1     -  Invalid Field in Command
 10          2     0  0x100a  0xc004      -            0     1     -  Invalid Field in Command
 11          1     0  0x300e  0xc004  0x028            0     0     -  Invalid Field in Command

Using nvme-cli error-log, I see that all these errors (except the oldest) are of this kind:

.................
 Entry[ 0]
.................
error_count     : 12
sqid            : 0
cmdid           : 0xd009
status_field    : 0x6002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0x1
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0
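
The two tools print different raw status values for what appear to be the same entries: 0xc004 in the smartctl table and 0x6002 in the nvme-cli output. A minimal sketch of why both can be right, assuming the NVMe Error Information log layout in which bit 0 of the 16-bit status word is the phase tag and bits 15:1 hold the status field. The function name and the dictionary keys are hypothetical, chosen for illustration only:

```python
def decode_status(raw: int, includes_phase_bit: bool) -> dict:
    """Split an NVMe status value into its sub-fields (illustrative helper)."""
    # Assumption: in the raw log word, bit 0 is the phase tag and
    # bits 15:1 carry the status field, so shift it out when present.
    status = (raw >> 1) if includes_phase_bit else raw
    status &= 0x7FFF
    return {
        "status": status,
        "sc": status & 0xFF,             # Status Code (0x02 = Invalid Field in Command)
        "sct": (status >> 8) & 0x7,      # Status Code Type (0 = generic)
        "more": bool(status & 0x2000),   # More: further details are in the error log
        "dnr": bool(status & 0x4000),    # Do Not Retry
    }


smartctl_view = decode_status(0xC004, includes_phase_bit=True)
nvme_cli_view = decode_status(0x6002, includes_phase_bit=False)

print(hex(smartctl_view["status"]))    # value after dropping the phase bit
print(smartctl_view == nvme_cli_view)  # both tools describe the same status
```

Under that assumption, both values decode to status code 0x02 of the generic type with the "More" and "Do Not Retry" bits set, which matches the "Invalid Field in Command" text shown by both tools.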

How can we trace this to the exact command that smartctl issues?

Change History (6)

comment:1 by Christian Franke, 9 months ago

Keywords: error log removed
Milestone: undecided

smartctl -x is the same as smartctl -H -i -c -A -l error -l selftest.

Please try the latter without -l selftest, or use the broadcast namespace instead of namespace 1, or use a recent CI build from https://builds.smartmontools.org/

If this works, this ticket is a duplicate of ticket #1741.

comment:2 by rzsn, 9 months ago

With version 7.4 as installed on my Gentoo system:

  • errors are NOT generated when using /dev/nvme0
  • errors are GENERATED with /dev/nvme0n1 as an argument
  • errors are NOT generated when -l selftest is omitted (with either nvme0 or nvme0n1)

Now with CI build version:
smartctl pre-7.5 2024-04-26 r5612 [x86_64-linux-6.8.1-gentoo-x86_64] (CircleCI)

  • no errors are generated, for all 4 combinations (-l selftest/without)*(nvme0/nvme0n1)
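
The test matrix above can be sketched as a small script. This is only an illustration under stated assumptions: the helper names are hypothetical, the device paths are the ones from this report, and the parser assumes that the smartctl -A output for NVMe contains a line starting with "Error Information Log Entries:". Running smartctl normally requires root.

```python
import itertools
import re
import subprocess

# Options that "smartctl -x" expands to, minus "-l selftest" (see comment:1).
BASE_OPTS = ["-H", "-i", "-c", "-A", "-l", "error"]


def build_commands(devices=("/dev/nvme0", "/dev/nvme0n1")):
    """Return the four invocations: (device) x (without / with -l selftest)."""
    cmds = []
    for dev, with_selftest in itertools.product(devices, (False, True)):
        opts = BASE_OPTS + (["-l", "selftest"] if with_selftest else [])
        cmds.append(["smartctl", *opts, dev])
    return cmds


def error_log_entries(smartctl_output):
    """Extract the error counter; the line label is an assumption about the output."""
    m = re.search(r"^Error Information Log Entries:\s*([\d,]+)",
                  smartctl_output, re.MULTILINE)
    return int(m.group(1).replace(",", "")) if m else None


def run(cmd):
    """Run one smartctl command and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def errors_added_per_call(cmd, runner=run):
    """Run the same command twice and return how much the counter grew."""
    before = error_log_entries(runner(cmd))
    after = error_log_entries(runner(cmd))
    if before is None or after is None:
        return None
    return after - before
```

Calling errors_added_per_call(cmd) for each entry of build_commands() on an otherwise idle drive should give 0 for a combination that does not create log entries and a positive number for one that does. Other activity on the drive could also change the counter, so the result is only a hint.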

I can confirm that your hint was spot on and that this bug is resolved in the CI build.

The wording in ticket #1741 is different from mine, but the root cause of both tickets is the same. I was more worried about the drive's error log being filled with bogus entries than about the sub-command actually failing. But yes, after further review of the outputs, I now see this message, which I had overlooked before:

Read Self-test Log failed: Invalid Field in Command (0x6002)

Thanks very much for excellent support!

comment:3 by rzsn, 9 months ago

Resolution: duplicate
Status: new → closed

This ticket is a duplicate of ticket #1741.

comment:4 by Christian Franke, 9 months ago

You're welcome. Thanks for the detailed test report.

comment:5 by Christian Franke, 9 months ago

Milestone: undecided → Release 7.5

comment:6 by Christian Franke, 3 months ago

Please see updated info in recent comment of ticket #1741.
