#1404 closed defect (fixed)
"smartctl -l error" on NVMe device crashes the Linux kernel
Reported by: | Jerome Kieffer | Owned by: | Christian Franke |
---|---|---|---|
Priority: | major | Milestone: | Release 7.2 |
Component: | all | Version: | 7.1 |
Keywords: | linux nvme | Cc: |
Description
Reading the error logged in the drive crashes the computer. The NVMe drive can be found (again) only after power cycle.
Device:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.0-3-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Micron_2200_MTFDHBA1T0TCK
Serial Number: 1941246F5E81
Firmware Version: P1MU003
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 01246f5e81
Local Time is: Mon Nov 30 21:32:25 2020 CET
Log of the Linux kernel (version 5.9) when crashing:
Nov 30 20:59:31 antarctica kernel: [ 1068.251549] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.262228] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.273520] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.284264] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.294988] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.305695] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.316401] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.327219] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.337824] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.348643] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.359339] amd_iommu_report_page_fault: 4 callbacks suppressed
Nov 30 20:59:32 antarctica kernel: [ 1068.359342] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.369938] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.380664] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.391459] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.402169] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.412776] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.424438] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.435706] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.435722] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0380 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.446479] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Change History (15)
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 4 years ago
Keywords: | linux nvme added; NVMe removed |
---|---|
Milestone: | → undecided |
Priority: | minor → major |
comment:3 by , 4 years ago
Summary: | "smartctl -l error" on NVMe device crashes the computer → "smartctl -l error" on NVMe device crashes the Linux kernel |
---|
comment:4 by , 4 years ago
Dear Christian,
Thanks for taking this bug seriously. I checked whether Micron has a new firmware, without success. Dell and HP, who ship this kind of SSD in their laptops, did upgrade their firmware.
~$ sudo nvme id-ctrl /dev/nvme0n1 |grep elp
elpe : 255
The first entry could be read without error:
~$ sudo nvme error-log --log-entries=1 /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:1
.................
 Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
And the last one:
~$ sudo nvme error-log --log-entries=256 /dev/nvme0n1
NVMe status: INVALID_FIELD: A reserved coded value or an unsupported value in a defined field(0x2)
Apparently only 64 entries are readable, unlike the 256 advertised.
None of the suggested commands triggered any warnings in the logs or any crashes.
comment:5 by , 4 years ago
The root of the problem is possibly the transfer size. nvme error-log ... limits each single transfer to 4KiB (64 entries) since commit 465a4d (requires NVMe 1.2.1+).
Please check the MDTS (Maximum Data Transfer Size):
$ sudo smartctl -c /dev/nvme0n1 | grep Maximum
$ sudo nvme id-ctrl /dev/nvme0n1 | grep mdts
comment:6 by , 4 years ago
$ sudo smartctl -c /dev/nvme0n1 | grep Maximum
Maximum Data Transfer Size: 128 Pages
$ sudo nvme id-ctrl /dev/nvme0n1 | grep mdts
mdts : 7
Debian testing (currently) provides nvme-cli version 1.12-5.
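The two values reported above are the same limit in different units. A minimal sketch of the conversion follows, assuming MDTS is a power-of-two exponent in units of the minimum memory page size and that this page size is 4KiB (an assumption, consistent with the 512KiB figure quoted later in this ticket). The 64-byte error log entry size is likewise inferred from the "16KiB for 256 entries" figure; the function name is hypothetical.

```python
from typing import Optional

PAGE_SIZE = 4096   # assumption: minimum memory page size is 4KiB
ENTRY_SIZE = 64    # assumption: 16KiB error log / 256 entries


def mdts_to_bytes(mdts: int, page_size: int = PAGE_SIZE) -> Optional[int]:
    """Convert the MDTS exponent to a byte limit.

    A value of 0 is treated as "no limit reported" and yields None.
    """
    if mdts == 0:
        return None
    return (1 << mdts) * page_size


mdts = 7                              # from "nvme id-ctrl ... | grep mdts"
pages = 1 << mdts                     # 128, matches smartctl's "128 Pages"
max_bytes = mdts_to_bytes(mdts)       # 524288 bytes = 512KiB
full_log_bytes = 256 * ENTRY_SIZE     # 16384 bytes = 16KiB for elpe+1 = 256

print(pages, max_bytes, full_log_bytes, full_log_bytes <= max_bytes)
```

Under these assumptions the whole error log is well below the advertised transfer limit, so MDTS alone does not explain the crash.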
comment:7 by , 4 years ago
Milestone: | undecided → Release 7.2 |
---|---|
Owner: | set to |
Status: | new → accepted |
nvme-cli 1.12-... includes this change.
With MDTS 7 (128 pages, 512KiB), reading the whole 16KiB error log, as smartctl currently does, should work. There might be another restriction (or bug) in the NVMe pass-through I/O-control of the kernel. Similar problems were not reported for other platforms.
I will change smartctl such that log transfers are limited to 4KiB. This should prevent such kernel or device crashes. Drawback: Error log entries > 64 could not be read from older (NVMe 1.2 or earlier) drives or if the NVMe pass-through layer could not pass CDW12 (-d sntrealtek).
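The effect of the 4KiB limit can be sketched as a read plan. This is an illustration only, not the smartctl implementation: it assumes 64-byte error log entries (16KiB / 256 entries) and uses a hypothetical helper name. Each chunk after the first needs a non-zero log page offset, which is why drives without offset support stop at 64 entries.

```python
ENTRY_SIZE = 64      # assumption: 16KiB error log / 256 entries
CHUNK_BYTES = 4096   # transfer limit described in the comment above


def plan_chunks(num_entries, chunk_bytes=CHUNK_BYTES, entry_size=ENTRY_SIZE):
    """Return a list of (byte_offset, entries_in_chunk) tuples.

    Only the first chunk (offset 0) can be read without log page offset support.
    """
    per_chunk = chunk_bytes // entry_size
    chunks = []
    first = 0
    while first < num_entries:
        count = min(per_chunk, num_entries - first)
        chunks.append((first * entry_size, count))
        first += count
    return chunks


plan = plan_chunks(256)
readable_without_offset = plan[0][1]
missing_without_offset = 256 - readable_without_offset

for offset, count in plan:
    print(f"offset={offset:5d} entries={count}")
print("missing without offset support:", missing_without_offset)
```

For the default of 16 entries used by `smartctl -a`, the plan is a single chunk at offset 0, so the limit has no visible effect there.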
comment:9 by , 4 years ago
If possible, please test r5123 or later. Binaries are available here: https://builds.smartmontools.org/.
comment:10 by , 4 years ago
Thank you for the quick response. I tested r5123 and the logs can be read without issue:
local/sbin$ sudo ./smartctl -a /dev/nvme0n1
[sudo] password for jerome:
smartctl 7.2 2020-12-03 r5123 [x86_64-linux-5.9.0-3-amd64] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Micron_2200_MTFDHBA1T0TCK
Serial Number: 1941246F5E81
Firmware Version: P1MU003
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Controller ID: 0
NVMe Version: 1.2.1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 01246f5e81
Local Time is: Thu Dec 3 21:39:40 2020 CET
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op     Max  Active  Idle  RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    8.25W       -     -   0  0  0  0        0       0
 1 +    2.40W       -     -   1  1  1  1        0       0
 2 +    1.90W       -     -   2  2  2  2        0       0
 3 -  0.0800W       -     -   3  3  3  3    10000    2500
 4 -  0.0050W       -     -   4  4  4  4     5000   44000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 44 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 64,222 [32.8 GB]
Data Units Written: 296,424 [151 GB]
Host Read Commands: 692,480
Host Write Commands: 986,574
Controller Busy Time: 24
Power Cycles: 34
Power On Hours: 8
Unsafe Shutdowns: 12
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 44 Celsius
Temperature Sensor 2: 47 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
comment:11 by , 4 years ago
Thanks! Please also test smartctl -l error,256 /dev/nvme0n1, as I don't have a device with > 64 error log entries, so reads with LPO (Log Page Offset) are never needed on my devices.
comment:12 by , 4 years ago
The 192 missing logs are seen as such:
local/sbin$ sudo ./smartctl -l error,256 /dev/nvme0n1
smartctl 7.2 2020-12-03 r5123 [x86_64-linux-5.9.0-3-amd64] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Read Error Information Log failed, 192 entries missing: NVMe Status 0x02
Error Information (NVMe Log 0x01, 64 of 256 entries)
No Errors Logged
comment:13 by , 4 years ago
NVMe Status 0x02 means Invalid Field in Command. This suggests that this device does not implement LPO (Log Page Offset) support. It IMO should as it advertises NVMe 1.2.1 compatibility. So the full error log could neither be read in one step nor in smaller chunks.
I consider this ticket as fixed, although the actual reason for the crash could not be fully determined (bug in device firmware, bug in Linux kernel, still hidden bug in smartmontools, ...).
Thanks for this bug report and testing.
comment:15 by , 4 years ago
... this device does not implement LPO (Log Page Offset) support. It IMO should as it advertises NVMe 1.2.1 compatibility.
I take that back: LPO support is optional and is indicated in the LPA field of the Identify Controller data structure. Fixed in r5124. With r5125, the LPA field is printed as Log Page Attributes ....
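A hedged sketch of the check described above: whether offset-based reads are worth attempting depends on one bit of the LPA field. The bit position used here (bit 2, "extended data for Get Log Page") is my reading of the NVMe specification and should be treated as an assumption; the function names are hypothetical and do not reflect the smartctl code. The LPA value of the device in this ticket is not shown anywhere above, so the values below are examples only.

```python
# Assumption: LPA bit 2 signals extended Get Log Page data, which includes
# the Log Page Offset (LPO) field. Verify against the NVMe specification.
LPA_EXTENDED_DATA = 1 << 2


def supports_log_page_offset(lpa: int) -> bool:
    """True if the (assumed) extended-data bit is set in the LPA byte."""
    return bool(lpa & LPA_EXTENDED_DATA)


def readable_entries(total_entries: int, lpa: int, per_chunk: int = 64) -> int:
    """Entries reachable with 4KiB transfers (64 entries per chunk assumed).

    Without offset support only the first chunk can be read.
    """
    if supports_log_page_offset(lpa):
        return total_entries
    return min(total_entries, per_chunk)


# Example LPA values, not taken from the device in this ticket.
print(readable_entries(256, 0x00))  # offset unsupported: first chunk only
print(readable_entries(256, 0x04))  # offset supported: all entries
```

With the bit clear, the result of 64 readable entries matches the "64 of 256 entries" output reported in comment 12.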
See Debian Bug 947803 for a related report for a Micron 2200S with firmware 22001030 and 22001040 under Linux.
How many error log pages does this device support?
If possible, please run (from package nvme-cli):
The value elpe+1 (Error Log Page Entries) specifies the number of error log pages.
Does the crash also occur with nvme-cli and if only a subset of all pages is read?