Opened 5 years ago
Last modified 3 weeks ago
#1218 new defect
Self tests pass under Windows 10 1809 but get stuck at 90% under Linux
Reported by: | Artem S. Tashkinov | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | undecided |
Component: | smartctl | Version: | 7.0 |
Keywords: | ata | Cc: |
Description
I've got a strange issue with my SSD disk: all self tests pass under Windows 10 1809 but get stuck at 90% under Linux no matter how much I wait for their completion. I barely have any disk activity, so there's no reason for the tests to get stuck.
Under both OSes I use the same smartmontools version - 7.0 release.
In the attached log, the 90% "aborted by host" entries are Linux results; the 00% "completed without error" entries are Windows results.
Attachments (1)
Change History (12)
by , 5 years ago
Attachment: | smartmontools-x.log added |
---|
comment:1 by , 5 years ago
comment:2 by , 5 years ago
This has been happening for quite some time under Linux kernels 4.18, 4.19, 5.0 and 5.1. Haven't tested Linux 5.2 because it hasn't been made available in Fedora 30 yet. I'm not sure any previous kernels worked at all.
This looks like a weird bug in the Linux kernel SATA layer but I'm an absolute newbie in this area, so it's just a wild guess.
comment:4 by , 5 years ago
Component: | all → smartctl |
---|---|
Keywords: | ata added |
Milestone: | → undecided |
... I barely have any disk activity, so there's no reason for the tests to get stuck.
The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.
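A minimal sketch of such a workaround, assuming (this is an interpretation, not a quote from the FAQ) that the drive drops the self-test when it goes idle, so a small periodic read is enough to keep it busy. The function name `keep_awake`, the 60-second interval, and the device path in the usage note are hypothetical. Dropping cached pages with `posix_fadvise` is a best-effort attempt to make each read reach the device; whether it does so on a given block device is not verified here.

```python
import os
import time


def keep_awake(path, interval_s=60.0, rounds=None, block_size=512):
    """Read one block from `path` every `interval_s` seconds.

    rounds=None loops until interrupted; an integer limits the number of reads.
    Returns the number of reads performed.
    """
    reads = 0
    while rounds is None or reads < rounds:
        fd = os.open(path, os.O_RDONLY)
        try:
            # Best effort: ask the kernel to drop cached pages so the read
            # is not served from the page cache (Linux/POSIX only).
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            os.read(fd, block_size)
        finally:
            os.close(fd)
        reads += 1
        if rounds is None or reads < rounds:
            time.sleep(interval_s)
    return reads
```

Hypothetical usage while a long self-test is running (requires read access to the device): `keep_awake("/dev/sda", interval_s=60)`, stopped with Ctrl-C once the test has finished.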
comment:5 by , 5 years ago
Also, I'm curious why I see so many errors.
This is unrelated to the above.
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   092   092   001    -    4987
...
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    5
...
196 Reallocated_Event_Count -O--CK   253   253   001    -    0
198 Offline_Uncorrectable   ----CK   100   100   001    -    0
No reallocated sectors, no sectors pending for reallocation, ...
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
...
Error 5 [0] occurred at disk power-on lifetime: 3665 hours (152 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 40 00 00 00 00 00 14 40 08     00:02:31.000  READ FPDMA QUEUED
  ...
Error 4 [3] occurred at disk power-on lifetime: 3525 hours (146 days + 21 hours)
  ...
  40 -- 51 00 20 00 00 00 00 00 10 40 00  Error: UNC at LBA = 0x00000010 = 16
  ...
Error 3 [2] occurred at disk power-on lifetime: 3483 hours (145 days + 3 hours)
  ...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
  ...
Error 2 [1] occurred at disk power-on lifetime: 3348 hours (139 days + 12 hours)
  ...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
  ...
... very few transient read errors on LBA 16 and 20 occurred more than 1300 power on hours ago, ...
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4987         -
... but a recent extended test succeeded. Conclusion: No sign of trouble. The possible weak sectors have been fixed by new write commands.
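The "more than 1300 power on hours ago" figure can be checked by subtracting each error's logged lifetime from the current Power_On_Hours value (4987). A small sketch of that arithmetic, using the error-log lines quoted above; the helper name `error_ages` is made up for illustration:

```python
import re

ERROR_RE = re.compile(
    r"Error (\d+) \[\d+\] occurred at disk power-on lifetime: (\d+) hours"
)


def error_ages(log_text, power_on_hours):
    """Return {error_number: power-on hours elapsed since that error}."""
    return {
        int(num): power_on_hours - int(hours)
        for num, hours in ERROR_RE.findall(log_text)
    }


log = """\
Error 5 [0] occurred at disk power-on lifetime: 3665 hours (152 days + 17 hours)
Error 4 [3] occurred at disk power-on lifetime: 3525 hours (146 days + 21 hours)
Error 3 [2] occurred at disk power-on lifetime: 3483 hours (145 days + 3 hours)
Error 2 [1] occurred at disk power-on lifetime: 3348 hours (139 days + 12 hours)
"""

ages = error_ages(log, power_on_hours=4987)
print(ages)  # {5: 1322, 4: 1462, 3: 1504, 2: 1639}
```

The most recent error (Error 5) is 1322 power-on hours old, which matches the estimate.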
PS: This is a bug tracker, not a support forum. For future support questions, please use the smartmontools-support mailing list instead.
comment:6 by , 5 years ago
The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.
This is definitely not the case. It's a system disk, so there's some activity.
What I meant with "next to no activity" is that I'm not copying huge files between partitions or anything like that.
comment:7 by , 5 years ago
Check kernel log for any disk related messages which occur during self-test.
Check these counters before and after self-test:
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
If there is any increase, there must have been an event which resets the SATA link or the device. This typically aborts any self-test.
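The before/after comparison can be scripted. The sketch below assumes two saved `smartctl -x` outputs in the column layout shown in this ticket and reports every Phy event counter that grew. The function names and the sample rows are illustrative only and are not part of smartmontools.

```python
import re

# One counter row looks like:
# 0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
ROW_RE = re.compile(
    r"^[ \t]*(0x[0-9a-fA-F]{4})[ \t]+(\d+)[ \t]+(\d+)[ \t]+(.+?)[ \t]*$",
    re.MULTILINE,
)


def phy_counters(smartctl_x_output):
    """Return {counter_id: (value, description)} from the Phy counter section."""
    start = smartctl_x_output.find("SATA Phy Event Counters")
    if start < 0:
        return {}
    section = smartctl_x_output[start:]
    return {
        cid.lower(): (int(value), desc)
        for cid, _size, value, desc in ROW_RE.findall(section)
    }


def increased_counters(before, after):
    """Return {counter_id: (old, new, description)} for counters that grew."""
    old = phy_counters(before)
    grown = {}
    for cid, (new_value, desc) in phy_counters(after).items():
        old_value = old.get(cid, (0, desc))[0]
        if new_value > old_value:
            grown[cid] = (old_value, new_value, desc)
    return grown


# Illustrative sample data in the format printed by smartctl -x.
before = """\
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
"""
after = before.replace("2            0  Transition", "2            1  Transition")

for cid, (old_v, new_v, desc) in increased_counters(before, after).items():
    print(f"{cid}: {old_v} -> {new_v}  {desc}")
```

An empty result means no counter changed, which would make a link or device reset during the test less likely.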
comment:8 by , 3 years ago
The issue persists even with kernel 5.16.19.
No idea what's going on, nothing is logged in dmesg, all the tests invariably fail with "Interrupted (host reset)" even if I run smartctl -a hours after I run smartctl -t long/short.
It's as if tests immediately fail when launched under Linux.
comment:9 by , 3 years ago
Here is the smartctl -x output before and after running smartctl -t short /dev/sda, which should have completed in 2 minutes. The second smartctl -x was run 5 minutes after attempting to run a short self-test. The test was aborted at 90%.
diff --git a/before.txt b/after.txt
index 4f3d22c..d2d7522 100644
--- a/before.txt
+++ b/after.txt
@@ -32,9 +32,8 @@ General SMART Values:
 Offline data collection status:  (0x82) Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
-Self-test execution status:      (   0) The previous self-test routine completed
-                                        without error or no self-test has ever
-                                        been run.
+Self-test execution status:      ( 249) Self-test routine in progress...
+                                        90% of test remaining.
 Total time to complete Offline
 data collection:                 ( 1800) seconds.
 Offline data collection
@@ -73,7 +72,7 @@ ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
 184 End-to-End_Error        PO--CK   100   100   097    -    0
 187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    5
 188 Command_Timeout         -O--CK   100   100   000    -    0
-190 Airflow_Temperature_Cel -O---K   062   045   040    -    38
+190 Airflow_Temperature_Cel -O---K   059   045   040    -    41
 196 Reallocated_Event_Count -O--CK   253   253   001    -    0
 198 Offline_Uncorrectable   ----CK   100   100   001    -    0
 199 CRC_Error_Count         -O---K   100   100   001    -    0
@@ -213,13 +212,13 @@ Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA
 
 Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
 SMART Selective self-test log data structure revision number 1
- SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
-    1        0       10  Not_testing
-    2        0        0  Not_testing
-    3        0        0  Not_testing
-    4        0        0  Not_testing
-    5        0        0  Not_testing
-  255        0    65535  Read_scanning was completed without error
+ SPAN    MIN_LBA    MAX_LBA  CURRENT_TEST_STATUS
+    1          0         10  Not_testing
+    2          0          0  Not_testing
+    3          0          0  Not_testing
+    4          0          0  Not_testing
+    5          0          0  Not_testing
+  255  247475376  247540911  Read_scanning was completed without error
 Selective self-test flags (0x0):
   After scanning selected spans, do NOT read-scan remainder of disk.
 If Selective self-test is pending on power-up, resume after 0 minute delay.
@@ -229,11 +228,10 @@ SCT Commands not supported
 Device Statistics (GP Log 0x04)
 Page  Offset Size        Value Flags Description
 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
-0x01  0x018  6      9158167139  ---  Logical Sectors Written
-0x01  0x028  6      9286355418  ---  Logical Sectors Read
-0x01  0x1f8  7               0  NDC+ Unknown
+0x01  0x018  6      9158258099  ---  Logical Sectors Written
+0x01  0x028  6      9286355426  ---  Logical Sectors Read
 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
-0x05  0x008  1              38  ---  Current Temperature
+0x05  0x008  1              41  ---  Current Temperature
 0x05  0x020  1              50  ---  Highest Temperature
 0x05  0x028  1              26  ---  Lowest Temperature
                                 |||_ C monitored condition met
@@ -252,7 +250,7 @@ ID      Size     Value  Description
 0x0006  2            0  R_ERR response for device-to-host non-data FIS
 0x0007  2            0  R_ERR response for host-to-device non-data FIS
 0x0008  2            0  Device-to-host non-data FIS retries
-0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
+0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
 0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
 0x000b  2            0  CRC errors within host-to-device FIS
 0x000d  2            0  Non-CRC errors within host-to-device FIS
comment:11 by , 3 weeks ago
I had the same problem yesterday in Debian 12 stable (bookworm) running kernel
$ uname -a
Linux XXXXXXXXXXXXXXXXXXX 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
smartctl output is:
...<snip>
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 1TB
...<snip>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
...<snip>
# 6  Extended offline    Aborted by host               00%        13         -
...<snip>
However, I started smartctl via the GUI GSmartControl, so the problem could be related to the GUI. In the next few days I will run the same extended test again directly with the command-line program smartctl. I hope the test will then run without stopping at 90%. If it does, the cause is GSmartControl rather than smartctl; if not, the problem lies in smartctl or somewhere in the kernel.
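To compare the GUI run with a command-line run, the self-test log printed by smartctl can be scanned for aborted entries. This is only a sketch: it assumes the log layout quoted in this ticket (columns separated by two or more spaces), and the names `parse_selftest_log` and `aborted_tests` are made up for illustration.

```python
import re

# Matches rows such as:
# # 6  Extended offline    Aborted by host               00%        13         -
ROW_RE = re.compile(
    r"^#[ \t]*(\d+)[ \t]+(.+?)[ \t]{2,}(.+?)[ \t]{2,}"
    r"(\d+)%[ \t]+(\d+)[ \t]+(\S+)[ \t]*$",
    re.MULTILINE,
)


def parse_selftest_log(text):
    """Turn self-test log rows into a list of dicts."""
    return [
        {
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": lba,
        }
        for num, test, status, remaining, hours, lba in ROW_RE.findall(text)
    ]


def aborted_tests(rows):
    """Keep rows whose status reads 'Aborted ...' or 'Interrupted ...'."""
    return [r for r in rows if r["status"].startswith(("Aborted", "Interrupted"))]


# Sample rows copied from the outputs quoted in this ticket.
sample = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4987         -
# 6  Extended offline    Aborted by host               00%        13         -
"""

rows = parse_selftest_log(sample)
for row in aborted_tests(rows):
    print(row["num"], row["test"], "->", row["status"])
```

Running it against the log after a GUI-started test and again after a test started with smartctl directly would show whether both runs end up aborted.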
With short self-tests, the time is usually too short for the bug to be reliably triggered.
... 00% "completed without error" are Windows results.