Opened 5 years ago

Last modified 2 months ago

#1218 new defect

Self tests pass under Windows 10 1809 but get stuck at 90% under Linux

Reported by: Artem S. Tashkinov
Owned by:
Priority: minor
Milestone: undecided
Component: smartctl
Version: 7.0
Keywords: ata
Cc:

Description

I've got a strange issue with my SSD: all self-tests pass under Windows 10 1809 but get stuck at 90% under Linux, no matter how long I wait for them to complete. I barely have any disk activity, so there's no reason for the tests to get stuck.

Under both OSes I use the same smartmontools version - 7.0 release.

In the attached log, the 90% "aborted by host" entries are the Linux results. 00% completed ...

Attachments (1)

smartmontools-x.log (13.8 KB) - added by Artem S. Tashkinov 5 years ago.

Change History (12)

by Artem S. Tashkinov, 5 years ago

Attachment: smartmontools-x.log added

comment:1 by Artem S. Tashkinov, 5 years ago

... 00% "completed without error" are Windows results.

comment:2 by Artem S. Tashkinov, 5 years ago

This has been happening for quite some time under Linux kernels 4.18, 4.19, 5.0 and 5.1. Haven't tested Linux 5.2 because it hasn't been made available in Fedora 30 yet. I'm not sure any previous kernels worked at all.

This looks like a weird bug in the Linux kernel SATA layer but I'm an absolute newbie in this area, so it's just a wild guess.

comment:3 by Artem S. Tashkinov, 5 years ago

Also, I'm curious why I see so many errors.

comment:4 by Christian Franke, 5 years ago

Component: all → smartctl
Keywords: ata added
Milestone: undecided

> ... I barely have any disk activity, so there's no reason for the tests to get stuck.

The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.

comment:5 by Christian Franke, 5 years ago

> Also, I'm curious why I see so many errors.

This is unrelated to the above.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   092   092   001    -    4987
...
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    5
...
196 Reallocated_Event_Count -O--CK   253   253   001    -    0
198 Offline_Uncorrectable   ----CK   100   100   001    -    0

No reallocated sectors, no sectors pending for reallocation, ...

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
...
Error 5 [0] occurred at disk power-on lifetime: 3665 hours (152 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 40 00 00 00 00 00 14 40 08     00:02:31.000  READ FPDMA QUEUED
...
Error 4 [3] occurred at disk power-on lifetime: 3525 hours (146 days + 21 hours)
...
  40 -- 51 00 20 00 00 00 00 00 10 40 00  Error: UNC at LBA = 0x00000010 = 16
...
Error 3 [2] occurred at disk power-on lifetime: 3483 hours (145 days + 3 hours)
...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
...
Error 2 [1] occurred at disk power-on lifetime: 3348 hours (139 days + 12 hours)
...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
...

... very few transient read errors on LBA 16 and 20 occurred more than 1300 power on hours ago, ...

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4987         -

... but a recent extended test succeeded. Conclusion: No sign of trouble. The possible weak sectors have been fixed by new write commands.
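The "more than 1300 power on hours ago" estimate can be checked from the numbers quoted above: the Power_On_Hours raw value (4987) minus the power-on lifetime recorded for each error log entry. A minimal Python sketch of that arithmetic; the variable names are illustrative and not part of smartctl:

```python
# Power_On_Hours raw value from the attribute table quoted above.
current_power_on_hours = 4987

# Power-on lifetime (hours) of each entry in the extended error log excerpt,
# keyed by the error number shown in the log.
error_hours = {5: 3665, 4: 3525, 3: 3483, 2: 3348}

# Age of each error, measured in power-on hours before the current reading.
error_age = {num: current_power_on_hours - hours for num, hours in error_hours.items()}

for num, age in sorted(error_age.items(), reverse=True):
    print(f"Error {num}: {age} power-on hours ago")
```

The most recent entry (Error 5) comes out at 1322 power-on hours before the current reading, and the older entries are further back still.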

PS: This is a bug tracker, not a support forum. For future support questions, please use the smartmontools-support mailing list instead.

comment:6 by Artem S. Tashkinov, 5 years ago

> The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.

This is definitely not the case. It's a system disk, so there's some activity.

What I meant by "next to no activity" is that I'm not copying huge files between partitions or anything like that.

comment:7 by Christian Franke, 5 years ago

Check the kernel log for any disk-related messages that occur during the self-test.

Check these counters before and after self-test:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET

If there is any increase, there must have been an event which resets the SATA link or the device. This typically aborts any self-test.
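The before/after comparison described above could be scripted roughly as follows. This is only a sketch: it assumes the two `smartctl -x` outputs have already been saved as text, the function names `parse_phy_counters` and `increased_counters` are made up for illustration, and the sample values are invented rather than taken from a real drive.

```python
import re

# Matches counter rows such as:
# "0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy"
_ROW = re.compile(r"^\s*(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_phy_counters(smartctl_output: str) -> dict:
    """Return {counter_id: (value, description)} from the Phy Event Counters section."""
    counters = {}
    in_section = False
    for line in smartctl_output.splitlines():
        if line.startswith("SATA Phy Event Counters"):
            in_section = True
            continue
        if not in_section:
            continue
        match = _ROW.match(line)
        if match:
            counter_id, _size, value, description = match.groups()
            counters[counter_id] = (int(value), description)
        elif counters and not line.strip():
            break  # a blank line after the rows ends the section
    return counters


def increased_counters(before: str, after: str) -> dict:
    """Return {counter_id: (old, new, description)} for counters that went up."""
    old = parse_phy_counters(before)
    new = parse_phy_counters(after)
    return {
        cid: (old[cid][0], value, desc)
        for cid, (value, desc) in new.items()
        if cid in old and value > old[cid][0]
    }


# Hypothetical sample input, for illustration only.
BEFORE = """SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
"""
AFTER = BEFORE.replace(
    "0x0009  2            2", "0x0009  2            3"
)

for cid, (was, now, desc) in increased_counters(BEFORE, AFTER).items():
    print(f"{cid}: {was} -> {now}  {desc}")
```

Any counter reported by this comparison would point to a link or device reset between the two snapshots.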

comment:8 by Artem S. Tashkinov, 3 years ago

The issue persists even with kernel 5.16.19.

I have no idea what's going on. Nothing is logged in dmesg, and all the tests invariably fail with "Interrupted (host reset)", even if I run smartctl -a hours after running smartctl -t long or smartctl -t short.

It's as if tests immediately fail when launched under Linux.

comment:9 by Artem S. Tashkinov, 3 years ago

Below is a diff of the smartctl -x output taken before and after running smartctl -t short /dev/sda, which should have completed in 2 minutes. The second smartctl -x was run 5 minutes after starting the short self-test. The test was aborted at 90%.

    diff --git a/before.txt b/after.txt
    index 4f3d22c..d2d7522 100644
    --- a/before.txt
    +++ b/after.txt
    @@ -32,9 +32,8 @@ General SMART Values:
     Offline data collection status:  (0x82) Offline data collection activity
                                             was completed without error.
                                             Auto Offline Data Collection: Enabled.
    -Self-test execution status:      (   0) The previous self-test routine completed
    -                                        without error or no self-test has ever
    -                                        been run.
    +Self-test execution status:      ( 249) Self-test routine in progress...
    +                                        90% of test remaining.
     Total time to complete Offline
     data collection:                ( 1800) seconds.
     Offline data collection
    @@ -73,7 +72,7 @@ ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
     184 End-to-End_Error        PO--CK   100   100   097    -    0
     187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    5
     188 Command_Timeout         -O--CK   100   100   000    -    0
    -190 Airflow_Temperature_Cel -O---K   062   045   040    -    38
    +190 Airflow_Temperature_Cel -O---K   059   045   040    -    41
     196 Reallocated_Event_Count -O--CK   253   253   001    -    0
     198 Offline_Uncorrectable   ----CK   100   100   001    -    0
     199 CRC_Error_Count         -O---K   100   100   001    -    0
    @@ -213,13 +212,13 @@ Num Test_Description Status Remaining LifeTime(hours) LBA
     
     Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
     SMART Selective self-test log data structure revision number 1
    - SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    -    1        0       10  Not_testing
    -    2        0        0  Not_testing
    -    3        0        0  Not_testing
    -    4        0        0  Not_testing
    -    5        0        0  Not_testing
    -  255        0    65535  Read_scanning was completed without error
    + SPAN    MIN_LBA    MAX_LBA  CURRENT_TEST_STATUS
    +    1          0         10  Not_testing
    +    2          0          0  Not_testing
    +    3          0          0  Not_testing
    +    4          0          0  Not_testing
    +    5          0          0  Not_testing
    +  255  247475376  247540911  Read_scanning was completed without error
     Selective self-test flags (0x0):
       After scanning selected spans, do NOT read-scan remainder of disk.
     If Selective self-test is pending on power-up, resume after 0 minute delay.
    @@ -229,11 +228,10 @@ SCT Commands not supported
     Device Statistics (GP Log 0x04)
     Page  Offset Size        Value Flags Description
     0x01  =====  =               =  ===  == General Statistics (rev 2) ==
    -0x01  0x018  6      9158167139  ---  Logical Sectors Written
    -0x01  0x028  6      9286355418  ---  Logical Sectors Read
    -0x01  0x1f8  7               0  NDC+ Unknown
    +0x01  0x018  6      9158258099  ---  Logical Sectors Written
    +0x01  0x028  6      9286355426  ---  Logical Sectors Read
     0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
    -0x05  0x008  1              38  ---  Current Temperature
    +0x05  0x008  1              41  ---  Current Temperature
     0x05  0x020  1              50  ---  Highest Temperature
     0x05  0x028  1              26  ---  Lowest Temperature
                                     |||_ C monitored condition met
    @@ -252,7 +250,7 @@ ID Size Value Description
     0x0006  2            0  R_ERR response for device-to-host non-data FIS
     0x0007  2            0  R_ERR response for host-to-device non-data FIS
     0x0008  2            0  Device-to-host non-data FIS retries
    -0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
    +0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
     0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
     0x000b  2            0  CRC errors within host-to-device FIS
     0x000d  2            0  Non-CRC errors within host-to-device FIS
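The two "Self-test execution status" values in this diff (0 and 249) can be read as a single status byte. The sketch below assumes the usual ATA SMART layout, where the upper four bits hold a status code and the lower four bits hold the remaining part of the test in units of 10%. The helper name `decode_self_test_status` is made up for illustration, and only a subset of status codes is named.

```python
# Status codes held in the upper nibble of the self-test execution status
# byte (assumed ATA SMART layout; only a few codes are named here).
STATUS_NAMES = {
    0x0: "completed without error or never run",
    0x1: "aborted by host",
    0x2: "interrupted by host reset",
    0x3: "fatal or unknown error",
    0xF: "in progress",
}


def decode_self_test_status(status_byte: int) -> tuple:
    """Split the status byte into (status description, percent remaining)."""
    code = status_byte >> 4                     # upper nibble: status code
    percent_remaining = (status_byte & 0x0F) * 10  # lower nibble: tens of percent
    return STATUS_NAMES.get(code, f"other (code {code})"), percent_remaining


for value in (0, 249):
    description, remaining = decode_self_test_status(value)
    print(f"{value:3d}: {description}, {remaining}% remaining")
```

Under that assumption, 249 (0xF9) decodes to "in progress" with 90% remaining, which matches the text smartctl prints next to it in the diff.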

comment:10 by Artem S. Tashkinov, 3 years ago

There's absolutely nothing in dmesg.

comment:11 by asm, 2 months ago

I had the same problem yesterday on Debian 12 stable (bookworm), running this kernel:

$ uname -a
Linux XXXXXXXXXXXXXXXXXXX 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux

smartctl output is:

...<snip>
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 1TB
...<snip>
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
...<snip>
# 6  Extended offline    Aborted by host               00%        13         -
...<snip>

However, I started smartctl via the GUI GSmartControl, so the problem could be related to the GUI. In the next few days I will run the same extended tests again directly with the command-line program smartctl, and I hope that the test will then run without stopping at 90%. If it does, the cause is GSmartControl and not smartctl; if not, the problem is in smartctl or somewhere in the kernel.

With short self-tests, the time is usually too short for the bug to be reliably triggered.
