Opened 4 years ago
Last modified 4 years ago
#1443 new enhancement
[smartd] fix for otherwise always aborted selftests
Reported by: | Ch.Ris | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | undecided |
Component: | smartd | Version: | |
Keywords: | smartd.conf | Cc: |
Description
Bug:
smartd is often unable to complete scheduled selftests on temporarily idle disks (especially at night, when selftests should run).
This often happens because external disk adapters abort the test with power saving features, but it is also the case when the OS or a utility like hd-idle is configured to issue spin-down commands on idle disks when the disk's own power saving features are not available or accessible.
FAQ item:
https://www.smartmontools.org/wiki/FAQ#Whyarelongself-testskeepgettinginterrupted
Problems:
- Most often the adapter's power saving features can neither be configured nor disabled.
- It's not trivial to externally generate artificial disk activity or temporarily disable OS power saving features to allow the smartd scheduled selftests to finish.
Proposed solution:
Make the smartd scheduled selftest work by letting smartd poll some disk information after triggering the selftest, for as long as the selftest is running.
Proof of concept:
It's a working workaround to start a test manually and then "smartctl -a" in a while loop:
while smartctl -a /dev/sdX | grep -q "in progress" ; do sleep 60; done
Change History (24)
comment:1 by , 4 years ago
Keywords: | smartd.conf added |
---|---|
Milestone: | → unscheduled |
Priority: | major → minor |
Type: | defect → enhancement |
comment:2 by , 4 years ago
Identified some problems and a solution.
I could test with a directly connected sata drive under hd-idle power management and one connected to a VIA usb3-to-sata adapter (all x86_64).
In the end I found that my initial "proof of concept" and all the other smartctl keep-awake command candidates only worked by accident: when the selftest was started while the disk was already in standby, and there was absolutely no other disk activity (smartctl readouts don't count) during the entire selftest. The power management would then simply not issue a new power saving command during the selftest, assuming the disk to be spun down all the time.
(Additional irritation was caused by the Seagate Firecuda drive, as it does not seem to actually always have to spin-up when going from standby into active, due to its SSD cache.)
But here are the conclusions about what worked:
- Only keep-awake commands doing disk writes worked reliably for all setups, i.e. simply running touch $MOUNTPOINT/smartd-selftest-keep-awake.file (reads could come from a buffer and not reach the adapter).
- The first such keep-awake command MUST be executed before starting the selftest, to ensure the disk is active and the host power management won't spin down the disk and stop the selftest before smartd's first keep-awake.
- Of the given smartctl commands, I only saw info about the still running selftest in the output of "smartctl -a" (needed to properly stop the keep-awakes after the selftest is done). So smartd would probably have to poll the drive for the selftest progress at the shorter interval, and issue a separate keep-awake command as long as the selftest is still running.
- I guess the implementation could be the same for ATA/SATA/SCSI/SAS etc., but it may need a keep-awake "mountpoint" or "touchpoint" path in smartd.conf, or is that detectable?
=> Is there already a way (command) to start smartd to do a single selftest on demand?
It would be a great advantage to have a reliable way to start selftests that don't abort due to adapter or OS power management.
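The conclusions in this comment can be sketched as a small wrapper script. This is only an illustration, not something smartd or smartctl provides: the function names `selftest_running` and `keep_awake_selftest`, the argument order and the 60 second default are assumptions made for the example. It writes to the filesystem once before starting the test, then repeats the write until "smartctl -a" no longer reports a test in progress.

```shell
#!/bin/sh
# Sketch only: keep a disk awake with small writes while a SMART selftest runs.
# Function names and arguments are hypothetical, not smartd/smartctl features.

# selftest_running DEVICE: succeeds while "smartctl -a" reports a test in progress
selftest_running() {
    smartctl -a "$1" | grep -q "in progress"
}

# keep_awake_selftest DEVICE MOUNTPOINT [INTERVAL_SECONDS]
keep_awake_selftest() {
    device=$1
    keepfile=$2/smartd-selftest-keep-awake.file
    interval=${3:-60}

    # The first write happens BEFORE the test starts, so the disk is active
    # and the power management timer has just been reset.
    touch "$keepfile" && sync || return 1

    # Start the long selftest (same as --test=long)
    smartctl -t long "$device" > /dev/null || return 1

    # Repeat the write until the selftest is no longer reported as running
    while selftest_running "$device"; do
        sleep "$interval"
        touch "$keepfile"
        sync
    done

    rm -f "$keepfile"
}
```

A call could look like `keep_awake_selftest /dev/sdX /srv/data 60`, with the interval chosen shorter than the spin-down timer of the adapter or OS.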
comment:3 by , 4 years ago
I am thinking that maybe it's a good idea to add a flag to the -c command which would exit when the self-test is complete and poll once per interval otherwise. This should fix this problem.
Writing something in smartctl is a very bad idea, as this tool does not work with filesystems and can be used with non-mounted drives as well.
comment:4 by , 4 years ago
Also, tbh, from my experience, having disks stop is always an evil. I did a few tests, and disks with hd-idle have been dying much faster compared to drives that are always running. Of course your mileage may vary.
comment:5 by , 4 years ago
Hello
it seems the captive tests are having their own problems, though:
https://www.smartmontools.org/ticket/303
https://www.smartmontools.org/ticket/1153
With devices that are not mounted the problem may be even more apparent. Maybe there is still a better keep-awake command to solve this.
Without any disk activity, the background selftests would get aborted even more regularly when an adapter is used or OS power management is active (common default).
But if the device is not mounted, one could e.g. do a small MBR or GPT flag or label adjustment as keep-awake command. Without any partition table one could use dd.
I think keep-awakes are needed to allow testing of temporarily attached devices and idling backup or spare drives, not so much for always-on servers or busy system drives.
comment:6 by , 4 years ago
I was not referring to the captive test (-C) (which is more or less broken on most Linux systems) but to "-c", which shows self-test progress. Adding a flag to wait until the test is completed, polling once per interval, could be an option. As for writing _anything_ to disk: sorry, no. However, smartctl has JSON output, so nothing should stop you from scripting this if needed.
comment:7 by , 4 years ago
P.S. I am not sure if there is any "one size fits all" solution here at all. As controller power saving logic is very different, one may require some writes, another could be happy with any smart command, etc. So maybe it's best to leave it as is.
comment:8 by , 4 years ago
You can do something like
smartctl /dev/disk0 -c --json |jq .ata_smart_data.self_test.status.value
and do whatever is needed if value > 0. Could be added in cron task, for example. Together with "-n" switch you can avoid excessive wake-ups.
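As a rough sketch of such a script: the functions below wrap the jq query from this comment and wait until the self-test is no longer running. The function names are made up for the example. Instead of testing for value > 0, the sketch assumes the ATA convention that the upper four bits of the self-test execution status byte are 0xF while a test is in progress, since other non-zero values can also describe an aborted or failed test; treat that as an assumption to verify against your drive's output.

```shell
#!/bin/sh
# Sketch only: poll the self-test status via smartctl's JSON output.
# Function names are hypothetical.

# status_in_progress VALUE: true if the upper four bits of the status byte are 0xF
status_in_progress() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, "null" or otherwise non-numeric
    esac
    [ $(( $1 >> 4 )) -eq 15 ]
}

# selftest_status_value DEVICE: prints the raw status value reported by "smartctl -c"
selftest_status_value() {
    smartctl -c --json "$1" | jq '.ata_smart_data.self_test.status.value'
}

# wait_for_selftest DEVICE [INTERVAL_SECONDS]: returns once no test is in progress
wait_for_selftest() {
    while status_in_progress "$(selftest_status_value "$1")"; do
        sleep "${2:-60}"
    done
}
```

For example, `wait_for_selftest /dev/disk0 300 && echo "self-test finished"` could run from a cron job or a wrapper script.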
comment:9 by , 4 years ago
It depends much on the hardware. On many laptops, other low-power devices and enclosures the selftest options are not usable as is.
As controller power saving logic is very different, one may require some writes, another could be happy with any smart command, etc.
That's a valid point. An option to do minimal meta-data updates should prevent a spin-down for all devices, though. (And also be safe for production-grade filesystems/partition-tables/unallocated-space on devices one wants to run internal tests on anyway.)
A straightforward --keep-awake option and a --keep-awake-command customization option would be a great help to make it work for the affected cases.
comment:10 by , 4 years ago
Err, maybe block device reading could work as keep-awake with the proper direct/sync/no-buffer I/O control function or option?
Does anyone know about those?
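One candidate for such a read, as an untested sketch: GNU dd can open the block device with O_DIRECT via `iflag=direct`, which bypasses the kernel page cache. The function name `keep_awake_read` and the optional block number are assumptions for the example. Whether such a read actually prevents a spin-down is not established here; the drive or adapter may still answer from its own cache.

```shell
#!/bin/sh
# Sketch only: read one 4 KiB block directly from the device, bypassing the
# kernel page cache. The function name and arguments are hypothetical.

# keep_awake_read DEVICE [BLOCK_NUMBER]
keep_awake_read() {
    dd if="$1" of=/dev/null bs=4096 count=1 skip="${2:-0}" iflag=direct 2> /dev/null
}
```

It could be called as `keep_awake_read /dev/sdX 0` in place of the touch command when the device is not mounted.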
comment:11 by , 4 years ago
To ease the tedious keep-awake command testing a bit I was using this script btw:
#!/bin/bash
#
# selftest-keep-awake-test.sh
# (uses bash features: "function", "echo -e", "read -t/-n")

LOG=selftest-keep-awake-test.log
DEVICES="/dev/sdb /dev/sdc"
DEVICE_MOUNTPOINTS="/srv/data"

# minutes to wait before asking user for disk spinning state evaluation
# set it longer than disk spin down timers
CHECK_TIME=5

# keep-awake commands to test
TYPES="sat"

# not working
TYPED_KEEP_AWAKE_COMMANDS="
smartctl -d \$TYPE -n never \$DEVICE
smartctl -d \$TYPE -i \$DEVICE
smartctl -d \$TYPE -H \$DEVICE
smartctl -d \$TYPE -c \$DEVICE"
###################################################### overriding above (not working) again
TYPED_KEEP_AWAKE_COMMANDS=""

# use the first mountpoint for the keep-awake file
for MOUNTPOINT in $DEVICE_MOUNTPOINTS; do break; done

# not working
SIMPLE_KEEP_AWAKE_COMMANDS="# (noop for comparison)
smartctl -a \$DEVICE > /dev/null
smartctl -d scsi -i \$DEVICE"
###################################################### overriding above (not working) again
SIMPLE_KEEP_AWAKE_COMMANDS="touch \$MOUNTPOINT/selftest-keep-alive-test.file"

function write_to_devices()
{
  for MOUNT_POINT in $DEVICE_MOUNTPOINTS
  do
    echo -e "\n * Ensure consistent active device state by creating and removing empty file in $MOUNT_POINT"
    touch $MOUNT_POINT/selftest-keep-alive-test.file
    sync
    rm $MOUNT_POINT/selftest-keep-alive-test.*
  done
}

function finish()
{
  # stop remaining selftests
  for DEVICE in $DEVICES
  do
    CLEANUP_COMMANDS="smartctl --abort \$DEVICE"
    echo -e "\n * $DEVICE cleanup: $CLEANUP_COMMANDS"
    eval $CLEANUP_COMMANDS
  done
  # Write to the filesystems of the devices,
  # to ensure that host power management will actively spin down the disk again afterwards,
  # i.e. if it had already been switched to standby state before the selftest.
  write_to_devices
  exit
}

[ "$1" = "finish" ] && finish

# init log
for DEVICE in $DEVICES
do
  echo "${DEVICE} $(smartctl -i $DEVICE | grep Family) ($(uname -m))" >> $LOG
done

# Access filesystems on the device
# Need to ensure devices are actually active before the keep-awake test.
# Otherwise the power management of the OS or adapter won't spin-down drives anyway...
#  * not during selftest
#  * and not until after a regular drive access has happened (if ever on a mostly idle backup or spare drive)
# Writing is needed to avoid only reading buffered data without any adapter activity.
write_to_devices

echo -e "\n * Hear, devices should now be spinning idle, if not SSD cached."
beep
sleep 15
beep

echo -e "\n * Starting selftests for keep-alive testing:"
for TYPE in $TYPES
do
  prev_IFS=$IFS
  # Split the command lists on newlines only, so that each line is one command
  IFS='
'
  for COMMAND in $(printf '%s\n%s\n' "$SIMPLE_KEEP_AWAKE_COMMANDS" "$TYPED_KEEP_AWAKE_COMMANDS")
  do
    IFS=$prev_IFS
    i=0
    run="true"
    for DEVICE in $DEVICES
    do
      INIT_COMMANDS="smartctl --abort \$DEVICE > /dev/null; smartctl --test=long \$DEVICE > /dev/null"
      echo -e "\n-------- Starting new long selftest on $DEVICE.\n"
      eval $INIT_COMMANDS
    done
    while [ "$run" = "true" ]
    do
      echo " -- At minute: $i"
      for DEVICE in $DEVICES
      do
        echo -e "\n * $DEVICE command: $COMMAND"
        eval $COMMAND
      done
      if [ "$i" -lt "$CHECK_TIME" ]
      then
        echo -n "<Press any letter key to stop keep-awake testing.>"
        read -t 60 -n 1 INPUT
        if [ "${INPUT:-undefined}" != "undefined" ]
        then
          echo -e '\r exiting... \n'
          finish
        fi
        echo -e '\r \n'
      else
        beep
        echo -ne "\n -> Enter list of halted devices (from \"$DEVICES\")\n or \"no\" if all are still spinning:"
        read -t 60 INPUT
        DEVICE="DEVICE"
        if [ "${INPUT:-undefined}" != "undefined" ]
        then
          echo "$COMMAND (aborted tests: $INPUT)" >> $LOG
          run="false"
        fi
      fi
      i=$((i+1))
    done
  done
done
finish
comment:12 by , 4 years ago
I'm not willing to add much complexity to smartd for this topic. A possible solution would be: First implement device specific check intervals (#336). Then add an extension to the (currently ATA-only) -l selfteststs directive like -l selfteststs,SECONDS. If specified, the check interval for an individual device is reduced to SECONDS if a self-test is in progress.
It could easily be tested whether this would prevent spin down of a device by running smartd with the -i SECONDS option and starting a self-test.
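The suggested test could be scripted roughly as follows. This is a sketch under assumptions: the function names are invented for the example, smartd is started in the foreground with -n (--no-fork) so it can be stopped again afterwards, the default of 30 seconds is arbitrary, and it needs to run as root with no other smartd instance active. The last line prints the newest entry of the self-test log, which shows whether the test completed or was aborted.

```shell
#!/bin/sh
# Sketch only: check whether a short smartd check interval alone keeps a
# self-test from being aborted. Function names are hypothetical.

# latest_selftest_result DEVICE: prints the newest entry ("# 1") of the self-test log
latest_selftest_result() {
    smartctl -l selftest "$1" | grep '^# *1 '
}

# interval_test DEVICE [INTERVAL_SECONDS]
interval_test() {
    device=$1
    interval=${2:-30}

    # Run smartd in the foreground with a short check interval
    smartd -n -i "$interval" > /dev/null 2>&1 &
    smartd_pid=$!

    smartctl -t long "$device" > /dev/null

    # Wait until the self-test is no longer reported as running
    while smartctl -a "$device" | grep -q "in progress"; do
        sleep "$interval"
    done

    kill "$smartd_pid" 2> /dev/null || true
    latest_selftest_result "$device"
}
```

Running `interval_test /dev/sdX 30` and reading the printed log line shows the outcome for that device and interval.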
comment:13 by , 4 years ago
I like your idea, that syntax also makes it much more consistent.
Could you see an optional addition like -l selftest,SECONDS,KEEP-AWAKE-COMMAND?
Specifying a command to execute could make it work with whatever sync-read or touch a particular device may need.
comment:14 by , 4 years ago
It would even nicely allow for this:
I had wondered about a good way to schedule trim and scrub without colliding with selftests.
The keep-awake-command could be a script that, in addition to keeping the disk awake, starts a timer to start an fstrim after a short selftest and a full scrub after a long selftest, waiting interval+x seconds (and renewing that timer as long as the selftest is still running).
comment:15 by , 4 years ago
Before requesting further complex enhancements, please perform the test suggested in comment 12: Does some small -i SECONDS setting in the smartd command line prevent spin down during self-test?
comment:16 by , 4 years ago
Oh, sorry, only now I realize you were still wondering if smartd could work as is, with just a reduced interval, i.e. without calling an explicit keep-awake action in the intervals during the selftests.
My fault, I tested that too and should have elaborated on my conclusions.
None of the smartd or smartctl actions wake the disks up or keep them awake here. And the selftests were also aborted while running smartd -i 30. I think that is not unfortunate, because otherwise current smartd would prevent power saving.
The disks I tested here don't need to spin to answer the smartctl (and smartd) selftest queries. I guess that could depend on internal SSD storage. Some older disks may have to spin up to return the selftest history, for example, but certainly not these.
That's why I had concluded a more specific and effective keep-awake action would be needed anyway, to allow the selftests to complete on modern (non-24x7-server) hardware.
If we can find some direct-sync-io read function that wakes up all drives, that could be nice to use as default and be called from smartd.
But lacking that, the least complex solution seems to only execute an externally specified keep-awake command during selftests.
comment:17 by , 4 years ago
Milestone: | unscheduled → undecided |
---|
Some disks spin up (otherwise -n standby wouldn't exist), others don't. Leaving ticket open as undecided for now.
comment:18 by , 4 years ago
Ok thanks, I can recommend a dock with a nice new usb backup disk connected, to make it itch. ;-)
comment:19 by , 4 years ago
For those affected, maybe try risking a device specific workaround:
In my setup, it seems that the "power-on in standby" feature (hdparm's "VERY DANGEROUS" -s option) alters the system's runtime behaviour in such a way that smartd -i 60 interval checks are enough to prevent the selftests from getting aborted, even with only a basic smartd.conf (DEVICESCAN -H -l error -f).
No idea why it seems to work, maybe the kernel prepends a "wake-up" command to the smartd query or something like that.
As it's not a boot device I didn't worry about BIOS support for "power-on in standby" drives.
comment:20 by , 4 years ago
Actually this is "VERY DANGEROUS" unless all systems you want to connect this drive to are able to identify and then spin up drives with "power-on in standby" enabled. This is possibly the case for most RAID controller firmware and possibly not for most regular PC BIOS.
comment:21 by , 4 years ago
Would you have a pointer to why, or what could happen? I could not find info. I notice the (data) drives (1x sata, 1x usb-to-sata) now only spin up later during boot (when getting mounted). The BIOS detects them OK and does not need to boot from them.
comment:22 by , 4 years ago
AFAICS from ATA ACS-4, full support of the Power-Up In Standby (PUIS) feature set requires that the BIOS or controller firmware properly handles these conditions:
If the PUIS feature set enabled bit is set (by command or jumper), an IDENTIFY DEVICE command must not spin up the disk. A drive may have only tiny firmware in ROM which later boots the actual firmware from the disk. Therefore the drive may return incomplete IDENTIFY information, indicated by the Response incomplete bit. This information may be mostly empty; in particular, it may report no model name and zero drive capacity. Then the BIOS must repeat the IDENTIFY DEVICE command later, after spin-up.
A drive may not spin up on I/O or other commands. This is indicated by the SET FEATURES subcommand required to spin-up bit. Then the BIOS must issue the SET FEATURES: PUIS feature set device spin-up command to spin up the drive. Otherwise the drive will be virtually bricked until connected to a controller with a PUIS compatible BIOS.
See smartctl --identify=wb ... for the mentioned bits.
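To pick the relevant lines out of that output, a filter along these lines may help. It is only a sketch: the function name `puis_bits` is made up, and the grep patterns are simply taken from the bit names quoted in this comment, so they may need adjusting to the exact wording your smartctl version prints.

```shell
#!/bin/sh
# Sketch only: show the IDENTIFY DEVICE lines related to Power-Up In Standby.
# The function name is hypothetical; patterns follow the bit names quoted above.

# puis_bits DEVICE
puis_bits() {
    smartctl --identify=wb "$1" |
        grep -i -e 'PUIS' -e 'Response incomplete' -e 'required to spin-up'
}
```

For example, `puis_bits /dev/sdX` before and after changing the setting shows which of the bits differ.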
comment:23 by , 4 years ago
Thank you, that was the technical information I was missing.
Seems to confirm my assumption that it's not dangerous for the drive or the data itself. It may just not spin up to allow booting with an incompatible BIOS, or not be mountable after a device scan with an old incompatible OS (Linux < 2.6.22).
Makes sense. Would require a new smartd.conf directive, an additional internal time interval independent of the -i option and separate implementations for ATA/SATA and (possibly later) SCSI/SAS.
Please report details about your proof of concept (platform, adapter, disk, ...).
Please test which of the following smartctl commands are sufficient (or not) to prevent the spin down:
Make sure to disable device type auto-detection by specifying the appropriate device TYPE (sat, usb*) or scsi in the last testcase. Without -d, additional commands (e.g. SCSI INQUIRY) may be issued and may produce misleading results.