This Patch set changes the logic of the "mdcheck" tool and the related systemd services mdcheck_start.service and mdcheck_continue.service. The set and related discussion are also posted as GitHub PR [1]. The set is meant to be applied on top of PR#189 [2], which has already been merged in the current main branch on GitHub. [1] https://github.com/md-raid-utilities/mdadm/pull/190 [2] https://github.com/md-raid-utilities/mdadm/pull/189 The current behavior is like this: * mdcheck without arguments starts a RAID check on all arrays on the system, starting at position 0. This is started from mdcheck_start.service, started by a systemd timer once a month. * mdcheck --continue looks for files /var/lib/mdcheck/MD_UUID_$UUID, reads the start position from them, and starts a check from that position on the array with the respective UUID. This is started from a systemd timer every night. In either case, mdcheck won't do anything if the kernel is already running a sync_action on a given array. The check runs for a given period of time (default 6h) and saves the last position in the MD_UUID file, to be taken up when mdcheck --continue is called next time. When the entire array has been checked, the MD_UUID_ file is deleted. When all checks are finished, mdcheck_continue.timer is stopped, to be restarted when mdcheck_start.timer expires next time. Before the recent commit 8aa4ea9 ("systemd: start mdcheck_continue.timer before mdcheck_start.timer"), this could lead to a race condition when the check for a given array didn't complete throughout the month. mdcheck_start.service would start and reset the check position to 0 before mdcheck_continue.service could pick up at the last saved position. Commit 8aa4ea9 works around this by starting mdcheck_continue.service a few minutes before mdcheck_start.timer. Yet the general problem still exists: both services trigger checks on the kernel's part which they can only passively monitor. So if a user plays with the timer settings (which he is in his rights to do), another similar race might happen. This patch set changes the behavior as follows: Only mdcheck_continue.service actually starts and stops kernel-based sync actions. This service will continue at the saved start position if an MD_UUID* file exists, or start a new check at position 0 otherwise. Starting at 0 can be inhibited by creating a file /var/lib/mdcheck/Checked_$UUID. These files will be created by mdcheck when it finishes checking a given array. Thus future invocations of mdcheck_continue.service will not restart the check on this array. mdcheck_start.service runs mdcheck --restart, which simply removes all Checked_* markers from /var/lib/mdcheck, so that the next invocation of mdcheck_continue.service will start new checks on all arrays which don't have already running checks. The general behavior of the systemd timers and services is like before, but the mentioned race condition is avoided, even if the user modifies the timer settings arbitrarily. This set slightly changes the behavior of the mdcheck script. Without --continue, it will still start checks on all array, but unlike before it will skip arrays for with a Checked_ marker exists. To avoid that, run mdcheck --restart before mdcheck. More details in the commit description of the patch "mdcheck: simplify start / continue logic and add "--restart" logic". Martin Wilck (4): mdcheck: loop over sync_action files in sysfs mdcheck: replace deprecated "$[cnt+1]" syntax mdcheck: simplify start / continue logic and add "--restart" logic mdcheck: log to stderr from systemd units misc/mdcheck | 105 ++++++++++++++++++++----------- systemd/mdcheck_continue.service | 6 +- systemd/mdcheck_start.service | 3 +- systemd/mdcheck_start.timer | 2 +- 4 files changed, 75 insertions(+), 41 deletions(-) -- 2.51.0