On Fri, Sep 12, 2025 at 11:34 PM Martin Wilck <martin.wilck@xxxxxxxx> wrote: > > This Patch set changes the logic of the "mdcheck" tool and the related systemd > services mdcheck_start.service and mdcheck_continue.service. > The set and related discussion are also posted as GitHub PR [1]. > > The set is meant to be applied on top of PR#189 [2], which has already been > merged in the current main branch on GitHub. > > [1] https://github.com/md-raid-utilities/mdadm/pull/190 > [2] https://github.com/md-raid-utilities/mdadm/pull/189 > > The current behavior is like this: > > * mdcheck without arguments starts a RAID check on all arrays on the system, > starting at position 0. This is started from mdcheck_start.service, > started by a systemd timer once a month. > * mdcheck --continue looks for files /var/lib/mdcheck/MD_UUID_$UUID, reads the > start position from them, and starts a check from that position on the array > with the respective UUID. This is started from a systemd timer every night. > > In either case, mdcheck won't do anything if the kernel is already running a > sync_action on a given array. The check runs for a given period of time > (default 6h) and saves the last position in the MD_UUID file, to be taken up > when mdcheck --continue is called next time. When the entire array has been > checked, the MD_UUID_ file is deleted. When all checks are finished, > mdcheck_continue.timer is stopped, to be restarted when mdcheck_start.timer > expires next time. > > Before the recent commit 8aa4ea9 ("systemd: start mdcheck_continue.timer > before mdcheck_start.timer"), this could lead to a race condition when the > check for a given array didn't complete throughout the month. > mdcheck_start.service would start and reset the check position to 0 > before mdcheck_continue.service could pick up at the last saved > position. Commit 8aa4ea9 works around this by starting > mdcheck_continue.service a few minutes before mdcheck_start.timer. Hi Martin The race condition is caused by the faulty modification by admin. commit 8aa4ea9 ("systemd: start mdcheck_continue.timer before mdcheck_start.timer") already fixes the problem that an array should continue to do the check if it doesn't finish checking in one month. The admin changes the timer sequence back again, it's a faulty action. We can add a warning comment in the timer service file to avoid such a race. > > Yet the general problem still exists: both services trigger checks on the > kernel's part which they can only passively monitor. So if a user plays with > the timer settings (which he is in his rights to do), another similar race > might happen. diff --git a/misc/mdcheck b/misc/mdcheck index 398a1ea607ca..7d1d79f795f0 100644 --- a/misc/mdcheck +++ b/misc/mdcheck @@ -116,7 +116,7 @@ do mdadm --detail --export "$dev" | grep '^MD_UUID=' > $tmp || continue source $tmp fl="/var/lib/mdcheck/MD_UUID_$MD_UUID" - if [ -z "$cont" ] + if [ -z "$cont" -a ! -f "$fl" ] then start=0 logger -p daemon.info mdcheck start checking $dev How about this? The start action checks the UUID file. So the check will continue if it doesn't finish in one month. Best Regards Xiao > > This patch set changes the behavior as follows: > > Only mdcheck_continue.service actually starts and stops kernel-based sync > actions. This service will continue at the saved start position if an MD_UUID* > file exists, or start a new check at position 0 otherwise. Starting at 0 can > be inhibited by creating a file /var/lib/mdcheck/Checked_$UUID. These files > will be created by mdcheck when it finishes checking a given array. Thus > future invocations of mdcheck_continue.service will not restart the check on > this array. > > mdcheck_start.service runs mdcheck --restart, which simply removes all > Checked_* markers from /var/lib/mdcheck, so that the next invocation of > mdcheck_continue.service will start new checks on all arrays which don't have > already running checks. > > The general behavior of the systemd timers and services is like before, but > the mentioned race condition is avoided, even if the user modifies the timer > settings arbitrarily. > > This set slightly changes the behavior of the mdcheck script. Without > --continue, it will still start checks on all array, but unlike before it will > skip arrays for with a Checked_ marker exists. To avoid that, run mdcheck > --restart before mdcheck. > > More details in the commit description of the patch "mdcheck: simplify start / > continue logic and add "--restart" logic". > > Martin Wilck (4): > mdcheck: loop over sync_action files in sysfs > mdcheck: replace deprecated "$[cnt+1]" syntax > mdcheck: simplify start / continue logic and add "--restart" logic > mdcheck: log to stderr from systemd units > > misc/mdcheck | 105 ++++++++++++++++++++----------- > systemd/mdcheck_continue.service | 6 +- > systemd/mdcheck_start.service | 3 +- > systemd/mdcheck_start.timer | 2 +- > 4 files changed, 75 insertions(+), 41 deletions(-) > > -- > 2.51.0 >