Re: [PATCH 0/4] mdadm: rework mdcheck systemd service logic

Xiao Ni <xni@xxxxxxxxxx> · Mon, 15 Sep 2025 08:54:29 +0800

On Fri, Sep 12, 2025 at 11:34 PM Martin Wilck <martin.wilck@xxxxxxxx> wrote:
>
> This Patch set changes the logic of the "mdcheck" tool and the related systemd
> services mdcheck_start.service and mdcheck_continue.service.
> The set and related discussion are also posted as GitHub PR [1].
>
> The set is meant to be applied on top of PR#189 [2], which has already been
> merged in the current main branch on GitHub.
>
> [1] https://github.com/md-raid-utilities/mdadm/pull/190
> [2] https://github.com/md-raid-utilities/mdadm/pull/189
>
> The current behavior is like this:
>
> * mdcheck without arguments starts a RAID check on all arrays on the system,
>   starting at position 0. This is started from mdcheck_start.service,
>   started by a systemd timer once a month.
> * mdcheck --continue looks for files /var/lib/mdcheck/MD_UUID_$UUID, reads the
>   start position from them, and starts a check from that position on the array
>   with the respective UUID. This is started from a systemd timer every night.
>
> In either case, mdcheck won't do anything if the kernel is already running a
> sync_action on a given array. The check runs for a given period of time
> (default 6h) and saves the last position in the MD_UUID file, to be taken up
> when mdcheck --continue is called next time. When the entire array has been
> checked, the MD_UUID_ file is deleted. When all checks are finished,
> mdcheck_continue.timer is stopped, to be restarted when mdcheck_start.timer
> expires next time.
>
> Before the recent commit 8aa4ea9 ("systemd: start mdcheck_continue.timer
> before mdcheck_start.timer"), this could lead to a race condition when the
> check for a given array didn't complete throughout the month.
> mdcheck_start.service would start and reset the check position to 0
> before mdcheck_continue.service could pick up at the last saved
> position. Commit 8aa4ea9 works around this by starting
> mdcheck_continue.service a few minutes before mdcheck_start.timer.

Hi Martin

The race condition is caused by the faulty modification by admin.
commit 8aa4ea9 ("systemd: start mdcheck_continue.timer before
mdcheck_start.timer") already fixes the problem that an array should
continue to do the check if it doesn't finish checking in one month.
The admin changes the timer sequence back again, it's a faulty action.
We can add a warning comment in the timer service file to avoid such a
race.

>
> Yet the general problem still exists: both services trigger checks on the
> kernel's part which they can only passively monitor. So if a user plays with
> the timer settings (which he is in his rights to do), another similar race
> might happen.

diff --git a/misc/mdcheck b/misc/mdcheck
index 398a1ea607ca..7d1d79f795f0 100644
--- a/misc/mdcheck
+++ b/misc/mdcheck
@@ -116,7 +116,7 @@ do
        mdadm --detail --export "$dev" | grep '^MD_UUID=' > $tmp || continue
        source $tmp
        fl="/var/lib/mdcheck/MD_UUID_$MD_UUID"
-       if [ -z "$cont" ]
+       if [ -z "$cont" -a ! -f "$fl" ]
        then
                start=0
                logger -p daemon.info mdcheck start checking $dev

How about this? The start action checks the UUID file. So the check
will continue if it doesn't finish in one month.

Best Regards
Xiao
>
> This patch set changes the behavior as follows:
>
> Only mdcheck_continue.service actually starts and stops kernel-based sync
> actions. This service will continue at the saved start position if an MD_UUID*
> file exists, or start a new check at position 0 otherwise. Starting at 0 can
> be inhibited by creating a file /var/lib/mdcheck/Checked_$UUID. These files
> will be created by mdcheck when it finishes checking a given array. Thus
> future invocations of mdcheck_continue.service will not restart the check on
> this array.
>
> mdcheck_start.service runs mdcheck --restart, which simply removes all
> Checked_* markers from /var/lib/mdcheck, so that the next invocation of
> mdcheck_continue.service will start new checks on all arrays which don't have
> already running checks.
>
> The general behavior of the systemd timers and services is like before, but
> the mentioned race condition is avoided, even if the user modifies the timer
> settings arbitrarily.
>
> This set slightly changes the behavior of the mdcheck script. Without
> --continue, it will still start checks on all array, but unlike before it will
> skip arrays for with a Checked_ marker exists. To avoid that, run mdcheck
> --restart before mdcheck.
>
> More details in the commit description of the patch "mdcheck: simplify start /
> continue logic and add "--restart" logic".
>
> Martin Wilck (4):
>   mdcheck: loop over sync_action files in sysfs
>   mdcheck: replace deprecated "$[cnt+1]" syntax
>   mdcheck: simplify start / continue logic and add "--restart" logic
>   mdcheck: log to stderr from systemd units
>
>  misc/mdcheck                     | 105 ++++++++++++++++++++-----------
>  systemd/mdcheck_continue.service |   6 +-
>  systemd/mdcheck_start.service    |   3 +-
>  systemd/mdcheck_start.timer      |   2 +-
>  4 files changed, 75 insertions(+), 41 deletions(-)
>
> --
> 2.51.0
>