Re: Kernel mistakenly "starts" resync on fully-degraded, newly-created raid10 array

On 6/22/25 09:30, Wol wrote:
On 22/06/2025 02:39, Omari Stephens wrote:
I tried asking on Reddit, and ended up resolving the issue myself:
https://www.reddit.com/r/linuxquestions/comments/1lh9to0/kernel_is_stuck_resyncing_a_4drive_raid10_array/

I run Debian SID, and am using kernel 6.12.32-amd64

#apt-cache policy linux-image-amd64
linux-image-amd64:
   Installed: 6.12.32-1
   Candidate: 6.12.32-1
   Version table:
  *** 6.12.32-1 500
         500 http://mirrors.kernel.org/debian unstable/main amd64 Packages
         500 http://http.us.debian.org/debian unstable/main amd64 Packages
         100 /var/lib/dpkg/status

#uname -r
6.12.32-amd64

To summarize the issue and my diagnostic steps, I ran this command to create a new raid10 array:

#mdadm --create md13 --name=media --level=10 --layout=f2 -n 4 /dev/sdb1 missing /dev/sdf1 missing

At that point, /proc/mdstat showed the following, which makes no sense:

Why doesn't it make any sense?

Don't forget a raid-10 in Linux is NOT two raid-0s in a raid-1; it's its own thing entirely.

Understood. I've been using Linux MD raid10 for over a decade. I've read through this (and other references) in depth:
https://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

My question is this: Suppose you create a 4-drive array. 2 drives are missing. What data is there to synchronize? What should get copied where, or what should get recomputed and written where?

To my understanding, in that situation, each block in the array only appears in one place on the physical media, and there is no redundancy or parity for any block that could be out of sync.
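For concreteness, my mental model of the far-2 layout on 4 devices looks roughly like this (chunk letters are just for illustration; the kernel's exact copy rotation may differ slightly, but the conclusion is the same):

              slot0  slot1  slot2  slot3
  near half:    A      B      C      D   ...
  far half:     D      A      B      C   ...

With slots 1 and 3 missing, every chunk still has exactly one surviving copy (A and D on slot 0, B and C on slot 2), so there is nothing for a resync to compare or copy.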

When you read from the array, yes, you're going to get interleaved bits of whatever happened to be on the physical media to start with, but that's basically the same as reading directly from any new physical media -- it's not initialized until it's initialized, and until it is, you don't know what you're going to read.


md127 : active raid10 sdb1[2] sdc1[0]
       23382980608 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]
       [>....................]  resync =  0.0% (8594688/23382980608) finish=25176161501.3min speed=0K/sec
       bitmap: 175/175 pages [700KB], 65536KB chunk

With 2 drives present and 2 drives absent, the array can only start if the present drives are considered in sync.  The kernel spent most of a day in this state.  The "8594688" count increased very slowly over time, but after 24 hours, it was only up to 0.1%.  During that time, I had mounted the array and transferred 11TB of data onto it.

I can't see any mention of drive size, so your % complete is meaningless, but I would say my raid with about 12TB of disk takes a couple of days to sort itself out ...

The rate of resync completion was ~0. The estimated time to completion was over 17 million days (roughly 48,000 years). Again, my hypothesis is that this is because the system was confused and wasn't actually doing anything meaningful. (Although the md127_resync process was sitting at 100% CPU usage the entire time; no clue what it was spending those cycles on.)
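For reference, the kernel also exposes its view of the sync state in sysfs, which is handy for this kind of digging (md127 assumed here):

#cat /sys/block/md127/md/sync_action
#cat /sys/block/md127/md/sync_completed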

Here's my disk layout, currently, after successfully adding the last two drives:
$lsblk /dev/sda /dev/sdb /dev/sdc /dev/sde
NAME              MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINTS
sda                 8:0    0  10.9T  0 disk
└─sda1              8:1    0  10.9T  0 part
  └─md127           9:127  0  21.8T  0 raid10
    ├─media_crypt 253:0    0  21.8T  0 crypt  /mnt/home_media
    └─md127p1     259:0    0 492.2G  0 part
sdb                 8:16   0  10.9T  0 disk
└─sdb1              8:17   0  10.9T  0 part
  └─md127           9:127  0  21.8T  0 raid10
    ├─media_crypt 253:0    0  21.8T  0 crypt  /mnt/home_media
    └─md127p1     259:0    0 492.2G  0 part
sdc                 8:32   0  10.9T  0 disk
└─sdc1              8:33   0  10.9T  0 part
  └─md127           9:127  0  21.8T  0 raid10
    ├─media_crypt 253:0    0  21.8T  0 crypt  /mnt/home_media
    └─md127p1     259:0    0 492.2G  0 part
sde                 8:64   0  10.9T  0 disk
└─sde1              8:65   0  10.9T  0 part
  └─md127           9:127  0  21.8T  0 raid10
    ├─media_crypt 253:0    0  21.8T  0 crypt  /mnt/home_media
    └─md127p1     259:0    0 492.2G  0 part


Then, when I power-cycled, swapped SATA cables, and added the remaining drives, they were marked as spares and weren't added to the array (likely because the array was considered to be already resyncing):

I think you're right - if the array is already rebuilding, it can't start a new, different rebuild halfway through the old one ...

#mdadm --detail /dev/md127
/dev/md127:
[...]
     Number   Major   Minor   RaidDevice State
        0       8       33        0      active sync   /dev/sdc1
        -       0        0        1      removed
        2       8       17        2      active sync   /dev/sdb1
        -       0        0        3      removed

        4       8        1        -      spare   /dev/sda1
        5       8       65        -      spare   /dev/sde1
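(An untested thought: freezing the bogus resync first, e.g.

#echo frozen > /sys/block/md127/md/sync_action

might have let the spares be absorbed without a re-create, but that's pure speculation on my part.)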


I ended up resolving the issue by recreating the array with --assume-clean:

Bad idea !!! It's okay, especially with new drives and a new array, but it will leave the array in a random state. Not good ...

#mdadm --create md19 --name=media3 --assume-clean --readonly --level=10 --layout=f2 -n 4 /dev/sdc1 missing /dev/sdb1 missing
To optimalize recovery speed, it is recommended to enable write-indent bitmap, do you want to enable it now? [y/N]? y
mdadm: /dev/sdc1 appears to be part of a raid array:
        level=raid10 devices=4 ctime=Sun Jun 22 00:51:33 2025
mdadm: /dev/sdb1 appears to be part of a raid array:
        level=raid10 devices=4 ctime=Sun Jun 22 00:51:33 2025
Continue creating array [y/N]? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/md19 started.

#cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md127 : active (read-only) raid10 sdb1[2] sdc1[0]
       23382980608 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]
       bitmap: 175/175 pages [700KB], 65536KB chunk
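(Side note for anyone following the same path: before re-creating, it's worth grabbing the old superblock parameters so the new --create matches them exactly, e.g.

#mdadm --examine /dev/sdb1 /dev/sdc1

and confirming that chunk size, layout, data offset, and device order all line up.)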

At which point, I was able to add the new devices and have the array (start to) resync as expected:

Yup. Now the two-drive array is not resyncing, the new drives can be added and will resync.

#mdadm --manage /dev/md127 --add /dev/sda1 --add /dev/sde1
mdadm: added /dev/sda1
mdadm: added /dev/sde1

#cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md127 : active raid10 sde1[5] sda1[4] sdc1[0] sdb1[2]
       23382980608 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]
       [>....................]  recovery =  0.0% (714112/11691490304) finish=1091.3min speed=178528K/sec
       bitmap: 0/175 pages [0KB], 65536KB chunk

#mdadm --detail /dev/md127
/dev/md127:
[...]
     Number   Major   Minor   RaidDevice State
        0       8       33        0      active sync   /dev/sdc1
        5       8       65        1      spare rebuilding   /dev/sde1
        2       8       17        2      active sync   /dev/sdb1
        4       8        1        3      spare rebuilding   /dev/sda1

--xsdg


Now you have an array where anything you have written will be okay (which I guess is what you care about), but the rest of the disk is uninitialised garbage that will instantly trigger a read fault if you try to read it.

Because these happen to be brand-new (but pre-zeroed) hard disks, the on-disk state happens to be all zeroes anyway. Where would the read fault come from, though? What is there to be out of sync? raid10 has no parity, and there's no redundancy across 2 disks in a 4-disk raid10.

#dd if=/dev/md127 bs=1 skip=20T count=1M | hd
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00100000
1048576+0 records in
1048576+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 1.37091 s, 765 kB/s

And to be clear, that's way past where the resync status is right now:
$cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md127 : active raid10 sde1[5] sda1[4] sdc1[0] sdb1[2]
      23382980608 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]
      [=============>.......] recovery = 69.8% (8170708928/11691490304) finish=415.6min speed=141166K/sec
      bitmap: 13/175 pages [52KB], 65536KB chunk

--xsdg


You need to set off a scrub, which will do those reads and get the array itself (not just your data) into a sane state.
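Something like this (using md127 from this thread) kicks one off, with progress showing up in /proc/mdstat:

#echo check > /sys/block/md127/md/sync_action

Writing "repair" instead of "check" will also rewrite any mismatched copies it finds.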

https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/

Ignore the "obsolete content" banners. Somebody clearly thinks that replacing USER documentation with double-dutch programmer documentation (aimed at a completely different audience) is a good idea ...

Cheers,
Wol




