Re: raid5-cache / log / journal hung task during log replay

Hi,
> May be able to test on 6.14.2 on snapshots of the disks tomorrow

Same issue on 6.15.0-rc3; debug log attached below.

The culprit seems to be the call to r5c_recovery_alloc_stripe() with noblock set to 0 in r5c_recovery_analyze_meta_block() (raid5-cache.c:2155), right after the stripe cache size is raised to 32768. From there, raid5_get_active_stripe() fails to get a free stripe, wakes the reclaim thread, and then waits for the R5_INACTIVE_BLOCKED flag to be cleared, which never happens.

With my limited debugging skills, it looks like the reclaim thread never does anything in r5l_do_reclaim(), because reclaim_target == reclaimable == 0.

My attempts to reproduce this without the member devices (by zero-mapping everything but the superblocks) did not trigger the same issue. The relevant part of the journal device is ~12 GiB, so uploading the journal is feasible; including the relevant parts of the RAID members may be harder.

@Song Liu
Sorry for the unsolicited CC, but you seem to be the only main contributor to the raid5 log code who is still regularly active. If there is a better place to send bug reports like this, please let me know.


Linux ubuntu-raid6-recovery 6.15.0-rc3mainline #5 SMP PREEMPT_DYNAMIC Mon Apr 21 18:00:22 CEST 2025 x86_64 x86_64 x86_64 GNU/Linux
md: md127 stopped.
raid456: run(md127) called.
md/raid:md127: device sdf operational as raid disk 0
md/raid:md127: device sdj operational as raid disk 9
md/raid:md127: device sdi operational as raid disk 8
md/raid:md127: device sdg operational as raid disk 7
md/raid:md127: device sdb operational as raid disk 6
md/raid:md127: device sdd operational as raid disk 5
md/raid:md127: device sdk operational as raid disk 4
md/raid:md127: device sde operational as raid disk 3
md/raid:md127: device sda operational as raid disk 2
md/raid:md127: device sdc operational as raid disk 1
md/raid:md127: allocated 10636kB
md/raid:md127: raid level 6 active with 10 out of 10 devices, algorithm 2
RAID conf printout:
 --- level:6 rd:10 wd:10
 disk 0, o:1, dev:sdf
 disk 1, o:1, dev:sdc
 disk 2, o:1, dev:sda
 disk 3, o:1, dev:sde
 disk 4, o:1, dev:sdk
 disk 5, o:1, dev:sdd
 disk 6, o:1, dev:sdb
 disk 7, o:1, dev:sdg
 disk 8, o:1, dev:sdi
 disk 9, o:1, dev:sdj
md/raid:md127: using device sdh as journal
get_stripe, sector 7183404032
__find_stripe, sector 7183404032
__stripe 7183404032 not in cache
remove_hash(), stripe 0
init_stripe called, stripe 7183404032
insert_hash(), stripe 7183404032
get_stripe, sector 7183404040
__find_stripe, sector 7183404040
__stripe 7183404040 not in cache
remove_hash(), stripe 0
init_stripe called, stripe 7183404040
insert_hash(), stripe 7183404040
get_stripe, sector 7183404048
__find_stripe, sector 7183404048
__stripe 7183404048 not in cache
remove_hash(), stripe 0

[..]

__find_stripe, sector 7200159744
__stripe 7200159744 not in cache
get_stripe, sector 7200159744
__find_stripe, sector 7200159744
__stripe 7200159744 not in cache
md/raid:md127: Increasing stripe cache size to 512 to recovery data on journal.
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
get_stripe, sector 7200159744
__find_stripe, sector 7200159744
__stripe 7200159744 not in cache
remove_hash(), stripe 0
init_stripe called, stripe 7200159744
insert_hash(), stripe 7200159744
get_stripe, sector 7200159752
__find_stripe, sector 7200159752
__stripe 7200159752 not in cache
remove_hash(), stripe 0
init_stripe called, stripe 7200159752
insert_hash(), stripe 7200159752
get_stripe, sector 7200159760
__find_stripe, sector 7200159760
__stripe 7200159760 not in cache
get_stripe, sector 7200159760
__find_stripe, sector 7200159760
__stripe 7200159760 not in cache
md/raid:md127: Increasing stripe cache size to 1024 to recovery data on journal.
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive
+++ raid5d active
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
__get_priority_stripe: handle: empty hold: empty full_writes: 0 bypass_count: 0
0 stripes handled
--- raid5d inactive

[..]

get_stripe, sector 7200204176
__find_stripe, sector 7200204176
__stripe 7200204176 not in cache
remove_hash(), stripe 7199919696
init_stripe called, stripe 7200204176
insert_hash(), stripe 7200204176
get_stripe, sector 7200204184
__find_stripe, sector 7200204184
__stripe 7200204184 not in cache
remove_hash(), stripe 7199919704
init_stripe called, stripe 7200204184
insert_hash(), stripe 7200204184
get_stripe, sector 7200204192
__find_stripe, sector 7200204192
__stripe 7200204192 not in cache
get_stripe, sector 7200204192
__find_stripe, sector 7200204192
__stripe 7200204192 not in cache
md/raid:md127: Increasing stripe cache size to 32768 to recovery data on journal.
get_stripe, sector 7200204192
__find_stripe, sector 7200204192
__stripe 7200204192 not in cache




