Hi,
On 2025/04/22 3:10, Matt Fleming wrote:
On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@xxxxxxxxxx> wrote:
Not sure. I'm also guessing cond_resched is the reason for your
observation, so that might be worth confirming is happening in whatever
IO paths your workload is taking in case there's some other
explanation.
Yep, you're spot on. We're hitting cond_resched() from various code
paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
rmap_walk_file(), etc, etc).
All plugged IO must be submitted before the task is scheduled out, so
there is no point in going in this direction. :(
Please check the other mail that I replied to your original report,
it'll make sense if a task keeps running on one cpu for milliseconds.
Thanks,
Kuai
sudo bpftrace -e 'k:psi_task_switch {
    $prev = (struct task_struct *)arg0;
    if ($prev->plug != 0) {
        if ($prev->plug->cur_ktime) {
            @[kstack(3)] = count();
        }
    }
}'
Attaching 1 probe...
^C
@[
psi_task_switch+5
__schedule+2081
__cond_resched+51
]: 3044
fs-writeback happens to work around it by unplugging if it knows
cond_resched is going to schedule. The decision to unplug here wasn't
necessarily because of the plug's ktime, but it gets the job done:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
Doesn't really scale well to copy this for every caller of
cond_resched(), though. An io specific helper implementation of
cond_resched might help.
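
For illustration, the shape of such a helper could look roughly like the
userspace sketch below. It only mirrors the pattern used in fs-writeback
(flush the plug, then yield). Everything here is hypothetical:
io_cond_resched(), struct plug_stub and the stub_*() functions are
stand-ins invented for this example, not existing kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plug: counts queued and submitted IOs. */
struct plug_stub {
	int pending;	/* IOs held back in the plug */
	int flushed;	/* IOs already submitted */
};

static bool resched_needed;	/* stand-in for need_resched() state */
static int yields;		/* how many times we gave up the CPU */

static bool stub_need_resched(void)
{
	return resched_needed;
}

/* Stand-in for flushing the plug: submit everything that is pending. */
static void stub_flush_plug(struct plug_stub *plug)
{
	assert(plug != NULL);
	plug->flushed += plug->pending;
	plug->pending = 0;
}

/* Stand-in for cond_resched(): yield only if a reschedule is due. */
static void stub_cond_resched(void)
{
	if (resched_needed) {
		yields++;
		resched_needed = false;
	}
}

/*
 * Hypothetical IO-aware helper: if we are about to give up the CPU,
 * get the plugged IOs out the door first, then yield.
 */
static void io_cond_resched(struct plug_stub *plug)
{
	if (stub_need_resched()) {
		if (plug)
			stub_flush_plug(plug);
		stub_cond_resched();
	}
}
```

When no reschedule is due the helper does nothing, so the plug keeps
batching IOs. The flush only happens on the path that actually yields.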
Or if we don't want cond_resched to unplug (though I feel like you would
normally want that), I think we could invalidate the ktime when
scheduling to get the stats to read the current ktime after the process
is scheduled back in.
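
To make the invalidation idea concrete, here is a small userspace model
of a cached plug timestamp. It is a sketch under stated assumptions, not
the kernel implementation: struct plug_model, fake_clock_ns and the
model_*() functions are hypothetical names, and a cached value of 0 is
assumed to mean "nothing cached".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a plug that caches one timestamp. */
struct plug_model {
	uint64_t cur_ktime;	/* cached time in ns, 0 means "not cached" */
};

static uint64_t fake_clock_ns;	/* stand-in for the real clock */

/* Return the cached time, reading the clock only on the first call. */
static uint64_t model_time_get_ns(struct plug_model *plug)
{
	if (!plug->cur_ktime)
		plug->cur_ktime = fake_clock_ns;
	return plug->cur_ktime;
}

/* Drop the cached time so the next reader fetches a fresh one. */
static void model_invalidate_ts(struct plug_model *plug)
{
	plug->cur_ktime = 0;
}

/*
 * Model the task being scheduled out for off_cpu_ns and back in.
 * With invalidate == false the cached time goes stale across the gap.
 */
static void model_sched_out_in(struct plug_model *plug, uint64_t off_cpu_ns,
			       bool invalidate)
{
	fake_clock_ns += off_cpu_ns;
	if (invalidate)
		model_invalidate_ts(plug);
}
```

Without the invalidation, a reader after the switch still sees the
timestamp taken before the task went off-CPU, so any stat computed from
it is off by the time spent scheduled out.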
Thanks. Makes sense to me. I'll try this out and report back.
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
 		 * between schedule and now.
 		 */
 	} while (need_resched());
+
+	if (current->flags & PF_BLOCK_TS)
+		blk_plug_invalidate_ts(current);
 }
 
 #ifdef CONFIG_PREEMPTION