On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@xxxxxxxxxx> wrote:
>
> Not sure. I'm also guessing cond_resched is the reason for your
> observation, so that might be worth confirming is happening in whatever
> IO paths your workload is taking in case there's some other
> explanation.

Yep, you're spot on. We're hitting cond_resched() from various code
paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
rmap_walk_file(), etc.).

    sudo bpftrace -e 'k:psi_task_switch { $prev = (struct task_struct *)arg0; if ($prev->plug != 0) { if ($prev->plug->cur_ktime) { @[kstack(3)] = count(); } } }'
    Attaching 1 probe...
    ^C

    @[
        psi_task_switch+5
        __schedule+2081
        __cond_resched+51
    ]: 3044

> fs-writeback happens to work around it by unplugging if it knows
> cond_resched is going to schedule. The decision to unplug here wasn't
> necessarily because of the plug's ktime, but it gets the job done:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
>
> Doesn't really scale well to copy this for every caller of
> cond_resched(), though. An io specific helper implementation of
> cond_resched might help.
>
> Or if we don't want cond_resched to unplug (though I feel like you would
> normally want that), I think we could invalidate the ktime when
> scheduling to get the stats to read the current ktime after the process
> is scheduled back in.

Thanks. Makes sense to me. I'll try this out and report back.

> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
>  		 * between schedule and now.
>  		 */
>  	} while (need_resched());
> +
> +	if (current->flags & PF_BLOCK_TS)
> +		blk_plug_invalidate_ts(current);
>  }
>
>  #ifdef CONFIG_PREEMPTION
> --
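
For anyone following along, here is a rough userspace model of why invalidating the cached ktime at schedule time should fix the stats. It is only a sketch, not kernel code: struct plug_model, fake_clock, model_resched() and staleness_after_resched() are made-up names for illustration, and the 1000/5000 ns values are arbitrary. Only the cur_ktime field and the idea of blk_plug_invalidate_ts() come from the thread above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the plug; 0 in cur_ktime means "nothing cached". */
struct plug_model {
	uint64_t cur_ktime;
};

/* Hypothetical stand-in for the real clock, advanced by hand. */
static uint64_t fake_clock;

/* Return the cached timestamp, reading the clock only on the first call. */
static uint64_t plug_time_get_ns(struct plug_model *plug)
{
	assert(plug);
	if (!plug->cur_ktime)
		plug->cur_ktime = fake_clock;
	return plug->cur_ktime;
}

/* Models the effect of the proposed invalidation: drop the cached value. */
static void plug_invalidate_ts(struct plug_model *plug)
{
	plug->cur_ktime = 0;
}

/* Models a task being scheduled out for off_cpu_ns with the plug still held. */
static void model_resched(struct plug_model *plug, uint64_t off_cpu_ns,
			  int invalidate)
{
	fake_clock += off_cpu_ns;
	if (invalidate)
		plug_invalidate_ts(plug);
}

/* How far behind the clock is the first timestamp read after sched-in? */
uint64_t staleness_after_resched(int invalidate)
{
	struct plug_model plug = { 0 };

	fake_clock = 1000;
	plug_time_get_ns(&plug);		/* caches 1000 */
	model_resched(&plug, 5000, invalidate);	/* clock is now 6000 */
	return fake_clock - plug_time_get_ns(&plug);
}
```

In this model staleness_after_resched(0) returns 5000 (the whole off-CPU period is hidden from whoever reads the timestamp next), while staleness_after_resched(1) returns 0 because the next read goes back to the clock.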