On Mon, Apr 28, 2025 at 02:14:45PM +0200, Jan Kara wrote: > On Sat 26-04-25 14:29:15, Christian Brauner wrote: > > On Wed, Apr 16, 2025 at 06:58:25PM +0000, Joe Damato wrote: > > > Avoid an edge case where epoll_wait arms a timer and calls schedule() > > > even if the timer will expire immediately. > > > > > > For example: if the user has specified an epoll busy poll usecs which is > > > equal or larger than the epoll_wait/epoll_pwait2 timeout, it is > > > unnecessary to call schedule_hrtimeout_range; the busy poll usecs have > > > consumed the entire timeout duration so it is unnecessary to induce > > > scheduling latency by calling schedule() (via schedule_hrtimeout_range). > > > > > > This can be measured using a simple bpftrace script: > > > > > > tracepoint:sched:sched_switch > > > / args->prev_pid == $1 / > > > { > > > print(kstack()); > > > print(ustack()); > > > } > > > > > > Before this patch is applied: > > > > > > Testing an epoll_wait app with busy poll usecs set to 1000, and > > > epoll_wait timeout set to 1ms using the script above shows: > > > > > > __traceiter_sched_switch+69 > > > __schedule+1495 > > > schedule+32 > > > schedule_hrtimeout_range+159 > > > do_epoll_wait+1424 > > > __x64_sys_epoll_wait+97 > > > do_syscall_64+95 > > > entry_SYSCALL_64_after_hwframe+118 > > > > > > epoll_wait+82 > > > > > > Which is unexpected; the busy poll usecs should have consumed the > > > entire timeout and there should be no reason to arm a timer. > > > > > > After this patch is applied: the same test scenario does not generate a > > > call to schedule() in the above edge case. If the busy poll usecs are > > > reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is > > > armed as expected. > > > > > > Fixes: bf3b9f6372c4 ("epoll: Add busy poll support to epoll with socket fds.") > > > Signed-off-by: Joe Damato <jdamato@xxxxxxxxxx> > > > Reviewed-by: Jan Kara <jack@xxxxxxx> > > > --- > > > v2: > > > - No longer an RFC and rebased on vfs/vfs.fixes > > > - Added Jan's Reviewed-by > > > - Added Fixes tag > > > - No functional changes from the RFC > > > > > > rfcv1: https://lore.kernel.org/linux-fsdevel/20250415184346.39229-1-jdamato@xxxxxxxxxx/ > > > > > > fs/eventpoll.c | 10 +++++++++- > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > > > index 100376863a44..4bc264b854c4 100644 > > > --- a/fs/eventpoll.c > > > +++ b/fs/eventpoll.c > > > @@ -1996,6 +1996,14 @@ static int ep_try_send_events(struct eventpoll *ep, > > > return res; > > > } > > > > > > +static int ep_schedule_timeout(ktime_t *to) > > > +{ > > > + if (to) > > > + return ktime_after(*to, ktime_get()); > > > + else > > > + return 1; > > > +} > > > + > > > /** > > > * ep_poll - Retrieves ready events, and delivers them to the caller-supplied > > > * event buffer. > > > @@ -2103,7 +2111,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, > > > > > > write_unlock_irq(&ep->lock); > > > > > > - if (!eavail) > > > + if (!eavail && ep_schedule_timeout(to)) > > > timed_out = !schedule_hrtimeout_range(to, slack, > > > HRTIMER_MODE_ABS); > > > > Isn't this buggy? If @to is non-NULL and ep_schedule_timeout() returns > > false you want to set timed_out to 1 to break the wait. Otherwise you > > hang, no? > > Yep, looks like that. Good spotting! Thank you for spotting that and sorry for the trouble. Christian / Jan what would be the correct way for me to deal with this? Would it be to post a v3 (re-submitting the patch in its entirety) or to post a new patch that fixes the original and lists the commit sha from vfs.fixes with a Fixes tag ? I was going to propose the following, which if I understand correctly, should fix the issue Christian identified: diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 4bc264b854c4..1a5d1147f082 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2111,7 +2111,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, write_unlock_irq(&ep->lock); - if (!eavail && ep_schedule_timeout(to)) + if (!ep_schedule_timeout(to)) + timed_out = 1; + else if (!eavail) timed_out = !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS); __set_current_state(TASK_RUNNING);