On 4/28/25 1:14 PM, Jan Kara wrote: > On Sat 26-04-25 14:29:15, Christian Brauner wrote: >> On Wed, Apr 16, 2025 at 06:58:25PM +0000, Joe Damato wrote: >>> Avoid an edge case where epoll_wait arms a timer and calls schedule() >>> even if the timer will expire immediately. >>> >>> For example: if the user has specified an epoll busy poll usecs which is >>> equal or larger than the epoll_wait/epoll_pwait2 timeout, it is >>> unnecessary to call schedule_hrtimeout_range; the busy poll usecs have >>> consumed the entire timeout duration so it is unnecessary to induce >>> scheduling latency by calling schedule() (via schedule_hrtimeout_range). >>> >>> This can be measured using a simple bpftrace script: >>> >>> tracepoint:sched:sched_switch >>> / args->prev_pid == $1 / >>> { >>> print(kstack()); >>> print(ustack()); >>> } >>> >>> Before this patch is applied: >>> >>> Testing an epoll_wait app with busy poll usecs set to 1000, and >>> epoll_wait timeout set to 1ms using the script above shows: >>> >>> __traceiter_sched_switch+69 >>> __schedule+1495 >>> schedule+32 >>> schedule_hrtimeout_range+159 >>> do_epoll_wait+1424 >>> __x64_sys_epoll_wait+97 >>> do_syscall_64+95 >>> entry_SYSCALL_64_after_hwframe+118 >>> >>> epoll_wait+82 >>> >>> Which is unexpected; the busy poll usecs should have consumed the >>> entire timeout and there should be no reason to arm a timer. >>> >>> After this patch is applied: the same test scenario does not generate a >>> call to schedule() in the above edge case. If the busy poll usecs are >>> reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is >>> armed as expected. >>> >>> Fixes: bf3b9f6372c4 ("epoll: Add busy poll support to epoll with socket fds.") >>> Signed-off-by: Joe Damato <jdamato@xxxxxxxxxx> >>> Reviewed-by: Jan Kara <jack@xxxxxxx> >>> --- >>> v2: >>> - No longer an RFC and rebased on vfs/vfs.fixes >>> - Added Jan's Reviewed-by >>> - Added Fixes tag >>> - No functional changes from the RFC >>> >>> rfcv1: https://lore.kernel.org/linux-fsdevel/20250415184346.39229-1-jdamato@xxxxxxxxxx/ >>> >>> fs/eventpoll.c | 10 +++++++++- >>> 1 file changed, 9 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c >>> index 100376863a44..4bc264b854c4 100644 >>> --- a/fs/eventpoll.c >>> +++ b/fs/eventpoll.c >>> @@ -1996,6 +1996,14 @@ static int ep_try_send_events(struct eventpoll *ep, >>> return res; >>> } >>> >>> +static int ep_schedule_timeout(ktime_t *to) >>> +{ >>> + if (to) >>> + return ktime_after(*to, ktime_get()); >>> + else >>> + return 1; >>> +} >>> + >>> /** >>> * ep_poll - Retrieves ready events, and delivers them to the caller-supplied >>> * event buffer. >>> @@ -2103,7 +2111,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, >>> >>> write_unlock_irq(&ep->lock); >>> >>> - if (!eavail) >>> + if (!eavail && ep_schedule_timeout(to)) >>> timed_out = !schedule_hrtimeout_range(to, slack, >>> HRTIMER_MODE_ABS); >> >> Isn't this buggy? If @to is non-NULL and ep_schedule_timeout() returns >> false you want to set timed_out to 1 to break the wait. Otherwise you >> hang, no? > > Yep, looks like that. Good spotting! > I second that. Also, isn't ep_schedule_timeout buggy too? It compares a timeout value with the current time, that has to be reworked as well. I see this patch already queued for stable/linux-6.14.y. What's the plan with it? Thanks, ta