Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending

Peter Xu <peterx@xxxxxxxxxx> · Thu, 24 Apr 2025 15:13:38 -0400

On Thu, Apr 24, 2025 at 12:40:56PM -0600, Jens Axboe wrote:
> On 4/24/25 12:26 PM, Peter Xu wrote:
> > On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote:
> >> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> >>> userfaultfd may use interruptible sleeps to wait on userspace filling
> >>> a page fault, which works fine if the task can be reliably put to
> >>> sleeping waiting for that. However, if the task has a normal (ie
> >>> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> >>> cause schedule() to be a no-op.
> >>>
> >>> For a task that registers a page with userfaultfd and then proceeds
> >>> to do a write from it, if that task also has a signal pending then
> >>> it'll essentially busy loop from do_page_fault() -> handle_userfault()
> >>> until that fault has been filled. Normally it'd be expected that the
> >>> task would sleep until that happens. Here's a trace from an application
> >>> doing just that:
> >>>
> >>> handle_userfault+0x4b8/0xa00 (P)
> >>> hugetlb_fault+0xe24/0x1060
> >>> handle_mm_fault+0x2bc/0x318
> >>> do_page_fault+0x1e8/0x6f0
> >>
> >> Makes sense. There is a fault_signal_pending() check before retrying:
> >>
> >> static inline bool fault_signal_pending(vm_fault_t fault_flags,
> >>                                         struct pt_regs *regs)
> >> {
> >>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
> >>                         (fatal_signal_pending(current) ||
> >>                          (user_mode(regs) && signal_pending(current))));
> >> }
> >>
> >> Since it's an in-kernel fault, and the signal is non-fatal, it won't
> >> stop looping until the fault is handled.
> >>
> >> This in itself seems a bit sketchy. You have to hope there is no
> >> dependency between handling the signal -> handling the fault inside
> >> the userspace components.
> > 
> > True. So far, my understanding is e.g. in an userfaultfd context the signal
> > handler is responsible for not touching any possible trapped pages, or the
> > sighandler needs fixing on its own.
> > 
> >>
> >>> do_translation_fault+0x9c/0xd0
> >>> do_mem_abort+0x44/0xa0
> >>> el1_abort+0x3c/0x68
> >>> el1h_64_sync_handler+0xd4/0x100
> >>> el1h_64_sync+0x6c/0x70
> >>> fault_in_readable+0x74/0x108 (P)
> >>> iomap_file_buffered_write+0x14c/0x438
> >>> blkdev_write_iter+0x1a8/0x340
> >>> vfs_write+0x20c/0x348
> >>> ksys_write+0x64/0x108
> >>> __arm64_sys_write+0x1c/0x38
> >>>
> >>> where the task is looping with 100% CPU time in the above mentioned
> >>> fault path.
> >>>
> >>> Since it's impossible to handle signals, or other conditions like
> >>> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
> >>> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
> >>> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
> >>> signals will still be handled by the caller, and the timeout is short
> >>> enough to hopefully not cause any issues. If this is the first invocation
> >>> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
> >>> is used.
> >>>
> >>> Cc: stable@xxxxxxxxxxxxxxx
> >>> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> >>
> >> When this patch was first introduced, VM_FAULT_RETRY would work only
> >> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
> >> causing handle_userfault() to return VM_SIGBUS, which would bubble
> >> through the fixup table (kernel fault), -EFAULT from
> >> iomap_file_buffered_write() and unwind the kernel stack this way.
> > 
> > AFAIU we can't rely on the exception fixups because when reaching there it
> > means the user access is going to get a -EFAULT, but here the right
> > behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done.
> > 
> >>
> >> So I'm thinking this is the more likely commit for Fixes: and stable:
> >>
> >> commit 4064b982706375025628094e51d11cf1a958a5d3
> >> Author: Peter Xu <peterx@xxxxxxxxxx>
> >> Date:   Wed Apr 1 21:08:45 2020 -0700
> >>
> >>     mm: allow VM_FAULT_RETRY for multiple times
> > 
> > IMHO the multiple attempts are still fine, instead it's problematic if we
> > wait in INTERRUPTIBLE mode even in !user mode..  so maybe it's slightly
> > more suitable to use this as Fixes:
> > 
> > commit c270a7eedcf278304e05ebd2c96807487c97db61
> > Author: Peter Xu <peterx@xxxxxxxxxx>
> > Date:   Wed Apr 1 21:08:41 2020 -0700
> > 
> >     mm: introduce FAULT_FLAG_INTERRUPTIBLE
> > 
> > The important change there is:
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 888272621f38..c076d3295958 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >         uwq.ctx = ctx;
> >         uwq.waken = false;
> >  
> > -       return_to_userland =
> > -               (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
> > -               (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
> > +       return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
> >         blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
> >                          TASK_KILLABLE;
> > 
> > I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in
> > some other use cases like GUP we'd still want the threads (KVM does GUP and
> > it's a heavy user of userfaultfd) to respond to non-fatals.
> > 
> > However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP
> > and if it's non-user either.
> > 
> > So in general, some trivial concerns here on the patch..
> > 
> > Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if
> > FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after
> > all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals
> > too!".
> 
> First of all, it won't respond to signals _right now_ if waiting on
> userfaultd, so that ABI violation already exists. The UNINTERRUPTIBLE
> doesn't really change that at all.

Since we can't check pending signals in a loop.. IIUC it's still ok if we
only check it right before a major sleep would happen.  IIUC that means the
old code (at least the userfaultfd path..) is following the rules.

But yeah, I think even with HZ/10 timeout, it will still respond to
non-fatal signals more or less.  It isn't that bad, so my current concerns
are more of nitpicking category.  Sorry if that wasn't clear.  I'm just
trying to see whether we have something even better (and if possible, easier).

> 
> > Secondly, userfaultfd is indeed the only consumer of
> > FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> > this patch resolves it for userfaultfd, it might get caught again later if
> > something else in the kernel starts to respects the _INTERRUPTIBLE flag
> > request.  For example, __folio_lock_or_retry() ignores that flag so far,
> > but logically it should obey too (with a folio_wait_locked_interruptible)..
> > 
> > I also think it's not as elegant to have the magic HZ/10, and it's also
> > destined even the loop is less frequent that's a waste of time (as if the
> > user page access comes from kernel context, we must wait... until the page
> > fault is resolved..).
> 
> Yeah I don't love the magic either, but the actual value of it isn't
> important - it's just to prevent a CPU spin for these cases.

Right.  If this patch is the only thing so far can fix it, I'm OK we go
with it first.  Said that..

> 
> > Is it possible we simply unset the request from the top?  As discussed
> > above, I think we still need to make sure GUP at least works for
> > non-fatals, however I think it might be more reasonable we never set
> > _INTERRUPTIBLE for !gup, then this problem might go away too with all above
> > concerns addressed.
> > 
> > A not-even-compiled patch just to clarify what I meant (and it won't work
> > unless it makes sense to both of you and we'll need to touch all archs when
> > changing the default flags):
> > 
> > ===8<===
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 296d294142c8..fa721525d93a 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> >          * We set FAULT_FLAG_USER based on the register state, not
> >          * based on X86_PF_USER. User space accesses that cause
> >          * system page faults are still user accesses.
> > +        *
> > +        * When we're in user mode, allow fast response on non-fatal
> > +        * signals.  Do not set this in kernel mode faults because normally
> > +        * a kernel fault means the fault must be resolved anyway before
> > +        * going back to userspace.
> >          */
> >         if (user_mode(regs))
> > -               flags |= FAULT_FLAG_USER;
> > +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> >  
> >  #ifdef CONFIG_X86_64
> >         /*
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 9b701cfbef22..a80f3f609b37 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> >   * arch-specific page fault handlers.
> >   */
> >  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> > -                            FAULT_FLAG_KILLABLE | \
> > -                            FAULT_FLAG_INTERRUPTIBLE)
> > +                            FAULT_FLAG_KILLABLE)
> > ===8<===
> > 
> > That also kind of matches with what we do with fault_signal_pending().
> > Would it make sense?
> 
> I don't think doing a non-bounded non-interruptible sleep for a
> condition that may never resolve (eg userfaultfd never fills the fault)
> is a good idea. What happens if the condition never becomes true? You

If page fault is never going to be resolved, normally we sigkill the
program as it can't move any further with no way to resolve the page fault.

But yeah that's based on the fact sigkill will work first..

> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> sleep should only be used if it's a bounded wait.
> 
> For example, if I ran my previous write(2) reproducer here and the task
> got killed or exited before the userfaultfd fills the fault, then you'd
> have the task stuck in 'D' forever. Can't be killed, can't get
> reclaimed.
> 
> In other words, this won't work.

.. Would you help explain why it didn't work even for SIGKILL?  Above will
still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
regardless.

For such kernel user page access, IIUC it should respond to SIGKILL in
handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
time -> going kernel fixups. Then the upper stack should get -EFAULT in the
exception fixup path.

I could have missed something..

Thanks,

-- 
Peter Xu