On 4/24/25 8:03 AM, Johannes Weiner wrote: > On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote: >> userfaultfd may use interruptible sleeps to wait on userspace filling >> a page fault, which works fine if the task can be reliably put to >> sleeping waiting for that. However, if the task has a normal (ie >> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply >> cause schedule() to be a no-op. >> >> For a task that registers a page with userfaultfd and then proceeds >> to do a write from it, if that task also has a signal pending then >> it'll essentially busy loop from do_page_fault() -> handle_userfault() >> until that fault has been filled. Normally it'd be expected that the >> task would sleep until that happens. Here's a trace from an application >> doing just that: >> >> handle_userfault+0x4b8/0xa00 (P) >> hugetlb_fault+0xe24/0x1060 >> handle_mm_fault+0x2bc/0x318 >> do_page_fault+0x1e8/0x6f0 > > Makes sense. There is a fault_signal_pending() check before retrying: > > static inline bool fault_signal_pending(vm_fault_t fault_flags, > struct pt_regs *regs) > { > return unlikely((fault_flags & VM_FAULT_RETRY) && > (fatal_signal_pending(current) || > (user_mode(regs) && signal_pending(current)))); > } > > Since it's an in-kernel fault, and the signal is non-fatal, it won't > stop looping until the fault is handled. > > This in itself seems a bit sketchy. You have to hope there is no > dependency between handling the signal -> handling the fault inside > the userspace components. Indeed... But that's generic userfaultfd sketchiness, not really related to this patch. > >> do_translation_fault+0x9c/0xd0 >> do_mem_abort+0x44/0xa0 >> el1_abort+0x3c/0x68 >> el1h_64_sync_handler+0xd4/0x100 >> el1h_64_sync+0x6c/0x70 >> fault_in_readable+0x74/0x108 (P) >> iomap_file_buffered_write+0x14c/0x438 >> blkdev_write_iter+0x1a8/0x340 >> vfs_write+0x20c/0x348 >> ksys_write+0x64/0x108 >> __arm64_sys_write+0x1c/0x38 >> >> where the task is looping with 100% CPU time in the above mentioned >> fault path. >> >> Since it's impossible to handle signals, or other conditions like >> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the >> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf >> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal >> signals will still be handled by the caller, and the timeout is short >> enough to hopefully not cause any issues. If this is the first invocation >> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode >> is used. >> >> Cc: stable@xxxxxxxxxxxxxxx >> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") > > When this patch was first introduced, VM_FAULT_RETRY would work only > once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared, > causing handle_userfault() to return VM_SIGBUS, which would bubble > through the fixup table (kernel fault), -EFAULT from > iomap_file_buffered_write() and unwind the kernel stack this way. > > So I'm thinking this is the more likely commit for Fixes: and stable: > > commit 4064b982706375025628094e51d11cf1a958a5d3 > Author: Peter Xu <peterx@xxxxxxxxxx> > Date: Wed Apr 1 21:08:45 2020 -0700 > > mm: allow VM_FAULT_RETRY for multiple times Thanks for checking that - yep that sounds fine to me, we can adjust the fixes tag appropriately. >> Reported-by: Zhiwei Jiang <qq282012236@xxxxxxxxx> >> Link: https://lore.kernel.org/io-uring/20250422162913.1242057-1-qq282012236@xxxxxxxxx/ >> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx> > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx> Thanks! -- Jens Axboe