Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

Joanne Koong <joannelkoong@xxxxxxxxx> · Thu, 3 Apr 2025 15:04:19 -0700

On Thu, Apr 3, 2025 at 1:44 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 03.04.25 21:09, Joanne Koong wrote:
> > On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 03.04.25 05:31, Jingbo Xu wrote:
> >>>
> >>>
> >>> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>>> waits may get stuck.
> >>>>>>
> >>>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> >>>>>> ---
> >>>>>>     mm/migrate.c | 5 ++++-
> >>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>

> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>>> poisoning... ?!
> >>>>>
> >>>>> That sounds very bad.
> >>>>
> >>>> I took a closer look at the migration code and the FUSE code. In the
> >>>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC
> >>>> mode folio lock holds will block migration until that folio is
> >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>>
> >>>>           if (!folio_trylock(src)) {
> >>>>                   if (mode == MIGRATE_ASYNC)
> >>>>                           goto out;
> >>>>
> >>>>                   if (current->flags & PF_MEMALLOC)
> >>>>                           goto out;
> >>>>
> >>>>                   if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>>                           goto out;
> >>>>
> >>>>                   folio_lock(src);
> >>>>           }
> >>>>
> >>
> >> Right, I raised that also in my LSF/MM talk: waiting for readahead
> >> currently implies waiting for the folio lock (there is no separate
> >> readahead flag like there would be for writeback).
> >>
> >> The more I look into this and fuse, the more I realize that what fuse
> >> does is just completely broken right now.
> >>
> >>>> If this is all that is needed for a malicious FUSE server to block
> >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>>> mappings are skipped in migration. A malicious server has easier and
> >>>> more powerful ways of blocking migration in FUSE than trying to do it
> >>>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>>> even get far enough to hit writeback - a write triggers
> >>>> aops->write_begin() and a malicious server would deliberately hang
> >>>> forever while the folio is locked in write_begin().
> >>>
> >>> Indeed it seems possible.  A malicious FUSE server may already be
> >>> capable of blocking the synchronous migration in this way.
> >>
> >> Yes, I think the conclusion is that we should advise people from not
> >> using unprivileged FUSE if they care about any features that rely on
> >> page migration or page reclaim.
> >>
> >>>
> >>>
> >>>>
> >>>> I looked into whether we could eradicate all the places in FUSE where
> >>>> we may hold the folio lock for an indeterminate amount of time,
> >>>> because if that is possible, then we should not add this writeback way
> >>>> for a malicious fuse server to affect migration. But I don't think we
> >>>> can, for example taking one case, the folio lock needs to be held as
> >>>> we read in the folio from the server when servicing page faults, else
> >>>> the page cache would contain stale data if there was a concurrent
> >>>> write that happened just before, which would lead to data corruption
> >>>> in the filesystem. Imo, we need a more encompassing solution for all
> >>>> these cases if we're serious about preventing FUSE from blocking
> >>>> migration, which probably looks like a globally enforced default
> >>>> timeout of some sort or an mm solution for mitigating the blast radius
> >>>> of how much memory can be blocked from migration, but that is outside
> >>>> the scope of this patchset and is its own standalone topic.
> >>
> >> I'm still skeptical about timeouts: we can only get it wrong.
> >>
> >> I think a proper solution is making these pages movable, which does seem
> >> feasible if (a) splice is not involved and (b) we can find a way to not
> >> hold the folio lock forever e.g., in the readahead case.
> >>
> >> Maybe readahead would have to be handled more similar to writeback
> >> (e.g., having a separate flag, or using a combination of e.g.,
> >> writeback+uptodate flag, not sure)
> >>
> >> In both cases (readahead+writeback), we'd want to call into the FS to
> >> migrate a folio that is under readahread/writeback. In case of fuse
> >> without splice, a migration might be doable, and as discussed, splice
> >> might just be avoided.
> >>
> >>>>
> >>>> I don't see how this patch has any additional negative impact on
> >>>> memory migration for the case of malicious servers that the server
> >>>> can't already (and more easily) do. In fact, this patchset if anything
> >>>> helps memory given that malicious servers now can't also trigger page
> >>>> allocations for temp pages that would never get freed.
> >>>>
> >>>
> >>> If that's true, maybe we could drop this patch out of this patchset? So
> >>> that both before and after this patchset, synchronous migration could be
> >>> blocked by a malicious FUSE server, while the usability of continuous
> >>> memory (CMA) won't be affected.
> >>
> >> I had exactly the same thought: if we can block forever on the folio
> >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> >> completely broken.
> >
> > I will resubmit this patchset and drop this patch.
> >
> > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
> > cgroupv1 reclaim scenarios:
> > a) sync: sync waits on writeback so if we don't skip waiting on
> > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
> > servers could make syncs hang. (There's no actual effect on sync
> > behavior though with temp pages because even without temp pages, we
> > return even though the data hasn't actually been synced to disk by the
> > server yet)
>
> Just curious: Are we sure there are no other cases where a malicious
> userspace could make some other folio_lock() hang forever either way?
>

Unfortunately, there's an awful case where kswapd may get blocked
waiting for the folio lock. We encountered this in prod last week from
a well-intentioned but incorrectly written FUSE server that got stuck.
The stack trace was:

  366 kswapd0 D
  folio_wait_bit_common.llvm.15141953522965195141
  truncate_inode_pages_range
  fuse_evict_inode
  evict
  _dentry_kill
  shrink_dentry_list
  prune_dcache_sb
  super_cache_scan
  do_shrink_slab
  shrink_slab
  kswapd
  kthread
  ret_from_fork
  ret_from_fork_asm

which was narrowed down to the  __filemap_get_folio(..., FGP_LOCK,
...)  call in truncate_inode_pages_range().

I'm working on a fix for this for kswapd and planning to also do a
broader audit for other places where we might get tripped up from fuse
forever holding a folio lock. I'm going to look more into the
long-term fuse fix too - the first step will be documenting all the
places currently where a lock may be forever held.

> IOW, just like for migration, isn't this just solving one part of the
> whole problem we are facing?

For sync, I didn't see any folio lock acquires anywhere but I just
noticed that fuse's .sync_fs() implementation will block until a
server replies, so yes a malicious server could still hold up sync
regardless of temp pages or not. I'll drop the sync patch too in v7.

Thanks,
Joanne

>
> >
> > b) cgroupv1 reclaim: a correctly written fuse server can fall into
> > this deadlock in one very specific scenario (eg  if it's using legacy
> > cgroupv1 and reclaim encounters a folio that already has the reclaim
> > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
> > set), where the deadlock is triggered by:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in direct reclaim
>
> Yes, that sounds reasonable.
>
> --
> Cheers,
>
> David / dhildenb
>