Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 3 Apr 2025 11:35:23 +0200

On Thu, Apr 03, 2025 at 11:25:17AM +0200, Bernd Schubert wrote:
> 
> 
> On 4/3/25 11:18, David Hildenbrand wrote:
> > On 03.04.25 05:31, Jingbo Xu wrote:
> >>
> >>
> >> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@xxxxxxxxxx>
> >>> wrote:
> >>>>
> >>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the
> >>>>> folio if
> >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag
> >>>>> set on its
> >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the
> >>>>> mapping, the
> >>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>> waits may get stuck.
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> >>>>> ---
> >>>>>    mm/migrate.c | 5 ++++-
> >>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>> index df91248755e4..fe73284e5246 100644
> >>>>> --- a/mm/migrate.c
> >>>>> +++ b/mm/migrate.c
> >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t
> >>>>> get_new_folio,
> >>>>>                 */
> >>>>>                switch (mode) {
> >>>>>                case MIGRATE_SYNC:
> >>>>> -                     break;
> >>>>> +                     if (!src->mapping ||
> >>>>> +                         !mapping_writeback_indeterminate(src-
> >>>>> >mapping))
> >>>>> +                             break;
> >>>>> +                     fallthrough;
> >>>>>                default:
> >>>>>                        rc = -EBUSY;
> >>>>>                        goto out;
> >>>>
> >>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>> poisoning... ?!
> >>>>
> >>>> That sounds very bad.
> >>>
> >>> I took a closer look at the migration code and the FUSE code. In the
> >>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC
> >>> mode folio lock holds will block migration until that folio is
> >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>
> >>>          if (!folio_trylock(src)) {
> >>>                  if (mode == MIGRATE_ASYNC)
> >>>                          goto out;
> >>>
> >>>                  if (current->flags & PF_MEMALLOC)
> >>>                          goto out;
> >>>
> >>>                  if (mode == MIGRATE_SYNC_LIGHT && !
> >>> folio_test_uptodate(src))
> >>>                          goto out;
> >>>
> >>>                  folio_lock(src);
> >>>          }
> >>>
> > 
> > Right, I raised that also in my LSF/MM talk: waiting for readahead
> > currently implies waiting for the folio lock (there is no separate
> > readahead flag like there would be for writeback).
> > 
> > The more I look into this and fuse, the more I realize that what fuse
> > does is just completely broken right now.
> > 
> >>> If this is all that is needed for a malicious FUSE server to block
> >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>> mappings are skipped in migration. A malicious server has easier and
> >>> more powerful ways of blocking migration in FUSE than trying to do it
> >>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>> even get far enough to hit writeback - a write triggers
> >>> aops->write_begin() and a malicious server would deliberately hang
> >>> forever while the folio is locked in write_begin().
> >>
> >> Indeed it seems possible.  A malicious FUSE server may already be
> >> capable of blocking the synchronous migration in this way.
> > 
> > Yes, I think the conclusion is that we should advise people from not
> > using unprivileged FUSE if they care about any features that rely on
> > page migration or page reclaim.
> > 
> >>
> >>
> >>>
> >>> I looked into whether we could eradicate all the places in FUSE where
> >>> we may hold the folio lock for an indeterminate amount of time,
> >>> because if that is possible, then we should not add this writeback way
> >>> for a malicious fuse server to affect migration. But I don't think we
> >>> can, for example taking one case, the folio lock needs to be held as
> >>> we read in the folio from the server when servicing page faults, else
> >>> the page cache would contain stale data if there was a concurrent
> >>> write that happened just before, which would lead to data corruption
> >>> in the filesystem. Imo, we need a more encompassing solution for all
> >>> these cases if we're serious about preventing FUSE from blocking
> >>> migration, which probably looks like a globally enforced default
> >>> timeout of some sort or an mm solution for mitigating the blast radius
> >>> of how much memory can be blocked from migration, but that is outside
> >>> the scope of this patchset and is its own standalone topic.
> > 
> > I'm still skeptical about timeouts: we can only get it wrong.
> > 
> > I think a proper solution is making these pages movable, which does seem
> > feasible if (a) splice is not involved and (b) we can find a way to not
> > hold the folio lock forever e.g., in the readahead case.
> > 
> > Maybe readahead would have to be handled more similar to writeback
> > (e.g., having a separate flag, or using a combination of e.g.,
> > writeback+uptodate flag, not sure)
> > 
> > In both cases (readahead+writeback), we'd want to call into the FS to
> > migrate a folio that is under readahread/writeback. In case of fuse
> > without splice, a migration might be doable, and as discussed, splice
> > might just be avoided.
> 
> My personal take is here that we should move away from splice.
> Keith (or colleague) is working on ZC with io-uring anyway, so
> maybe a good timing. We should just ensure that the new approach
> doesn't have the same issue.

splice is problematic in a lot of other ways too. It's easy to abuse it
for weird userspace hangs since it clings onto the pipe_lock() and no
one wants to do the invasive surgery to wean it off of that. So +1 on
avoiding splice.