On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 03.04.25 05:31, Jingbo Xu wrote:
> >
> >
> > On 4/3/25 5:34 AM, Joanne Koong wrote:
> >> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on
> >>>> its mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping,
> >>>> the writeback may take an indeterminate amount of time to complete, and
> >>>> waits may get stuck.
> >>>>
> >>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>>> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> >>>> ---
> >>>>  mm/migrate.c | 5 ++++-
> >>>>  1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>> index df91248755e4..fe73284e5246 100644
> >>>> --- a/mm/migrate.c
> >>>> +++ b/mm/migrate.c
> >>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>  	 */
> >>>>  	switch (mode) {
> >>>>  	case MIGRATE_SYNC:
> >>>> -		break;
> >>>> +		if (!src->mapping ||
> >>>> +		    !mapping_writeback_indeterminate(src->mapping))
> >>>> +			break;
> >>>> +		fallthrough;
> >>>>  	default:
> >>>>  		rc = -EBUSY;
> >>>>  		goto out;
> >>>
> >>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>> poisoning... ?!
> >>>
> >>> That sounds very bad.
> >>
> >> I took a closer look at the migration code and the FUSE code. In the
> >> migration code in migrate_folio_unmap(), I see that any folio lock held
> >> in MIGRATE_SYNC mode will block migration until that folio is unlocked.
> >> This is the snippet in migrate_folio_unmap() I'm looking at:
> >>
> >>         if (!folio_trylock(src)) {
> >>                 if (mode == MIGRATE_ASYNC)
> >>                         goto out;
> >>
> >>                 if (current->flags & PF_MEMALLOC)
> >>                         goto out;
> >>
> >>                 if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>                         goto out;
> >>
> >>                 folio_lock(src);
> >>         }
> >>
>
> Right, I raised that also in my LSF/MM talk: waiting for readahead
> currently implies waiting for the folio lock (there is no separate
> readahead flag like there would be for writeback).
>
> The more I look into this and fuse, the more I realize that what fuse
> does is just completely broken right now.
>
> >> If this is all that is needed for a malicious FUSE server to block
> >> migration, then it makes no difference whether AS_WRITEBACK_INDETERMINATE
> >> mappings are skipped in migration. A malicious server has easier and
> >> more powerful ways of blocking migration in FUSE than trying to do it
> >> through writeback. For a malicious FUSE server, we in fact wouldn't
> >> even get far enough to hit writeback - a write triggers
> >> aops->write_begin() and a malicious server would deliberately hang
> >> forever while the folio is locked in write_begin().
> >
> > Indeed it seems possible. A malicious FUSE server may already be
> > capable of blocking synchronous migration in this way.
>
> Yes, I think the conclusion is that we should advise people against
> using unprivileged FUSE if they care about any features that rely on
> page migration or page reclaim.
>
> >
> >> I looked into whether we could eradicate all the places in FUSE where
> >> we may hold the folio lock for an indeterminate amount of time,
> >> because if that is possible, then we should not add this writeback way
> >> for a malicious FUSE server to affect migration.
> >> But I don't think we can. Taking one case as an example: the folio lock
> >> needs to be held as we read in the folio from the server when servicing
> >> page faults, else the page cache would contain stale data if there was
> >> a concurrent write that happened just before, which would lead to data
> >> corruption in the filesystem. Imo, we need a more encompassing solution
> >> for all these cases if we're serious about preventing FUSE from blocking
> >> migration, which probably looks like a globally enforced default timeout
> >> of some sort or an mm solution for mitigating the blast radius of how
> >> much memory can be blocked from migration, but that is outside the scope
> >> of this patchset and is its own standalone topic.
>
> I'm still skeptical about timeouts: we can only get it wrong.
>
> I think a proper solution is making these pages movable, which does seem
> feasible if (a) splice is not involved and (b) we can find a way to not
> hold the folio lock forever, e.g., in the readahead case.
>
> Maybe readahead would have to be handled more similarly to writeback
> (e.g., having a separate flag, or using a combination of, e.g., the
> writeback+uptodate flags, not sure).
>
> In both cases (readahead+writeback), we'd want to call into the FS to
> migrate a folio that is under readahead/writeback. In the case of fuse
> without splice, a migration might be doable, and as discussed, splice
> might just be avoided.
>
> >>
> >> I don't see how this patch has any additional negative impact on memory
> >> migration from malicious servers beyond what a server can already (and
> >> more easily) do. In fact, this patchset, if anything, helps memory,
> >> given that malicious servers now can't also trigger page allocations
> >> for temp pages that would never get freed.
> >
> > If that's true, maybe we could drop this patch out of this patchset?
> > So that both before and after this patchset, synchronous migration
> > could be blocked by a malicious FUSE server, while the usability of
> > contiguous memory (CMA) won't be affected.
>
> I had exactly the same thought: if we can block forever on the folio
> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> completely broken.

I will resubmit this patchset and drop this patch.

I think we still need AS_WRITEBACK_INDETERMINATE for the sync and legacy
cgroup v1 reclaim scenarios:

a) sync: sync waits on writeback, so if we don't skip waiting on
writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious FUSE
servers could make syncs hang. (There's no actual effect on sync behavior
compared with temp pages, though: even without temp pages, we return even
though the data hasn't actually been synced to disk by the server yet.)

b) cgroup v1 reclaim: a correctly written FUSE server can fall into this
deadlock in one very specific scenario (e.g. if it's using legacy
cgroup v1 and reclaim encounters a folio that already has the reclaim
flag set, and the caller didn't have __GFP_FS (or __GFP_IO, if swap)
set), where the deadlock is triggered by:

* a single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* the memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in direct
  reclaim

Thanks for the feedback and discussion, everyone.

>
> --
> Cheers,
>
> David / dhildenb
>