On Thu, Apr 03, 2025 at 11:25:17AM +0200, Bernd Schubert wrote: > > > On 4/3/25 11:18, David Hildenbrand wrote: > > On 03.04.25 05:31, Jingbo Xu wrote: > >> > >> > >> On 4/3/25 5:34 AM, Joanne Koong wrote: > >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@xxxxxxxxxx> > >>> wrote: > >>>> > >>>> On 23.11.24 00:23, Joanne Koong wrote: > >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the > >>>>> folio if > >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag > >>>>> set on its > >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the > >>>>> mapping, the > >>>>> writeback may take an indeterminate amount of time to complete, and > >>>>> waits may get stuck. > >>>>> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx> > >>>>> --- > >>>>> mm/migrate.c | 5 ++++- > >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/mm/migrate.c b/mm/migrate.c > >>>>> index df91248755e4..fe73284e5246 100644 > >>>>> --- a/mm/migrate.c > >>>>> +++ b/mm/migrate.c > >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t > >>>>> get_new_folio, > >>>>> */ > >>>>> switch (mode) { > >>>>> case MIGRATE_SYNC: > >>>>> - break; > >>>>> + if (!src->mapping || > >>>>> + !mapping_writeback_indeterminate(src- > >>>>> >mapping)) > >>>>> + break; > >>>>> + fallthrough; > >>>>> default: > >>>>> rc = -EBUSY; > >>>>> goto out; > >>>> > >>>> Ehm, doesn't this mean that any fuse user can essentially completely > >>>> block CMA allocations, memory compaction, memory hotunplug, memory > >>>> poisoning... ?! > >>>> > >>>> That sounds very bad. > >>> > >>> I took a closer look at the migration code and the FUSE code. In the > >>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC > >>> mode folio lock holds will block migration until that folio is > >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: > >>> > >>> if (!folio_trylock(src)) { > >>> if (mode == MIGRATE_ASYNC) > >>> goto out; > >>> > >>> if (current->flags & PF_MEMALLOC) > >>> goto out; > >>> > >>> if (mode == MIGRATE_SYNC_LIGHT && ! > >>> folio_test_uptodate(src)) > >>> goto out; > >>> > >>> folio_lock(src); > >>> } > >>> > > > > Right, I raised that also in my LSF/MM talk: waiting for readahead > > currently implies waiting for the folio lock (there is no separate > > readahead flag like there would be for writeback). > > > > The more I look into this and fuse, the more I realize that what fuse > > does is just completely broken right now. > > > >>> If this is all that is needed for a malicious FUSE server to block > >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE > >>> mappings are skipped in migration. A malicious server has easier and > >>> more powerful ways of blocking migration in FUSE than trying to do it > >>> through writeback. For a malicious fuse server, we in fact wouldn't > >>> even get far enough to hit writeback - a write triggers > >>> aops->write_begin() and a malicious server would deliberately hang > >>> forever while the folio is locked in write_begin(). > >> > >> Indeed it seems possible. A malicious FUSE server may already be > >> capable of blocking the synchronous migration in this way. > > > > Yes, I think the conclusion is that we should advise people from not > > using unprivileged FUSE if they care about any features that rely on > > page migration or page reclaim. > > > >> > >> > >>> > >>> I looked into whether we could eradicate all the places in FUSE where > >>> we may hold the folio lock for an indeterminate amount of time, > >>> because if that is possible, then we should not add this writeback way > >>> for a malicious fuse server to affect migration. But I don't think we > >>> can, for example taking one case, the folio lock needs to be held as > >>> we read in the folio from the server when servicing page faults, else > >>> the page cache would contain stale data if there was a concurrent > >>> write that happened just before, which would lead to data corruption > >>> in the filesystem. Imo, we need a more encompassing solution for all > >>> these cases if we're serious about preventing FUSE from blocking > >>> migration, which probably looks like a globally enforced default > >>> timeout of some sort or an mm solution for mitigating the blast radius > >>> of how much memory can be blocked from migration, but that is outside > >>> the scope of this patchset and is its own standalone topic. > > > > I'm still skeptical about timeouts: we can only get it wrong. > > > > I think a proper solution is making these pages movable, which does seem > > feasible if (a) splice is not involved and (b) we can find a way to not > > hold the folio lock forever e.g., in the readahead case. > > > > Maybe readahead would have to be handled more similar to writeback > > (e.g., having a separate flag, or using a combination of e.g., > > writeback+uptodate flag, not sure) > > > > In both cases (readahead+writeback), we'd want to call into the FS to > > migrate a folio that is under readahread/writeback. In case of fuse > > without splice, a migration might be doable, and as discussed, splice > > might just be avoided. > > My personal take is here that we should move away from splice. > Keith (or colleague) is working on ZC with io-uring anyway, so > maybe a good timing. We should just ensure that the new approach > doesn't have the same issue. splice is problematic in a lot of other ways too. It's easy to abuse it for weird userspace hangs since it clings onto the pipe_lock() and no one wants to do the invasive surgery to wean it off of that. So +1 on avoiding splice.