On Thu, Apr 3, 2025 at 1:44 PM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 03.04.25 21:09, Joanne Koong wrote: > > On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > >> > >> On 03.04.25 05:31, Jingbo Xu wrote: > >>> > >>> > >>> On 4/3/25 5:34 AM, Joanne Koong wrote: > >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > >>>>> > >>>>> On 23.11.24 00:23, Joanne Koong wrote: > >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > >>>>>> writeback may take an indeterminate amount of time to complete, and > >>>>>> waits may get stuck. > >>>>>> > >>>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx> > >>>>>> --- > >>>>>> mm/migrate.c | 5 ++++- > >>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>>>> > >>>>> Ehm, doesn't this mean that any fuse user can essentially completely > >>>>> block CMA allocations, memory compaction, memory hotunplug, memory > >>>>> poisoning... ?! > >>>>> > >>>>> That sounds very bad. > >>>> > >>>> I took a closer look at the migration code and the FUSE code. In the > >>>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC > >>>> mode folio lock holds will block migration until that folio is > >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: > >>>> > >>>> if (!folio_trylock(src)) { > >>>> if (mode == MIGRATE_ASYNC) > >>>> goto out; > >>>> > >>>> if (current->flags & PF_MEMALLOC) > >>>> goto out; > >>>> > >>>> if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src)) > >>>> goto out; > >>>> > >>>> folio_lock(src); > >>>> } > >>>> > >> > >> Right, I raised that also in my LSF/MM talk: waiting for readahead > >> currently implies waiting for the folio lock (there is no separate > >> readahead flag like there would be for writeback). > >> > >> The more I look into this and fuse, the more I realize that what fuse > >> does is just completely broken right now. > >> > >>>> If this is all that is needed for a malicious FUSE server to block > >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE > >>>> mappings are skipped in migration. A malicious server has easier and > >>>> more powerful ways of blocking migration in FUSE than trying to do it > >>>> through writeback. For a malicious fuse server, we in fact wouldn't > >>>> even get far enough to hit writeback - a write triggers > >>>> aops->write_begin() and a malicious server would deliberately hang > >>>> forever while the folio is locked in write_begin(). > >>> > >>> Indeed it seems possible. A malicious FUSE server may already be > >>> capable of blocking the synchronous migration in this way. > >> > >> Yes, I think the conclusion is that we should advise people from not > >> using unprivileged FUSE if they care about any features that rely on > >> page migration or page reclaim. > >> > >>> > >>> > >>>> > >>>> I looked into whether we could eradicate all the places in FUSE where > >>>> we may hold the folio lock for an indeterminate amount of time, > >>>> because if that is possible, then we should not add this writeback way > >>>> for a malicious fuse server to affect migration. But I don't think we > >>>> can, for example taking one case, the folio lock needs to be held as > >>>> we read in the folio from the server when servicing page faults, else > >>>> the page cache would contain stale data if there was a concurrent > >>>> write that happened just before, which would lead to data corruption > >>>> in the filesystem. Imo, we need a more encompassing solution for all > >>>> these cases if we're serious about preventing FUSE from blocking > >>>> migration, which probably looks like a globally enforced default > >>>> timeout of some sort or an mm solution for mitigating the blast radius > >>>> of how much memory can be blocked from migration, but that is outside > >>>> the scope of this patchset and is its own standalone topic. > >> > >> I'm still skeptical about timeouts: we can only get it wrong. > >> > >> I think a proper solution is making these pages movable, which does seem > >> feasible if (a) splice is not involved and (b) we can find a way to not > >> hold the folio lock forever e.g., in the readahead case. > >> > >> Maybe readahead would have to be handled more similar to writeback > >> (e.g., having a separate flag, or using a combination of e.g., > >> writeback+uptodate flag, not sure) > >> > >> In both cases (readahead+writeback), we'd want to call into the FS to > >> migrate a folio that is under readahread/writeback. In case of fuse > >> without splice, a migration might be doable, and as discussed, splice > >> might just be avoided. > >> > >>>> > >>>> I don't see how this patch has any additional negative impact on > >>>> memory migration for the case of malicious servers that the server > >>>> can't already (and more easily) do. In fact, this patchset if anything > >>>> helps memory given that malicious servers now can't also trigger page > >>>> allocations for temp pages that would never get freed. > >>>> > >>> > >>> If that's true, maybe we could drop this patch out of this patchset? So > >>> that both before and after this patchset, synchronous migration could be > >>> blocked by a malicious FUSE server, while the usability of continuous > >>> memory (CMA) won't be affected. > >> > >> I had exactly the same thought: if we can block forever on the folio > >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all > >> completely broken. > > > > I will resubmit this patchset and drop this patch. > > > > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy > > cgroupv1 reclaim scenarios: > > a) sync: sync waits on writeback so if we don't skip waiting on > > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse > > servers could make syncs hang. (There's no actual effect on sync > > behavior though with temp pages because even without temp pages, we > > return even though the data hasn't actually been synced to disk by the > > server yet) > > Just curious: Are we sure there are no other cases where a malicious > userspace could make some other folio_lock() hang forever either way? > Unfortunately, there's an awful case where kswapd may get blocked waiting for the folio lock. We encountered this in prod last week from a well-intentioned but incorrectly written FUSE server that got stuck. The stack trace was: 366 kswapd0 D folio_wait_bit_common.llvm.15141953522965195141 truncate_inode_pages_range fuse_evict_inode evict _dentry_kill shrink_dentry_list prune_dcache_sb super_cache_scan do_shrink_slab shrink_slab kswapd kthread ret_from_fork ret_from_fork_asm which was narrowed down to the __filemap_get_folio(..., FGP_LOCK, ...) call in truncate_inode_pages_range(). I'm working on a fix for this for kswapd and planning to also do a broader audit for other places where we might get tripped up from fuse forever holding a folio lock. I'm going to look more into the long-term fuse fix too - the first step will be documenting all the places currently where a lock may be forever held. > IOW, just like for migration, isn't this just solving one part of the > whole problem we are facing? For sync, I didn't see any folio lock acquires anywhere but I just noticed that fuse's .sync_fs() implementation will block until a server replies, so yes a malicious server could still hold up sync regardless of temp pages or not. I'll drop the sync patch too in v7. Thanks, Joanne > > > > > b) cgroupv1 reclaim: a correctly written fuse server can fall into > > this deadlock in one very specific scenario (eg if it's using legacy > > cgroupv1 and reclaim encounters a folio that already has the reclaim > > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap) > > set), where the deadlock is triggered by: > > * single-threaded FUSE server is in the middle of handling a request > > that needs a memory allocation > > * memory allocation triggers direct reclaim > > * direct reclaim waits on a folio under writeback > > * the FUSE server can't write back the folio since it's stuck in direct reclaim > > Yes, that sounds reasonable. > > -- > Cheers, > > David / dhildenb >