Hi,

On 2025-08-20 15:52:22 -0700, Boris Burkov wrote:
> On Thu, Aug 21, 2025 at 12:06:42AM +0200, Klara Modin wrote:
> > Hi,
> >
> > On 2025-08-18 17:36:53 -0700, Boris Burkov wrote:
> > > Btrfs currently tracks its metadata pages in the page cache, using a
> > > fake inode (fs_info->btree_inode) with offsets corresponding to where
> > > the metadata is stored in the filesystem's full logical address space.
> > >
> > > A consequence of this is that when btrfs uses filemap_add_folio(),
> > > this usage is charged to the cgroup of whichever task happens to be
> > > running at the time. These folios don't belong to any particular user
> > > cgroup, so I don't think it makes much sense for them to be charged
> > > in that way. Some negative consequences as a result:
> > > - A task can be holding some important btrfs locks, then need to look
> > >   up some metadata and go into reclaim, extending the duration it
> > >   holds that lock for, and unfairly pushing its own reclaim pain onto
> > >   other cgroups.
> > > - If that cgroup goes into reclaim, it might reclaim these folios
> > >   that a different, non-reclaiming cgroup might need soon. This is
> > >   naturally offset by LRU reclaim, but still.
> > >
> > > A very similar proposal to use the root cgroup was previously made by
> > > Qu, where he eventually proposed the idea of setting it per
> > > address_space. This makes good sense for the btrfs use case, as the
> > > uncharged behavior should apply to all use of the address_space, not
> > > just select allocations. I.e., if someone adds another
> > > filemap_add_folio() call using btrfs's btree_inode, we would almost
> > > certainly want the uncharged behavior.
> > >
> > > Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@xxxxxxxx/
> > > Suggested-by: Qu Wenruo <wqu@xxxxxxxx>
> > > Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > > Tested-by: syzbot@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Signed-off-by: Boris Burkov <boris@xxxxxx>
> >
> > I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set
> > AS_UNCHARGED on the btree_inode") in mm-new, but I believe it's a
> > result of this patch:
> > ...
> >
> > This means that not all folios will have a memcg attached even when
> > memcg is enabled. In lru_gen_eviction(), mem_cgroup_id() is called
> > without a NULL check, which then leads to the null-dereference.
> >
> > The following diff resolves the issue for me:
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index fae105a9cb46..c70e789201fc 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
> >  
> >  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
> >  {
> > -	if (mem_cgroup_disabled())
> > +	if (mem_cgroup_disabled() || !memcg)
> >  		return 0;
> >  
> >  	return memcg->id.id;
> >
> > However, it's mentioned in folio_memcg() that it can return NULL, so
> > this might be an existing bug which this patch just makes more
> > obvious.
> >
> > There's also workingset_eviction(), which instead gets the memcg from
> > the lruvec.
> > Doing that in lru_gen_eviction() also resolves the issue for me:
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 68a76a91111f..e805eadf0ec7 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
> >  	int tier = lru_tier_from_refs(refs, workingset);
> >  	struct mem_cgroup *memcg = folio_memcg(folio);
> >  	struct pglist_data *pgdat = folio_pgdat(folio);
> > +	int memcgid;
> >  
> >  	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
> >  
> > @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
> >  	hist = lru_hist_from_seq(min_seq);
> >  	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
> >  
> > -	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
> > +	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> > +
> > +	return pack_shadow(memcgid, pgdat, token, workingset);
> >  }
> >  
> >  /*
> >
> > I don't really know what I'm doing here, though.
> 
> Me neither, clearly :)
> 
> Thanks so much for the report and fix! I fear there might be some other
> paths that try to get the memcg from the lruvec or folio or whatever
> without checking it. I feel like in this exact case, I would want to go
> to the first sign of trouble and fix it at lruvec_memcg(). But then who
> knows what else we've missed.
> 
> May I ask what you were running to trigger this? My fstests run
> (clearly not exercising enough interesting memory paths) did not hit
> it.
> 
> This does make me wonder if the superior approach to the original patch
> isn't just to go back to the very first thing Qu did and account these
> to the root cgroup rather than do the whole uncharged thing.
> 
> Boris
> 
> > Regards,
> > Klara Modin

For me it's easiest to trigger when cloning a large repository, e.g. the
kernel or gcc, with a low-ish amount of RAM (maybe 1-4 GiB), so under
memory pressure.
Also:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y

Shakeel: I think I'll wait a little before submitting a patch to see if
there are any more comments.

Regards,
Klara Modin