Hi, On 2025-08-18 17:36:53 -0700, Boris Burkov wrote: > Btrfs currently tracks its metadata pages in the page cache, using a > fake inode (fs_info->btree_inode) with offsets corresponding to where > the metadata is stored in the filesystem's full logical address space. > > A consequence of this is that when btrfs uses filemap_add_folio(), this > usage is charged to the cgroup of whichever task happens to be running > at the time. These folios don't belong to any particular user cgroup, so > I don't think it makes much sense for them to be charged in that way. > Some negative consequences as a result: > - A task can be holding some important btrfs locks, then need to lookup > some metadata and go into reclaim, extending the duration it holds > that lock for, and unfairly pushing its own reclaim pain onto other > cgroups. > - If that cgroup goes into reclaim, it might reclaim these folios a > different non-reclaiming cgroup might need soon. This is naturally > offset by LRU reclaim, but still. > > A very similar proposal to use the root cgroup was previously made by > Qu, where he eventually proposed the idea of setting it per > address_space. This makes good sense for the btrfs use case, as the > uncharged behavior should apply to all use of the address_space, not > select allocations. I.e., if someone adds another filemap_add_folio() > call using btrfs's btree_inode, we would almost certainly want the > uncharged behavior. > > Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@xxxxxxxx/ > Suggested-by: Qu Wenruo <wqu@xxxxxxxx> > Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx> > Tested-by: syzbot@xxxxxxxxxxxxxxxxxxxxxxxxx > Signed-off-by: Boris Burkov <boris@xxxxxx> I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set AS_UNCHARGED on the btree_inode") in mm-new but I believe it's a result of this patch: Oops [#1] CPU: 4 UID: 0 PID: 87 Comm: kswapd0 Not tainted 6.17.0-rc2-next-20250820-00349-gd6ecef4f9566 #511 PREEMPTLAZY Hardware name: Banana Pi BPI-F3 (DT) epc : workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) ra : __remove_mapping (mm/vmscan.c:805) epc : ffffffff802e6de8 ra : ffffffff802b4114 sp : ffffffc6006c3670 gp : ffffffff8227dad8 tp : ffffffd701a2cb00 t0 : ffffffff80027d00 t1 : 0000000000000000 t2 : 0000000000000001 s0 : ffffffc6006c3680 s1 : ffffffc50415a540 a0 : 0000000000000001 a1 : ffffffd700b70048 a2 : 0000000000000000 a3 : 0000000000000000 a4 : 00000000000003f0 a5 : ffffffd700b70430 a6 : 0000000000000000 a7 : ffffffd77ffd1dc0 s2 : ffffffd705a483d8 s3 : ffffffd705a483e0 s4 : 0000000000000001 s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000001 s8 : ffffffd705a483d8 s9 : ffffffc6006c3760 s10: ffffffc50415a548 s11: ffffffff81e000e0 t3 : 0000000000000000 t4 : 0000000000000001 t5 : 0000000000000003 t6 : 0000000000000003 status: 0000000200000100 badaddr: 00000000000000d0 cause: 000000000000000d workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) __remove_mapping (mm/vmscan.c:805) shrink_folio_list (mm/vmscan.c:1545 (discriminator 2)) evict_folios (mm/vmscan.c:4738) try_to_shrink_lruvec (mm/vmscan.c:4901) shrink_one (mm/vmscan.c:4947) shrink_node (include/asm-generic/preempt.h:54 (discriminator 1) include/linux/rcupdate.h:93 (discriminator 1) include/linux/rcupdate.h:839 (discriminator 1) mm/vmscan.c:5010 (discriminator 1) mm/vmscan.c:5086 (discriminator 1) mm/vmscan.c:6073 (discriminator 1)) balance_pgdat (mm/vmscan.c:6942 mm/vmscan.c:7116) kswapd (mm/vmscan.c:7381) kthread (kernel/kthread.c:463) ret_from_fork_kernel (include/linux/entry-common.h:155 (discriminator 4) include/linux/entry-common.h:210 (discriminator 4) arch/riscv/kernel/process.c:216 (discriminator 4)) ret_from_fork_kernel_asm (arch/riscv/kernel/entry.S:328) Code: 0987 060a 6633 01c6 97ba b02f 01d7 0001 0013 0000 (5503) 0d08 All code ======== 0: 060a0987 .insn 4, 0x060a0987 4: 01c66633 or a2,a2,t3 8: 97ba .insn 2, 0x97ba a: 01d7b02f amoadd.d zero,t4,(a5) e: 0001 .insn 2, 0x0001 10: 00000013 addi zero,zero,0 14:* 0d085503 lhu a0,208(a6) <-- trapping instruction Code starting with the faulting instruction =========================================== 0: 0d085503 lhu a0,208(a6) ---[ end trace 0000000000000000 ]--- note: kswapd0[87] exited with irqs disabled note: kswapd0[87] exited with preempt_count 2 > --- > include/linux/pagemap.h | 1 + > mm/filemap.c | 12 ++++++++---- > 2 files changed, 9 insertions(+), 4 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index c9ba69e02e3e..06dc3fae8124 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -211,6 +211,7 @@ enum mapping_flags { > folio contents */ > AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */ > AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9, > + AS_UNCHARGED = 10, /* Do not charge usage to a cgroup */ > /* Bits 16-25 are used for FOLIO_ORDER */ > AS_FOLIO_ORDER_BITS = 5, > AS_FOLIO_ORDER_MIN = 16, > diff --git a/mm/filemap.c b/mm/filemap.c > index e4a5a46db89b..5004a2cfa0cc 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, > { > void *shadow = NULL; > int ret; > + bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags); > > - ret = mem_cgroup_charge(folio, NULL, gfp); > - if (ret) > - return ret; > + if (charge_mem_cgroup) { > + ret = mem_cgroup_charge(folio, NULL, gfp); > + if (ret) > + return ret; > + } > > __folio_set_locked(folio); > ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); > if (unlikely(ret)) { > - mem_cgroup_uncharge(folio); > + if (charge_mem_cgroup) > + mem_cgroup_uncharge(folio); > __folio_clear_locked(folio); > } else { > /* > -- > 2.50.1 > This means that not all folios will have a memcg attached also when memcg is enabled. In lru_gen_eviction() mem_cgroup_id() is called without a NULL check which then leads to the null-dereference. The following diff resolves the issue for me: diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fae105a9cb46..c70e789201fc 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) { - if (mem_cgroup_disabled()) + if (mem_cgroup_disabled() || !memcg) return 0; return memcg->id.id; However, it's mentioned in folio_memcg() that it can return NULL so this might be an existing bug which this patch just makes more obvious. There's also workingset_eviction() which instead gets the memcg from lruvec. Doing that in lru_gen_eviction() also resolves the issue for me: diff --git a/mm/workingset.c b/mm/workingset.c index 68a76a91111f..e805eadf0ec7 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio) int tier = lru_tier_from_refs(refs, workingset); struct mem_cgroup *memcg = folio_memcg(folio); struct pglist_data *pgdat = folio_pgdat(folio); + int memcgid; BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio) hist = lru_hist_from_seq(min_seq); atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); - return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset); + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); + + return pack_shadow(memcgid, pgdat, token, workingset); } /* I don't really know what I'm doing here, though. Regards, Klara Modin