Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 21, 2025 at 12:06:42AM +0200, Klara Modin wrote:
> Hi,
> 
> On 2025-08-18 17:36:53 -0700, Boris Burkov wrote:
> > Btrfs currently tracks its metadata pages in the page cache, using a
> > fake inode (fs_info->btree_inode) with offsets corresponding to where
> > the metadata is stored in the filesystem's full logical address space.
> > 
> > A consequence of this is that when btrfs uses filemap_add_folio(), this
> > usage is charged to the cgroup of whichever task happens to be running
> > at the time. These folios don't belong to any particular user cgroup, so
> > I don't think it makes much sense for them to be charged in that way.
> > Some negative consequences as a result:
> > - A task can be holding some important btrfs locks, then need to lookup
> >   some metadata and go into reclaim, extending the duration it holds
> >   that lock for, and unfairly pushing its own reclaim pain onto other
> >   cgroups.
> > - If that cgroup goes into reclaim, it might reclaim these folios a
> >   different non-reclaiming cgroup might need soon. This is naturally
> >   offset by LRU reclaim, but still.
> > 
> > A very similar proposal to use the root cgroup was previously made by
> > Qu, where he eventually proposed the idea of setting it per
> > address_space. This makes good sense for the btrfs use case, as the
> > uncharged behavior should apply to all use of the address_space, not
> > select allocations. I.e., if someone adds another filemap_add_folio()
> > call using btrfs's btree_inode, we would almost certainly want the
> > uncharged behavior.
> > 
> > Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@xxxxxxxx/
> > Suggested-by: Qu Wenruo <wqu@xxxxxxxx>
> > Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > Tested-by: syzbot@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Signed-off-by: Boris Burkov <boris@xxxxxx>
> 
> I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set
> AS_UNCHARGED on the btree_inode") in mm-new but I believe it's a result of
> this patch:
> 
>  Oops [#1]
>  CPU: 4 UID: 0 PID: 87 Comm: kswapd0 Not tainted 6.17.0-rc2-next-20250820-00349-gd6ecef4f9566 #511 PREEMPTLAZY
>  Hardware name: Banana Pi BPI-F3 (DT)
>  epc : workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
>  ra : __remove_mapping (mm/vmscan.c:805) 
>  epc : ffffffff802e6de8 ra : ffffffff802b4114 sp : ffffffc6006c3670
>   gp : ffffffff8227dad8 tp : ffffffd701a2cb00 t0 : ffffffff80027d00
>   t1 : 0000000000000000 t2 : 0000000000000001 s0 : ffffffc6006c3680
>   s1 : ffffffc50415a540 a0 : 0000000000000001 a1 : ffffffd700b70048
>   a2 : 0000000000000000 a3 : 0000000000000000 a4 : 00000000000003f0
>   a5 : ffffffd700b70430 a6 : 0000000000000000 a7 : ffffffd77ffd1dc0
>   s2 : ffffffd705a483d8 s3 : ffffffd705a483e0 s4 : 0000000000000001
>   s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000001
>   s8 : ffffffd705a483d8 s9 : ffffffc6006c3760 s10: ffffffc50415a548
>   s11: ffffffff81e000e0 t3 : 0000000000000000 t4 : 0000000000000001
>   t5 : 0000000000000003 t6 : 0000000000000003
>  status: 0000000200000100 badaddr: 00000000000000d0 cause: 000000000000000d
>  workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
>  __remove_mapping (mm/vmscan.c:805) 
>  shrink_folio_list (mm/vmscan.c:1545 (discriminator 2)) 
>  evict_folios (mm/vmscan.c:4738) 
>  try_to_shrink_lruvec (mm/vmscan.c:4901) 
>  shrink_one (mm/vmscan.c:4947) 
>  shrink_node (include/asm-generic/preempt.h:54 (discriminator 1) include/linux/rcupdate.h:93 (discriminator 1) include/linux/rcupdate.h:839 (discriminator 1) mm/vmscan.c:5010 (discriminator 1) mm/vmscan.c:5086 (discriminator 1) mm/vmscan.c:6073 (discriminator 1)) 
>  balance_pgdat (mm/vmscan.c:6942 mm/vmscan.c:7116) 
>  kswapd (mm/vmscan.c:7381) 
>  kthread (kernel/kthread.c:463) 
>  ret_from_fork_kernel (include/linux/entry-common.h:155 (discriminator 4) include/linux/entry-common.h:210 (discriminator 4) arch/riscv/kernel/process.c:216 (discriminator 4)) 
>  ret_from_fork_kernel_asm (arch/riscv/kernel/entry.S:328) 
>  Code: 0987 060a 6633 01c6 97ba b02f 01d7 0001 0013 0000 (5503) 0d08
>  All code
>  ========
>     0:	060a0987          	.insn	4, 0x060a0987
>     4:	01c66633          	or	a2,a2,t3
>     8:	97ba                	.insn	2, 0x97ba
>     a:	01d7b02f          	amoadd.d	zero,t4,(a5)
>     e:	0001                	.insn	2, 0x0001
>    10:	00000013          	addi	zero,zero,0
>    14:*	0d085503          	lhu	a0,208(a6)		<-- trapping instruction
>  
>  Code starting with the faulting instruction
>  ===========================================
>     0:	0d085503          	lhu	a0,208(a6)
>  ---[ end trace 0000000000000000 ]---
>  note: kswapd0[87] exited with irqs disabled
>  note: kswapd0[87] exited with preempt_count 2
> 
> > ---
> >  include/linux/pagemap.h |  1 +
> >  mm/filemap.c            | 12 ++++++++----
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index c9ba69e02e3e..06dc3fae8124 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -211,6 +211,7 @@ enum mapping_flags {
> >  				   folio contents */
> >  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> >  	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
> > +	AS_UNCHARGED = 10,	/* Do not charge usage to a cgroup */
> >  	/* Bits 16-25 are used for FOLIO_ORDER */
> >  	AS_FOLIO_ORDER_BITS = 5,
> >  	AS_FOLIO_ORDER_MIN = 16,
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index e4a5a46db89b..5004a2cfa0cc 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> >  {
> >  	void *shadow = NULL;
> >  	int ret;
> > +	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
> >  
> > -	ret = mem_cgroup_charge(folio, NULL, gfp);
> > -	if (ret)
> > -		return ret;
> > +	if (charge_mem_cgroup) {
> > +		ret = mem_cgroup_charge(folio, NULL, gfp);
> > +		if (ret)
> > +			return ret;
> > +	}
> >  
> >  	__folio_set_locked(folio);
> >  	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
> >  	if (unlikely(ret)) {
> > -		mem_cgroup_uncharge(folio);
> > +		if (charge_mem_cgroup)
> > +			mem_cgroup_uncharge(folio);
> >  		__folio_clear_locked(folio);
> >  	} else {
> >  		/*
> > -- 
> > 2.50.1
> > 
> 
> This means that not all folios will have a memcg attached also when
> memcg is enabled. In lru_gen_eviction() mem_cgroup_id() is called
> without a NULL check which then leads to the null-dereference.
> 
> The following diff resolves the issue for me:
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fae105a9cb46..c70e789201fc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  
>  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  {
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() || !memcg)
>  		return 0;
>  
>  	return memcg->id.id;
> 
> However, it's mentioned in folio_memcg() that it can return NULL so this
> might be an existing bug which this patch just makes more obvious.
> 
> There's also workingset_eviction() which instead gets the memcg from
> lruvec. Doing that in lru_gen_eviction() also resolves the issue for me:
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 68a76a91111f..e805eadf0ec7 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
>  	int tier = lru_tier_from_refs(refs, workingset);
>  	struct mem_cgroup *memcg = folio_memcg(folio);
>  	struct pglist_data *pgdat = folio_pgdat(folio);
> +	int memcgid;
>  
>  	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
>  
> @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
>  	hist = lru_hist_from_seq(min_seq);
>  	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
>  
> -	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
> +	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> +
> +	return pack_shadow(memcgid, pgdat, token, workingset);
>  }
>  
>  /*
> 
> I don't really know what I'm doing here, though.

Me neither, clearly :)

Thanks so much for the report and fix! I fear there might be some other
paths that try to get memcg from lruvec or folio or whatever without
checking it. I feel like in this exact case, I would want to go to the
first sign of trouble and fix it at lruvec_memcg(). But then who knows
what else we've missed.

May I ask what you were running to trigger this? My fstests run (clearly
not exercising enough interesting memory paths) did not hit it.

This does make me wonder if the superior approach to the original patch
isn't just to go back to the very first thing Qu did and account these
to the root cgroup rather than do the whole uncharged thing.

Boris

> 
> Regards,
> Klara Modin




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux