On Mon, Apr 28, 2025 at 11:11:35AM -0700, Wengang Wang wrote:
> This patch introduces new fields to per-mount and global stats,
> and export them to user space.
>
> @page_alloc -- number of pages allocated from buddy to buffer cache
> @page_free -- number of pages freed to buddy from buffer cache
> @kbb_alloc -- number of BBs allocated from kmalloc slab to buffer cache
> @kbb_free -- number of BBs freed to kmalloc slab from buffer cache
> @vbb_alloc -- number of BBs allocated from vmalloc system to buffer cache
> @vbb_free -- number of BBs freed to vmalloc system from buffer cache

This forms a permanent user API once created, so exposing internal
implementation details like this doesn't make me feel good. We've
changed how we allocate memory for buffers quite a bit recently to do
things like support large folios and minimise vmap usage, then to use
vmalloc instead of vmap, etc. e.g. we don't use pages at all in the
buffer cache anymore.

I'm actually looking at further simplifying the implementation - I
think the custom folio/vmalloc stuff can be replaced entirely by a
single call to kvmalloc() now, which means some memory will come from
slabs, some from the buddy allocator and some from vmalloc. We won't
know where it comes from at all, and if this stats interface already
existed then such a change would render it completely useless.

> By looking at above stats fields, user space can easily know the buffer
> cache usage.

Not easily - the implementation only aggregates alloc/free values, so
the user has to manually do the (alloc - free) calculation to determine
how much memory is currently in use. And then we don't really know what
size buffers are actually using that memory...

i.e. buffers for everything other than xattrs are fixed sizes (single
sector, single block, directory block, inode cluster), so it makes more
sense to me to dump a buffer size histogram for memory usage. We can
infer things like inode cluster memory usage from such output, so not
only would we get memory usage, we'd also get some insight into what is
consuming the memory.

Hence I think it would be better to track a set of buffer size based
buckets so we get output something like:

buffer size    count    Total Bytes
-----------    -----    -----------
 < 4kB         <n>      <aggregate count of b_length>
 4kB
 <= 8kB
 <= 16kB
 <= 32kB
 <= 64kB

I also think that it might be better to dump this in a separate sysfs
file rather than add it to the existing stats file.

With this information on any given system, we can infer what is
allocated from slab based on the buffer sizes and the system PAGE_SIZE.
However, my main point is that for the general case of "how much memory
is in use by the buffer cache", we really don't want to tie it to the
internal allocation implementation. A histogram output like the above
is not tied to the internal implementation, whilst giving additional
insight into what size allocations are generating all the memory
usage...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
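
To make the kvmalloc() conversion above concrete, here's a minimal
sketch of the idea - the struct and function names are invented for
illustration, and this is not the actual XFS buffer code:

#include <linux/slab.h>         /* kvmalloc(), kvfree() */

/* Cut-down, hypothetical buffer - for illustration only. */
struct buf {
        void    *b_addr;
        size_t  b_length;       /* length of b_addr in bytes */
};

static int buf_alloc_mem(struct buf *bp, size_t size)
{
        /*
         * kvmalloc() tries kmalloc() first (slab or buddy pages) and
         * falls back to vmalloc() for large or fragmented requests, so
         * the caller no longer knows which allocator backs b_addr.
         */
        bp->b_addr = kvmalloc(size, GFP_KERNEL);
        if (!bp->b_addr)
                return -ENOMEM;
        bp->b_length = size;
        return 0;
}

static void buf_free_mem(struct buf *bp)
{
        kvfree(bp->b_addr);     /* handles kmalloc and vmalloc memory alike */
        bp->b_addr = NULL;
}

Once the backing allocator is an internal detail of kvmalloc() like
this, per-allocator counters such as @kbb_alloc and @vbb_alloc have
nothing meaningful to count.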
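
Similarly, here's a sketch of how the size-bucket accounting behind the
histogram above might be wired up - again with invented names, and the
real bucket boundaries and counter types are implementation details:

#include <linux/atomic.h>
#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/sizes.h>

/* Buckets matching the table above: <4k, 4k, <=8k, <=16k, <=32k, <=64k. */
#define BUF_SIZE_BUCKETS        6

struct buf_size_stats {
        atomic64_t      count[BUF_SIZE_BUCKETS];
        atomic64_t      bytes[BUF_SIZE_BUCKETS];
};

static unsigned int buf_size_bucket(size_t len)
{
        if (len < SZ_4K)
                return 0;
        if (len == SZ_4K)
                return 1;
        /* (4k,8k] -> 2, (8k,16k] -> 3, (16k,32k] -> 4, larger -> 5 */
        return min_t(unsigned int, ilog2(len - 1) - 10,
                        BUF_SIZE_BUCKETS - 1);
}

/*
 * Called on buffer allocation; a matching decrement on free keeps the
 * counters at "currently in use", so userspace never has to do the
 * (alloc - free) calculation itself.
 */
static void buf_size_stats_add(struct buf_size_stats *s, size_t len)
{
        unsigned int b = buf_size_bucket(len);

        atomic64_inc(&s->count[b]);
        atomic64_add(len, &s->bytes[b]);
}

A sysfs show method would then just walk the two arrays and print one
line per bucket in the format shown above.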