On 2025/6/4 01:29, Shakeel Butt wrote:
On Tue, Jun 03, 2025 at 04:48:08PM +0200, Michal Hocko wrote:
On Tue 03-06-25 22:22:46, Baolin Wang wrote:
Let me try to clarify further.
The 'mm->rss_stat' counters are updated via add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), per-CPU batch
caching is used to avoid 'fbc->lock' contention.
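For reference, the batching works roughly like the sketch below (a
simplified version of what lib/percpu_counter.c does, not the exact
upstream implementation): each update accumulates in a per-CPU counter,
and 'fbc->lock' is only taken once the local delta reaches the batch
size, so most add_mm_counter() calls never touch the shared lock.

#include <linux/percpu_counter.h>

void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	unsigned long flags;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (abs(count) >= batch) {
		/* Slow path: fold the local delta into the shared count. */
		raw_spin_lock_irqsave(&fbc->lock, flags);
		fbc->count += count;
		__this_cpu_sub(*fbc->counters, count - amount);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
	} else {
		/* Fast path: purely per-CPU, no shared lock taken. */
		this_cpu_add(*fbc->counters, amount);
	}
	preempt_enable();
}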
OK, this is exactly the line of argument I was looking for. If _all_
updates done in the kernel use batching, and the lock is therefore only
taken every N (percpu_counter_batch) updates, then the risk of lock
contention is reduced. This is worth noting in the changelog.
OK.
This patch changes task_mem() and task_statm() to read the accurate mm
counters under 'fbc->lock', but this will not exacerbate kernel
'mm->rss_stat' lock contention, due to the percpu batch caching of the mm
counters.
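To illustrate the reader side, here is a minimal sketch, assuming the
/proc readers move from the lockless, approximate
percpu_counter_read_positive() to a locked sum along the lines of
percpu_counter_sum_positive() (the helper names below are illustrative,
not necessarily the ones the patch uses):

#include <linux/mm_types.h>
#include <linux/percpu_counter.h>

/* Old behaviour: lockless read that can miss per-CPU deltas of up to
 * batch * num_online_cpus() pages. */
static unsigned long mm_counter_approx(struct mm_struct *mm, int member)
{
	return percpu_counter_read_positive(&mm->rss_stat[member]);
}

/* New behaviour in task_mem()/task_statm(): take 'fbc->lock' and fold in
 * every CPU's pending delta to get an accurate value. */
static unsigned long mm_counter_exact(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}

Since the writers only take 'fbc->lock' on the batching slow path, the
readers holding it for the sum should rarely collide with updates.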
You might argue that my test cases cannot demonstrate actual lock
contention, but they have already shown that there is no significant
'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
I was arguing that `top -d 1' doesn't really represent a potentially
adverse usage. These proc files are generally readable, so I would expect
something like a busy-loop read while the process tries to update its
counters, to see the worst-case scenario. If that is barely visible, then
we can conclude that normal use wouldn't even notice.
OK.
Baolin, please run a stress-ng command that stresses minor anon page
faults in multiple threads, and then run multiple bash scripts which cat
/proc/pidof(stress-ng)/status. That should show how much the stress-ng
process is impacted by the parallel status readers versus without them.
Sure, thanks Shakeel. I ran stress-ng with the 'stress-ng --fault 32
--perf -t 1m' command, while simultaneously running the script below to
read /proc/pidof(stress-ng)/status for each thread.

From the following data, I did not observe any obvious impact of this
patch on the stress-ng tests while /proc/pidof(stress-ng)/status was
being read repeatedly.
w/o patch
stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles          59.767 B/sec
stress-ng: info: [6891] 1,472,101,565,760 Instructions        22.033 B/sec (0.369 instr. per cycle)
stress-ng: info: [6891]        36,287,456 Page Faults Total    0.543 M/sec
stress-ng: info: [6891]        36,287,456 Page Faults Minor    0.543 M/sec

w/ patch
stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles          60.177 B/sec
stress-ng: info: [6872] 1,484,856,150,976 Instructions        22.235 B/sec (0.369 instr. per cycle)
stress-ng: info: [6872]        36,547,456 Page Faults Total    0.547 M/sec
stress-ng: info: [6872]        36,547,456 Page Faults Minor    0.547 M/sec
=========================
#!/bin/bash

# Get the PIDs of stress-ng processes
PIDS=$(pgrep stress-ng)

# Loop through each PID and monitor /proc/[pid]/status in the background
for PID in $PIDS; do
    while true; do
        cat /proc/$PID/status
        usleep 100000
    done &
done