Re: [LSF/MM/BPF Topic] synthetic mm testing like page migration

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 27 Mar 2025 07:46:03 +1100

On Wed, Mar 26, 2025 at 11:59:48AM -0700, Luis Chamberlain wrote:
> I'd like to propose this as a BoF for MM.
> 
> We can find issues if we test them, but some bugs are hard to reproduce,
> especially some mm bugs. How far are we willing to add knobs to help with
> synthetic tests which may not apply to numa for instance? An
> example is the recent patch I just posted to force testing page
> migration [0]. We can only run that test if we have a numa system, and a
> lot of testing today runs on guests without numa. Would we be willing
> to add a fake numa node to help with synthetic tests like page
> migration?

Boot your test VMs with numa=fake=4, and now you have a 4 node
system being tested even though it's not a real, physical numa
machine.  I've been doing this for the best part of 15 years now
with a couple of my larger test VMs explicitly to test NUMA
interactions.
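
As a rough sketch of checking that the fake nodes showed up: assuming
the guest kernel is built with NUMA emulation (CONFIG_NUMA_EMU) and was
booted with numa=fake=4 on the kernel command line, the nodes appear
under /sys/devices/system/node. The helper name and its optional
directory argument are made up for illustration.

```shell
# Count the NUMA nodes the kernel exposes in sysfs.
# count_numa_nodes is a hypothetical helper; the optional directory
# argument exists only so it can be pointed at a test tree.
count_numa_nodes() {
    node_dir="${1:-/sys/devices/system/node}"
    count=0
    for d in "$node_dir"/node[0-9]*; do
        # An unmatched glob stays literal and fails the -d check.
        if [ -d "$d" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# In a guest booted with numa=fake=4 this should report 4 nodes.
count_numa_nodes
```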

I also have a large 64p VM with explicit qemu NUMA configuration
that mirrors the underlying hardware NUMA layout. This allows NUMA
aware perf testing from inside that VM that responds the same as a
real physical machine would.

$ lscpu
....
CPU(s):                   64
  On-line CPU(s) list:    0-63
    Thread(s) per core:   1
    Core(s) per socket:   16
    Socket(s):            4
.....
NUMA:                     
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-15
  NUMA node1 CPU(s):      16-31
  NUMA node2 CPU(s):      32-47
  NUMA node3 CPU(s):      48-63
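
A guest topology like the one above could be described to QEMU roughly
as follows. This is only a sketch: the helper name is hypothetical, and
the 4G per node memory size is an assumption, not the real
configuration. It covers the guest-visible layout only; pinning vCPUs
and memory to the matching host nodes is a separate step not shown
here.

```shell
# Print QEMU arguments for a guest with evenly sized NUMA nodes.
# qemu_numa_args is a hypothetical helper.
# Usage: qemu_numa_args <nodes> <cpus-per-node> <GiB-per-node>
qemu_numa_args() {
    nodes=$1
    cpus_per_node=$2
    mem_gb=$3
    total_cpus=$((nodes * cpus_per_node))

    # One socket per node, one thread per core, as in the lscpu output.
    printf '%s\n' "-smp ${total_cpus},sockets=${nodes},cores=${cpus_per_node},threads=1"
    # Total guest memory has to match the sum of the per-node backends.
    printf '%s\n' "-m $((nodes * mem_gb))G"

    i=0
    while [ "$i" -lt "$nodes" ]; do
        first=$((i * cpus_per_node))
        last=$((first + cpus_per_node - 1))
        printf '%s\n' "-object memory-backend-ram,id=mem${i},size=${mem_gb}G"
        printf '%s\n' "-numa node,nodeid=${i},cpus=${first}-${last},memdev=mem${i}"
        i=$((i + 1))
    done
}

# 4 nodes x 16 CPUs; the 4 (GiB per node) is an assumed value.
qemu_numa_args 4 16 4
```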

This is also the VM I'm doing most of my performance testing and
check-parallel development on, so I see the NUMA scalability issues
that occur when trying to make use of the underlying hardware NUMA
capability...

> Then what else could we add to help stress test page migration and
> compaction further? We already have generic/750 and that has found some
> snazzy issues so far. But what else can we do to help random guests
> all over running fstests start covering complex mm tests better?

Use check-parallel on buffered loop devices - it'll generate a heap
of page cache pressure from all the IO, and run a heap more
tests at the same time as the compaction is running from g/750. This
often overlaps with g/650 which does background CPU hotplug, and it
definitely overlaps with other tests running drop_caches, mount,
unmount, etc, too.

One of the eventual goals of check-parallel is to have all these
environmental variables like memory load, compaction, cpu
hotplug, etc to be changing in the background whilst the tests are
running so that we can exercise all the filesystem functionality
under changing MM and environmental conditions without having to
code that into individual tests....
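
To make the idea concrete, here is a minimal sketch of running a
command while the environment changes underneath it. It is not how
check-parallel does it; the function name, the variables, and the 5
second default interval are all made up for illustration. It assumes
root and the standard /proc/sys/vm/compact_memory and
/proc/sys/vm/drop_caches knobs.

```shell
# Knob paths and interval are overridable so the sketch can be tried
# against scratch files without root. All names here are hypothetical.
COMPACT_KNOB="${COMPACT_KNOB:-/proc/sys/vm/compact_memory}"
DROP_CACHES_KNOB="${DROP_CACHES_KNOB:-/proc/sys/vm/drop_caches}"
PERTURB_INTERVAL="${PERTURB_INTERVAL:-5}"

# Run "$@" while a background loop keeps triggering compaction and
# dropping caches, then stop the loop and return the command's status.
run_with_mm_noise() {
    (
        while :; do
            echo 1 > "$COMPACT_KNOB"       # compact all zones
            echo 3 > "$DROP_CACHES_KNOB"   # drop page cache and slab
            sleep "$PERTURB_INTERVAL"
        done
    ) &
    noise_pid=$!

    "$@"
    status=$?

    # Stop the background loop; ignore errors if it already exited.
    kill "$noise_pid" 2>/dev/null || true
    wait "$noise_pid" 2>/dev/null || true
    return "$status"
}

# Example (as root, from an fstests checkout):
#   run_with_mm_noise ./check -g quick
```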

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx