Re: [PATCH] generic/764: fsstress + migrate_pages() test

On Fri, Mar 28, 2025 at 07:22:45AM +1100, Dave Chinner wrote:
> On Thu, Mar 27, 2025 at 12:53:30PM +0100, Jan Kara wrote:
> > On Wed 26-03-25 11:50:55, Luis Chamberlain wrote:
> > > 0-day reported a page migration kernel warning with folios which happen
> > > to be buffer-heads [0]. I'm having a terribly hard time reproducing the
> > > bug, so I wrote this test to force page migration on filesystems.
> > > 
> > > It turns out we have no tests for page migration in fstests or ltp,
> > > and it's no surprise: other than compaction, covered by generic/750,
> > > there is no easy way to trigger page migration right now unless you
> > > have a numa system.
> > > 
> > > We should evaluate if we want to help stress test page migration
> > > artificially by later implementing a way to do page migration on simple
> > > systems to an artificial target.
> > > 
> > > So far, this doesn't trigger any kernel splats, not even warnings for me.
> > > 
> > > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > Link: https://lore.kernel.org/r/202503101536.27099c77-lkp@xxxxxxxxx # [0]
> > > Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> > 
> > So when I was testing page migration in the past MM guys advised me to use
> > THP compaction as a way to trigger page migration. You can manually
> > trigger compaction by:
> > 
> > echo 1 >/proc/sys/vm/compact_memory
> 
> Right, that's what generic/750 does. It runs fsstress and every 5
> seconds runs memory compaction in the background.
> 
> > So you first mess with the page cache a bit to fragment memory and then
> > call the above to try to compact it back...
> 
> Which is effectively what g/750 tries to exercise.

Indeed. And I've run g/750 for over 24 hours trying to reproduce the
issue Oliver reported, without success, so this test augments the
coverage.

The original report from Oliver was about ltp syscalls-04/close_range01
tripping a warning on the buffer_migrate_folio_norefs() path, where a
spinlock is held across a context that can sleep. But the report indicates
the test ran on btrfs, and btrfs does not use buffer_migrate_folio_norefs().
Even though the splat and Matthew's diagnosis make it clear the spinlock
usage needs fixing, reproducing the issue would still be good. That has
been hard.

In fact there are only a few users of buffer_migrate_folio_norefs() left;
ext4 is one of them, as is the block layer.

I wrote this test to see if it might help exercise another path: the
migration side of numa nodes with ext4. But sadly I can't reproduce
the issue yet.

I'm next trying fio against a directory on a block device and then
looping migratepages on the fio pid, essentially bouncing fio's memory
from one node to another in a loop. And... nothing yet, even if I also
loop triggering compaction.
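For reference, the fio + migratepages loop I'm describing can be sketched
roughly like this (a sketch only: /mnt/test, the fio job parameters, and
the two-node layout are all illustrative assumptions; migratepages(8) is
the tool shipped with numactl):

```shell
#!/bin/sh
# Rough sketch: run a buffered-write fio job and bounce its pages
# between NUMA nodes 0 and 1 while it runs, forcing compaction on
# every pass. Guarded so it degrades to a clean no-op on systems
# without the tools, without a /mnt/test mount, or without root.
if command -v fio >/dev/null 2>&1 && \
   command -v migratepages >/dev/null 2>&1 && \
   [ -d /mnt/test ] && [ -w /proc/sys/vm/compact_memory ]; then
	fio --name=migrate-bounce --directory=/mnt/test \
	    --rw=randwrite --bs=4k --size=512M \
	    --time_based --runtime=60 &
	fio_pid=$!
	while kill -0 "$fio_pid" 2>/dev/null; do
		migratepages "$fio_pid" 0 1	# node 0 -> node 1
		migratepages "$fio_pid" 1 0	# and back again
		echo 1 > /proc/sys/vm/compact_memory
		sleep 1
	done
	wait "$fio_pid"
else
	echo "fio/migratepages/compact_memory not available; skipping"
fi
```

This would need to run as root, on a system with at least two NUMA
nodes, against the filesystem under test mounted at /mnt/test.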

Syzbot recently provided another reproducer in C [0], but that hasn't
let me reproduce the issue yet either.

> When it's run by check-parallel, compaction ends up doing a lot
> more work over a much wider range of tests...

Yeah, I would hope the issue is reproducible with check-parallel. I
haven't been able to run it yet, but as soon as I do I am going to
be super happy, due to the huge benefits it will bring to testing.

[0] https://lkml.kernel.org/r/67e57c41.050a0220.2f068f.0033.GAE@xxxxxxxxxx

  Luis



