On Mon, May 12, 2025 at 01:14 PM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> On Mon, May 12, 2025 at 12:35:39PM -0400, Zi Yan wrote:
>> On 12 May 2025, at 12:16, Lorenzo Stoakes wrote:
>>
>> > +cc Zi
>> >
>> > Hi Marc,
>> >
>> > I noticed this same bug as reported in [0], but only for a _very_ recent
>> > patch series by Zi, which is present only in mm-new, the most unstable
>> > mm branch right now :)
>> >
>> > So I wonder if it's related, or a coincidence caused by something else?
>>
>> Unless Marc's branch has my "make MIGRATE_ISOLATE a standalone bit" patchset,
>> it should be caused by something else.
>>
>> A bisect would be very helpful.
>>
>> >
>> > This is triggered by the mm self-test (in tools/testing/selftests/mm, you
>> > can just make -jXX there) transhuge-stress, invoked as:
>> >
>> > $ sudo ./transhuge-stress -d 20
>> >
>> > The stack traces do look very different though, so perhaps it's unrelated?
>>
>> The warning is triggered, in both cases, when a pageblock with MIGRATE_UNMOVABLE(0)
>> is moved to MIGRATE_RECLAIMABLE(2). The pageblock is supposed to have
>> MIGRATE_RECLAIMABLE(2) before the movement.
>
> The weird thing is that the warning is from expand(), when the broken up
> chunks are put *back*. Marc, can you confirm that this is the only
> warning in dmesg, and there aren't any before this one?

Yep, I've just checked: it was the first warning, and `panic_on_warn` is set to 1.

I managed to reproduce a similar crash using 6.15.0-rc7 (this time THP seems to be involved):

…
root@qemus390x:~# [ 40.442403] ------------[ cut here ]------------
[ 40.442471] page type is 0, passed migratetype is 1 (nr=256)
[ 40.442525] WARNING: CPU: 0 PID: 350 at mm/page_alloc.c:669 expand (mm/page_alloc.c:669 (discriminator 2) mm/page_alloc.c:1572 (discriminator 2))
[ 40.442558] Modules linked in: pkey_pckmo(E) pkey(E) diag288_wdt(E) watchdog(E) s390_trng(E) virtio_console(E) rng_core(E) vmw_vsock_virtio_transport(E) vmw_vsock_virtio_transport_common(E) vsock(E) ghash_s390(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha256_s390(E) sha1_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) nfnetlink(E) autofs4(E)
[ 40.442651] Unloaded tainted modules: hmac_s390(E):1
[ 40.442677] CPU: 0 UID: 0 PID: 350 Comm: mempig_verify Tainted: G E 6.15.0-rc7-11557-ga01c92c55b53 #1 PREEMPT
[ 40.442683] Tainted: [E]=UNSIGNED_MODULE
[ 40.442687] Hardware name: IBM 3931 A01 701 (KVM/Linux)
[ 40.442692] Krnl PSW : 0404d00180000000 000002ff929af40c expand (mm/page_alloc.c:669 (discriminator 10) mm/page_alloc.c:1572 (discriminator 10))
[ 40.442696] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 40.442699] Krnl GPRS: 000002ff80000004 0000000000000005 0000000000000030 0000000000000000
[ 40.442701] 0000000000000005 0000027f80000005 0000000000000100 0000000000000008
[ 40.442703] 000002ff93f99290 000001f63a415900 0000027500000008 00000275829f4000
[ 40.442704] 0000000000000000 0000000000000008 000002ff929af408 0000027f928c36f8
[ 40.442722] Krnl Code: 000002ff929af3fc: c02000883f4b larl %r2,000002ff93ab7292

Code starting with the faulting instruction
===========================================
[ 40.442722] 000002ff929af402: c0e5ffe7bd17 brasl %r14,000002ff926a6e30
[ 40.442722] #000002ff929af408: af000000 mc 0,0
[ 40.442722] >000002ff929af40c: a7f4ff49 brc 15,000002ff929af29e
[ 40.442722] 000002ff929af410: b904002b lgr %r2,%r11
[ 40.442722] 000002ff929af414: c03000881980 larl %r3,000002ff93ab2714
[ 40.442722] 000002ff929af41a: c0e5fffdd883 brasl %r14,000002ff9296a520
[ 40.442722] 000002ff929af420: af000000 mc 0,0
[ 40.442736] Call Trace:
[ 40.442738] expand (mm/page_alloc.c:669 (discriminator 10) mm/page_alloc.c:1572 (discriminator 10))
[ 40.442741] expand (mm/page_alloc.c:669 (discriminator 2) mm/page_alloc.c:1572 (discriminator 2))
[ 40.442743] rmqueue_bulk (mm/page_alloc.c:1587 mm/page_alloc.c:1758 mm/page_alloc.c:2311 mm/page_alloc.c:2364)
[ 40.442745] __rmqueue_pcplist (mm/page_alloc.c:3086)
[ 40.442748] rmqueue.isra.0 (mm/page_alloc.c:3124 mm/page_alloc.c:3155)
[ 40.442751] get_page_from_freelist (mm/page_alloc.c:3683)
[ 40.442754] __alloc_frozen_pages_noprof (mm/page_alloc.c:4967 (discriminator 1))
[ 40.442756] alloc_pages_mpol (mm/mempolicy.c:2290)
[ 40.442764] folio_alloc_mpol_noprof (mm/mempolicy.c:2322)
[ 40.442766] vma_alloc_folio_noprof (mm/mempolicy.c:2355 (discriminator 1))
[ 40.442769] vma_alloc_anon_folio_pmd (mm/huge_memory.c:1167 (discriminator 1))
[ 40.442773] __do_huge_pmd_anonymous_page (mm/huge_memory.c:1227 (discriminator 1))
[ 40.442775] __handle_mm_fault (mm/memory.c:5862 mm/memory.c:6111)
[ 40.442781] handle_mm_fault (mm/memory.c:6321)
[ 40.442783] do_exception (arch/s390/mm/fault.c:298)
[ 40.442792] __do_pgm_check (arch/s390/kernel/traps.c:345)
[ 40.442802] pgm_check_handler (arch/s390/kernel/entry.S:334)
[ 40.442805] Last Breaking-Event-Address:
[ 40.442806] __warn_printk (kernel/panic.c:801)
[ 40.442818] Kernel panic - not syncing: kernel: panic_on_warn set ...
[ 40.442822] CPU: 0 UID: 0 PID: 350 Comm: mempig_verify Tainted: G E 6.15.0-rc7-11557-ga01c92c55b53 #1 PREEMPT
[ 40.442825] Tainted: [E]=UNSIGNED_MODULE
[ 40.442826] Hardware name: IBM 3931 A01 701 (KVM/Linux)
[ 40.442827] Call Trace:
[ 40.442828] dump_stack_lvl (lib/dump_stack.c:122)
[ 40.442831] panic (kernel/panic.c:372)
[ 40.442833] check_panic_on_warn (kernel/panic.c:247)
[ 40.442836] __warn (kernel/panic.c:751)
[ 40.443057] report_bug (lib/bug.c:176 lib/bug.c:215)
[ 40.443064] monitor_event_exception (arch/s390/kernel/traps.c:227 (discriminator 1))
[ 40.443067] __do_pgm_check (arch/s390/kernel/traps.c:345)
[ 40.443071] pgm_check_handler (arch/s390/kernel/entry.S:334)
[ 40.443074] expand (mm/page_alloc.c:669 (discriminator 10) mm/page_alloc.c:1572 (discriminator 10))
[ 40.443077] expand (mm/page_alloc.c:669 (discriminator 2) mm/page_alloc.c:1572 (discriminator 2))
[ 40.443080] rmqueue_bulk (mm/page_alloc.c:1587 mm/page_alloc.c:1758 mm/page_alloc.c:2311 mm/page_alloc.c:2364)
[ 40.443087] __rmqueue_pcplist (mm/page_alloc.c:3086)
[ 40.443090] rmqueue.isra.0 (mm/page_alloc.c:3124 mm/page_alloc.c:3155)
[ 40.443093] get_page_from_freelist (mm/page_alloc.c:3683)
[ 40.443097] __alloc_frozen_pages_noprof (mm/page_alloc.c:4967 (discriminator 1))
[ 40.443100] alloc_pages_mpol (mm/mempolicy.c:2290)
[ 40.443104] folio_alloc_mpol_noprof (mm/mempolicy.c:2322)
[ 40.443110] vma_alloc_folio_noprof (mm/mempolicy.c:2355 (discriminator 1))
[ 40.443114] vma_alloc_anon_folio_pmd (mm/huge_memory.c:1167 (discriminator 1))
[ 40.443117] __do_huge_pmd_anonymous_page (mm/huge_memory.c:1227 (discriminator 1))
[ 40.443120] __handle_mm_fault (mm/memory.c:5862 mm/memory.c:6111)
[ 40.443123] handle_mm_fault (mm/memory.c:6321)
[ 40.443126] do_exception (arch/s390/mm/fault.c:298)
[ 40.443129] __do_pgm_check (arch/s390/kernel/traps.c:345)
[ 40.443132] pgm_check_handler (arch/s390/kernel/entry.S:334)
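For reference, mm/page_alloc.c:669 in the decoded trace is the pageblock/freelist
consistency check, inlined into expand() at mm/page_alloc.c:1572. In recent kernels
the check sits in __add_to_free_list(); a paraphrased sketch (not the verbatim
source, details may differ by version):

static inline void __add_to_free_list(struct page *page, struct zone *zone,
				      unsigned int order, int migratetype,
				      bool tail)
{
	struct free_area *area = &zone->free_area[order];

	/*
	 * The migratetype recorded in the pageblock bits must match the
	 * freelist the page is being placed on. "page type is 0, passed
	 * migratetype is 1 (nr=256)" means an order-8 chunk from a block
	 * still marked MIGRATE_UNMOVABLE (0) was being put back on the
	 * MIGRATE_MOVABLE (1) list while expand() split a larger chunk.
	 */
	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
		     get_pageblock_migratetype(page), migratetype, 1 << order);

	/* ... actual insertion into area->free_list[migratetype] elided ... */
}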
This time, the setup is even simpler:

1. Start a 2GB QEMU/KVM guest.
2. Now run some memory stress test (a rough stand-in sketch follows below).

I run this test in a loop (starting and shutting down the VM each time), and after many iterations the bug occurs.
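The stress test itself isn't included here, so as a purely hypothetical stand-in:
a program along the following lines exercises the same anonymous-THP fault path
the trace shows (handle_mm_fault -> __do_huge_pmd_anonymous_page -> ... -> expand()):

/*
 * Hypothetical stand-in for the stress test -- the real mempig_verify
 * is not shown in this thread. It faults in anonymous, THP-eligible
 * memory and then verifies a pattern, allocating through the same
 * path as the trace above.
 */
#include <stdio.h>
#include <sys/mman.h>

#define SZ (1UL << 30)	/* touch 1 GiB of the 2 GiB guest */

int main(void)
{
	unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(p, SZ, MADV_HUGEPAGE);	/* request THP, as in the trace */

	for (unsigned long i = 0; i < SZ; i += 4096)	/* fault pages in */
		p[i] = (unsigned char)(i >> 12);
	for (unsigned long i = 0; i < SZ; i += 4096)	/* verify pattern */
		if (p[i] != (unsigned char)(i >> 12)) {
			fprintf(stderr, "mismatch at byte %lu\n", i);
			return 1;
		}
	puts("ok");
	return 0;
}

Run inside the guest with THP enabled, the write loop takes PMD-sized faults,
which is the allocation path that trips the warning above.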
[…snip…]