Re: [PATCH v3 13/20] mm: Copy-on-Write (COW) reuse support for PTE-mapped THP

On 19.04.25 18:25, David Hildenbrand wrote:
On 19.04.25 18:02, Kairui Song wrote:
On Tue, Mar 4, 2025 at 12:46 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

Currently, we never end up reusing PTE-mapped THPs after fork. This
wasn't really a problem with PMD-sized THPs, because they would have to
be PTE-mapped first, but it's becoming a problem with the smaller THP
sizes that are effectively always PTE-mapped.

With our new "mapped exclusively" vs. "maybe mapped shared" logic for
large folios, implementing CoW reuse for PTE-mapped THPs is
straightforward: if the folio is mapped exclusively, make sure that all
references are from these (our) mappings. Add some helpful comments to
explain the details.

CONFIG_TRANSPARENT_HUGEPAGE selects CONFIG_MM_ID. If we spot an anon
large folio without CONFIG_TRANSPARENT_HUGEPAGE in that code, something
is seriously messed up.

There are plenty of things we can optimize in the future: for example, we
could remember that the folio is fully exclusive so we could speed up
the next fault further. Also, we could try "faulting around", turning
surrounding PTEs that map the same folio writable. But especially the
latter might increase COW latency, so it would need further
investigation.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
   mm/memory.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++------
   1 file changed, 75 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 73b783c7d7d51..bb245a8fe04bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3729,19 +3729,86 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
          return ret;
   }

-static bool wp_can_reuse_anon_folio(struct folio *folio,
-                                   struct vm_area_struct *vma)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
+               struct vm_area_struct *vma)
   {
+       bool exclusive = false;
+
+       /* Let's just free up a large folio if only a single page is mapped. */
+       if (folio_large_mapcount(folio) <= 1)
+               return false;
+
          /*
-        * We could currently only reuse a subpage of a large folio if no
-        * other subpages of the large folios are still mapped. However,
-        * let's just consistently not reuse subpages even if we could
-        * reuse in that scenario, and give back a large folio a bit
-        * sooner.
+        * The assumption for anonymous folios is that each page can only get
+        * mapped once into each MM. The only exception are KSM folios, which
+        * are always small.
+        *
+        * Each taken mapcount must be paired with exactly one taken reference,
+        * whereby the refcount must be incremented before the mapcount when
+        * mapping a page, and the refcount must be decremented after the
+        * mapcount when unmapping a page.
+        *
+        * If all folio references are from mappings, and all mappings are in
+        * the page tables of this MM, then this folio is exclusive to this MM.
           */
-       if (folio_test_large(folio))
+       if (folio_test_large_maybe_mapped_shared(folio))
+               return false;
+
+       VM_WARN_ON_ONCE(folio_test_ksm(folio));
+       VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
+       VM_WARN_ON_ONCE(folio_entire_mapcount(folio));
+
+       if (unlikely(folio_test_swapcache(folio))) {
+               /*
+                * Note: freeing up the swapcache will fail if some PTEs are
+                * still swap entries.
+                */
+               if (!folio_trylock(folio))
+                       return false;
+               folio_free_swap(folio);
+               folio_unlock(folio);
+       }
+
+       if (folio_large_mapcount(folio) != folio_ref_count(folio))
                  return false;

+       /* Stabilize the mapcount vs. refcount and recheck. */
+       folio_lock_large_mapcount(folio);
+       VM_WARN_ON_ONCE(folio_large_mapcount(folio) < folio_ref_count(folio));

Hi David, I'm seeing this WARN_ON being triggered on my test machine:

Hi!

So I assume the following will not sort out the issue for you, correct?

https://lore.kernel.org/all/20250415095007.569836-1-david@xxxxxxxxxx/T/#u


I'm currently working on my swap table series and testing heavily with
swap-related workloads. I thought my patch might have broken the kernel,
but after more investigation and reverting to the current mm-unstable, it
still occurs (with a much lower chance, though; I think my series changed
the timing, which makes it more frequent in my case).

The test is simple: I just enable all mTHP sizes and repeatedly build
the Linux kernel in a 1G memcg using tmpfs.

The WARN is reproducible with current mm-unstable
(dc683247117ee018e5da6b04f1c499acdc2a1418):

[ 5268.100379] ------------[ cut here ]------------
[ 5268.105925] WARNING: CPU: 2 PID: 700274 at mm/memory.c:3792 do_wp_page+0xfc5/0x1080
[ 5268.112437] Modules linked in: zram virtiofs
[ 5268.115507] CPU: 2 UID: 0 PID: 700274 Comm: cc1 Kdump: loaded Not tainted 6.15.0-rc2.ptch-gdc683247117e #1434 PREEMPT(voluntary)
[ 5268.120562] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 5268.123025] RIP: 0010:do_wp_page+0xfc5/0x1080
[ 5268.124807] Code: 0d 80 77 32 02 0f 85 3e f1 ff ff 0f 1f 44 00 00 e9 34 f1 ff ff 48 0f ba 75 00 1f 65 ff 0d 63 77 32 02 0f 85 21 f1 ff ff eb e1 <0f> 0b e9 10 fd ff ff 65 ff 00 f0 48 0f ba 6d 00 1f 0f 83 ec fc ff
[ 5268.132034] RSP: 0000:ffffc900234efd48 EFLAGS: 00010297
[ 5268.134002] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 000fffffffe00000
[ 5268.136609] RDX: 0000000000000081 RSI: 00007f009cbad000 RDI: ffffea0012da0000
[ 5268.139371] RBP: ffffea0012da0068 R08: 80000004b682d025 R09: 00007f009c7c0000
[ 5268.142183] R10: ffff88839c48b8c0 R11: 0000000000000000 R12: ffff88839c48b8c0
[ 5268.144738] R13: ffffea0012da0000 R14: 00007f009cbadf10 R15: ffffc900234efdd8
[ 5268.147540] FS:  00007f009d1fdac0(0000) GS:ffff88a07ae14000(0000) knlGS:0000000000000000
[ 5268.150715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5268.153270] CR2: 00007f009cbadf10 CR3: 000000016c7c0001 CR4: 0000000000770eb0
[ 5268.155674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5268.158100] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5268.160613] PKRU: 55555554
[ 5268.161662] Call Trace:
[ 5268.162609]  <TASK>
[ 5268.163438]  ? ___pte_offset_map+0x1b/0x110
[ 5268.165309]  __handle_mm_fault+0xa51/0xf00
[ 5268.166848]  ? update_load_avg+0x80/0x760
[ 5268.168376]  handle_mm_fault+0x13d/0x360
[ 5268.169930]  do_user_addr_fault+0x2f2/0x7f0
[ 5268.171630]  exc_page_fault+0x6a/0x140
[ 5268.173278]  asm_exc_page_fault+0x26/0x30
[ 5268.174866] RIP: 0033:0x120e8e4
[ 5268.176272] Code: 84 a9 00 00 00 48 39 c3 0f 85 ae 00 00 00 48 8b 43 20 48 89 45 38 48 85 c0 0f 85 b7 00 00 00 48 8b 43 18 48 8b 15 6c 08 42 01 <0f> 11 43 10 48 89 1d 61 08 42 01 48 89 53 18 0f 11 03 0f 11 43 20
[ 5268.184121] RSP: 002b:00007fff8a855160 EFLAGS: 00010246
[ 5268.186343] RAX: 00007f009cbadbd0 RBX: 00007f009cbadf00 RCX: 0000000000000000
[ 5268.189209] RDX: 00007f009cbba030 RSI: 00000000000006f4 RDI: 0000000000000000
[ 5268.192145] RBP: 00007f009cbb6460 R08: 00007f009d10f000 R09: 000000000000016c
[ 5268.194687] R10: 0000000000000000 R11: 0000000000000010 R12: 00007f009cf97660
[ 5268.197172] R13: 00007f009756ede0 R14: 00007f0097582348 R15: 0000000000000002
[ 5268.199419]  </TASK>
[ 5268.200227] ---[ end trace 0000000000000000 ]---

I also changed the WARN_ON to WARN_ON_FOLIO at one point, and got more info here:

[ 3994.907255] page: refcount:9 mapcount:1 mapping:0000000000000000 index:0x7f90b3e98 pfn:0x615028
[ 3994.914449] head: order:3 mapcount:8 entire_mapcount:0 nr_pages_mapped:8 pincount:0
[ 3994.924534] memcg:ffff888106746000
[ 3994.927868] anon flags: 0x17ffffc002084c(referenced|uptodate|owner_2|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[ 3994.933479] raw: 0017ffffc002084c ffff88816edd9128 ffffea000beac108 ffff8882e8ba6bc9
[ 3994.936251] raw: 00000007f90b3e98 0000000000000000 0000000900000000 ffff888106746000
[ 3994.939466] head: 0017ffffc002084c ffff88816edd9128 ffffea000beac108 ffff8882e8ba6bc9
[ 3994.943355] head: 00000007f90b3e98 0000000000000000 0000000900000000 ffff888106746000
[ 3994.946988] head: 0017ffffc0000203 ffffea0018540a01 0000000800000007 00000000ffffffff
[ 3994.950328] head: ffffffff00000007 00000000800000a3 0000000000000000 0000000000000008
[ 3994.953684] page dumped because: VM_WARN_ON_FOLIO(folio_large_mapcount(folio) < folio_ref_count(folio))
[ 3994.957534] ------------[ cut here ]------------
[ 3994.959917] WARNING: CPU: 16 PID: 555282 at mm/memory.c:3794 do_wp_page+0x10c0/0x1110
[ 3994.963069] Modules linked in: zram virtiofs
[ 3994.964726] CPU: 16 UID: 0 PID: 555282 Comm: sh Kdump: loaded Not tainted 6.15.0-rc1.ptch-ge39aef85f4c0-dirty #1431 PREEMPT(voluntary)
[ 3994.969985] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 3994.972905] RIP: 0010:do_wp_page+0x10c0/0x1110
[ 3994.974477] Code: fe ff 0f 0b bd f5 ff ff ff e9 16 fb ff ff 41 83 a9 bc 12 00 00 01 e9 2f fb ff ff 48 c7 c6 90 c2 49 82 4c 89 ef e8 40 fd fe ff <0f> 0b e9 6a fc ff ff 65 ff 00 f0 48 0f ba 6d 00 1f 0f 83 46 fc ff
[ 3994.981033] RSP: 0000:ffffc9002b3c7d40 EFLAGS: 00010246
[ 3994.982636] RAX: 000000000000005b RBX: 0000000000000000 RCX: 0000000000000000
[ 3994.984778] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff889ffea16a80
[ 3994.986865] RBP: ffffea0018540a68 R08: 0000000000000000 R09: c0000000ffff7fff
[ 3994.989316] R10: 0000000000000001 R11: ffffc9002b3c7b80 R12: ffff88810cfd7d40
[ 3994.991654] R13: ffffea0018540a00 R14: 00007f90b3e9d620 R15: ffffc9002b3c7dd8
[ 3994.994076] FS:  00007f90b3caa740(0000) GS:ffff88a07b194000(0000) knlGS:0000000000000000
[ 3994.996939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3994.998902] CR2: 00007f90b3e9d620 CR3: 0000000104088004 CR4: 0000000000770eb0
[ 3995.001314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3995.003746] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3995.006173] PKRU: 55555554
[ 3995.007117] Call Trace:
[ 3995.007988]  <TASK>
[ 3995.008755]  ? __pfx_default_wake_function+0x10/0x10
[ 3995.010490]  ? ___pte_offset_map+0x1b/0x110
[ 3995.011929]  __handle_mm_fault+0xa51/0xf00
[ 3995.013346]  handle_mm_fault+0x13d/0x360
[ 3995.014796]  do_user_addr_fault+0x2f2/0x7f0
[ 3995.016331]  ? sigprocmask+0x77/0xa0
[ 3995.017656]  exc_page_fault+0x6a/0x140
[ 3995.018978]  asm_exc_page_fault+0x26/0x30
[ 3995.020309] RIP: 0033:0x7f90b3d881a7
[ 3995.021461] Code: e8 4e b1 f8 ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 55 31 c0 ba 01 00 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 <f0> 0f b1 15 71 54 11 00 0f 85 3b 01 00 00 48 8b 35 84 54 11 00 48
[ 3995.028091] RSP: 002b:00007ffc33632c90 EFLAGS: 00010206
[ 3995.029992] RAX: 0000000000000000 RBX: 0000560cfbfc0a40 RCX: 0000000000000000
[ 3995.032456] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000560cfbfc0a40
[ 3995.034794] RBP: 00007ffc33632ca0 R08: 00007ffc33632d50 R09: 00007ffc33632cff
[ 3995.037534] R10: 00007ffc33632c70 R11: 00007ffc33632d00 R12: 0000560cfbfc0a40
[ 3995.041063] R13: 00007f90b3e97fd0 R14: 00007f90b3e97fa8 R15: 0000000000000000
[ 3995.044390]  </TASK>
[ 3995.045510] ---[ end trace 0000000000000000 ]---

My guess is that folio_ref_count is not a reliable thing to check here:
anything can increase the folio's refcount even without locking it, for
example a swap cache lookup, or maybe anything iterating the LRU.

It is reliable: we are holding the mapcount lock, so for each mapcount
we must have a corresponding refcount. If that is not the case, we have
an issue elsewhere.

Other references may only increase the refcount, but they cannot violate
the mapcount vs. refcount condition.
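
To spell the invariant out, here is a rough userspace model of the ordering
the comment in the patch describes (purely illustrative; the struct and
helper names are made up and this is not the actual rmap code):

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Toy model of a large folio's counters. */
struct toy_folio {
	atomic_int refcount;	/* all references, incl. swapcache/LRU users */
	atomic_int mapcount;	/* page-table mappings only */
};

static void toy_map_page(struct toy_folio *f)
{
	/* The refcount is incremented *before* the mapcount when mapping ... */
	atomic_fetch_add(&f->refcount, 1);
	atomic_fetch_add(&f->mapcount, 1);
}

static void toy_unmap_page(struct toy_folio *f)
{
	/* ... and decremented *after* the mapcount when unmapping. */
	atomic_fetch_sub(&f->mapcount, 1);
	atomic_fetch_sub(&f->refcount, 1);
}

int main(void)
{
	struct toy_folio f = { .refcount = 0, .mapcount = 0 };

	toy_map_page(&f);
	toy_map_page(&f);
	/* A transient reference (e.g. swapcache, LRU walker) only bumps the refcount. */
	atomic_fetch_add(&f.refcount, 1);

	/*
	 * With the mapcount stabilized, refcount >= mapcount must always hold;
	 * equality would mean every reference is one of our mappings, i.e. the
	 * folio is exclusive to this MM.
	 */
	assert(atomic_load(&f.refcount) >= atomic_load(&f.mapcount));
	printf("exclusive: %d\n",
	       atomic_load(&f.refcount) == atomic_load(&f.mapcount));
	return 0;
}

So an extra reference can only make the refcount exceed the mapcount; it can
never make the refcount drop below it.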

Can you also reproduce this with swap disabled?

Oh, re-reading the condition 3 times, I realize that the sanity check is wrong ...

diff --git a/mm/memory.c b/mm/memory.c
index 037b6ce211f1f..a17eeef3f1f89 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3789,7 +3789,7 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
        /* Stabilize the mapcount vs. refcount and recheck. */
        folio_lock_large_mapcount(folio);
-       VM_WARN_ON_ONCE(folio_large_mapcount(folio) < folio_ref_count(folio));
+       VM_WARN_ON_ONCE(folio_large_mapcount(folio) > folio_ref_count(folio));
        if (folio_test_large_maybe_mapped_shared(folio))
                goto unlock;

Our refcount must be at least the mapcount; that's what we want to assert.
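
To make the direction concrete, here is a minimal sketch of what the
corrected check means, using the counters from Kairui's folio dump above
(large mapcount 8, refcount 9). This is plain standalone C with a made-up
can_reuse() helper, not the kernel code, and it ignores the other checks
in the patch (e.g. the maybe-mapped-shared test):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * With the large mapcount lock held, every mapping pins one reference,
 * so mapcount > refcount would indicate a bug. The reuse decision itself
 * boils down to "are *all* references mappings in our page tables?".
 */
static bool can_reuse(int large_mapcount, int refcount)
{
	/* The fixed VM_WARN_ON_ONCE condition: this must never be true. */
	assert(!(large_mapcount > refcount));
	/* Exclusive -> safe to reuse the folio for CoW. */
	return large_mapcount == refcount;
}

int main(void)
{
	/* Kairui's dump: mapcount 8, refcount 9 (one extra reference). */
	printf("reuse: %d\n", can_reuse(8, 9));	/* 0: not provably exclusive */
	printf("reuse: %d\n", can_reuse(8, 8));	/* 1: all references are our mappings */
	return 0;
}

With the inverted '<' the warning fires exactly in the legitimate 8 vs. 9
case, which matches the reports above.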

Can you test and send a fix patch if that makes it fly for you?

--
Cheers,

David / dhildenb




