On Thu, May 15, 2025 at 12:45 AM Dev Jain <dev.jain@xxxxxxx> wrote:
>
>
>
> On 15/05/25 8:51 am, Nico Pache wrote:
> > Ugh... So sorry, I forgot to turn off the chain-reply-to.
> >
> > resending V7 *facepalm*
>
> In the future you can just send the same version again with [RESEND]
> prefixed in the subject, that prevents confusion.

Thanks, I'll do that next time.

> >
> > On Wed, May 14, 2025 at 9:03 PM Nico Pache <npache@xxxxxxxxxx> wrote:
> >>
> >> The following series provides khugepaged and madvise collapse with the
> >> capability to collapse anonymous memory regions to mTHPs.
> >>
> >> To achieve this we generalize the khugepaged functions to no longer depend
> >> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >> (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. This info is
> >> tracked using a bitmap. After the PMD scan is done, we do binary recursion
> >> on the bitmap to find the optimal mTHP sizes for the PMD range. The
> >> restriction on max_ptes_none is removed during the scan, to make sure we
> >> account for the whole PMD range. When no mTHP size is enabled, the legacy
> >> behavior of khugepaged is maintained. max_ptes_none will be scaled by the
> >> attempted collapse order to determine how full a THP must be to be
> >> eligible. If a mTHP collapse is attempted, but contains swapped out, or
> >> shared pages, we don't perform the collapse.
> >>
> >> With the default max_ptes_none=511, the code should keep most of its
> >> original behavior. To exercise mTHP collapse we need to set
> >> max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
> >> experience collapse "creep" and constantly promote mTHPs to the next
> >> available size. This is due to the fact that it will introduce at least 2x
> >> the number of pages, and on a future scan will satisfy that condition once
> >> again.
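
To make the max_ptes_none scaling and the "creep" condition described above concrete, here is a minimal Python sketch. It assumes 4 KiB base pages (so HPAGE_PMD_ORDER is 9 and HPAGE_PMD_NR is 512) and assumes the scaling is a right shift by the difference between the PMD order and the attempted order. The cover letter only says the value is "scaled by the attempted collapse order", so the exact formula and both helper names are illustrative, not taken from the patches.

```python
# Illustrative model only; names and the shift-based scaling are assumptions.
HPAGE_PMD_ORDER = 9                  # assumes 4 KiB base pages, 2 MiB PMD
HPAGE_PMD_NR = 1 << HPAGE_PMD_ORDER  # 512 PTEs per PMD


def scaled_max_ptes_none(max_ptes_none: int, order: int) -> int:
    """Scale the PMD-level max_ptes_none down to a smaller collapse order."""
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)


def would_creep(max_ptes_none: int, order: int) -> bool:
    """Check whether a freshly collapsed mTHP of `order`, whose neighbouring
    half is entirely empty, already qualifies for collapse to `order + 1`."""
    next_order = order + 1
    empty_ptes = (1 << next_order) - (1 << order)  # the unpopulated half
    return empty_ptes <= scaled_max_ptes_none(max_ptes_none, next_order)


if __name__ == "__main__":
    for limit in (511, 255):
        print(limit, [would_creep(limit, o) for o in range(2, HPAGE_PMD_ORDER)])
```

Under these assumptions a fully populated mTHP next to an empty neighbour is promoted at every order with max_ptes_none=511 and at no order with max_ptes_none=255, which is the behavior the paragraph above describes.
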
> >>
> >> Patch 1:     Refactor/rename hpage_collapse
> >> Patch 2:     Some refactoring to combine madvise_collapse and khugepaged
> >> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >> Patch 6-9:   The mTHP patches
> >> Patch 10-11: Tracing/stats
> >> Patch 12:    Documentation
> >>
> >> ---------
> >>  Testing
> >> ---------
> >> - Built for x86_64, aarch64, ppc64le, and s390x
> >> - selftests mm
> >> - I created a test script that I used to push khugepaged to its limits
> >>   while monitoring a number of stats and tracepoints. The code is
> >>   available here[1] (Run in legacy mode for these changes and set mthp
> >>   sizes to inherit)
> >>   The summary from my testing was that there was no significant
> >>   regression noticed through this test. In some cases my changes had
> >>   better collapse latencies, and were able to scan more pages in the same
> >>   amount of time/work, but for the most part the results were consistent.
> >> - redis testing. I tested these changes along with my defer changes
> >>   (see followup post for more details).
> >> - some basic testing on 64k page size.
> >> - lots of general use.
> >>
> >> V6 Changes:
> >> - Don't release the anon_vma_lock early (like in the PMD case), as not all
> >>   pages are isolated.
> >> - Define the PTE as null to avoid an uninitialized condition
> >> - minor nits and newline cleanup
> >> - make sure to unmap and unlock the pte for the swapin case
> >> - change the revalidation to always check the PMD order (as this will make
> >>   sure that no other VMA spans it)
> >>
> >> V5 Changes [2]:
> >> - switched the order of patches 1 and 2
> >> - fixed some edge cases on the unified madvise_collapse and khugepaged
> >> - Explained the "creep" some more in the docs
> >> - fix EXCEED_SHARED vs EXCEED_SWAP accounting issue
> >> - fix potential highmem issue caused by an early unmap of the PTE
> >>
> >> V4 Changes:
> >> - Rebased onto mm-unstable
> >> - small changes to Documentation
> >>
> >> V3 Changes:
> >> - corrected legacy behavior for khugepaged and madvise_collapse
> >> - added proper mTHP stat tracking
> >> - Minor changes to prevent a nested lock on non-split-lock arches
> >> - Took Dev's version of alloc_charge_folio as it has the proper stats
> >> - Skip cases where trying to collapse to a lower order would still fail
> >> - Fixed cases where the bitmap was not being updated properly
> >> - Moved Documentation update to this series instead of the defer set
> >> - Minor bugs discovered during testing and review
> >> - Minor "nit" cleanup
> >>
> >> V2 Changes:
> >> - Minor bug fixes discovered during review and testing
> >> - removed dynamic allocations for bitmaps, and made them stack based
> >> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >> - Updated trace events to include collapsing order info.
> >> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >> - No longer require a chunk to be fully utilized before setting the bit.
> >>   Use the same max_ptes_none scaling principle to achieve this.
> >> - Skip mTHP collapse that requires swapin or shared handling. This helps
> >>   prevent some of the "creep" that was discovered in v1.
> >>
> >> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >> [2] - https://lore.kernel.org/all/20250428181218.85925-1-npache@xxxxxxxxxx/
> >>
> >> Dev Jain (1):
> >>   khugepaged: generalize alloc_charge_folio()
> >>
> >> Nico Pache (11):
> >>   khugepaged: rename hpage_collapse_* to khugepaged_*
> >>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>     madvise_collapse
> >>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>   khugepaged: add mTHP support
> >>   khugepaged: skip collapsing mTHP to smaller orders
> >>   khugepaged: avoid unnecessary mTHP collapse attempts
> >>   khugepaged: improve tracepoints for mTHP orders
> >>   khugepaged: add per-order mTHP khugepaged stats
> >>   Documentation: mm: update the admin guide for mTHP collapse
> >>
> >>  Documentation/admin-guide/mm/transhuge.rst |  14 +-
> >>  include/linux/huge_mm.h                    |   5 +
> >>  include/linux/khugepaged.h                 |   4 +
> >>  include/trace/events/huge_memory.h         |  34 +-
> >>  mm/huge_memory.c                           |  11 +
> >>  mm/khugepaged.c                            | 472 ++++++++++++++-------
> >>  6 files changed, 382 insertions(+), 158 deletions(-)
> >>
> >> --
> >> 2.49.0
> >>
> >
>
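
As a rough illustration of the bitmap walk that the cover letter describes (track utilized chunks during the PMD scan, then split the range in halves to find collapse candidates), here is a minimal Python sketch. Everything concrete in it is an assumption: 4 KiB base pages, a chunk size of 2**2 pages standing in for KHUGEPAGED_MTHP_MIN_ORDER, a shift-based max_ptes_none scaling, and a rule that treats a set bit as a fully populated chunk. The function name is hypothetical and this is not the logic of khugepaged_scan_bitmap itself.

```python
# Illustrative model only; constants, names, and the eligibility rule are
# assumptions, not the code from the series.
PMD_ORDER = 9   # assumes 4 KiB base pages
MIN_ORDER = 2   # hypothetical stand-in for KHUGEPAGED_MTHP_MIN_ORDER


def pick_collapse_ranges(bitmap, enabled_orders, max_ptes_none):
    """Return (page_offset, order) pairs chosen by halving the PMD range.

    bitmap[i] is truthy when chunk i (2**MIN_ORDER pages) is utilized.
    """
    chosen = []
    stack = [(0, PMD_ORDER)]  # explicit stack instead of real recursion
    while stack:
        offset, order = stack.pop()
        first = offset >> MIN_ORDER
        nr_chunks = 1 << (order - MIN_ORDER)
        used = sum(1 for bit in bitmap[first:first + nr_chunks] if bit)
        if used == 0:
            continue  # nothing populated here, nothing to collapse
        empty_pages = (nr_chunks - used) << MIN_ORDER
        limit = max_ptes_none >> (PMD_ORDER - order)
        if order in enabled_orders and empty_pages <= limit:
            chosen.append((offset, order))
            continue
        if order > MIN_ORDER:
            half = 1 << (order - 1)
            # Push the right half first so the left half is handled first.
            stack.append((offset + half, order - 1))
            stack.append((offset, order - 1))
    return chosen
```

For example, with a 128-entry bitmap whose first four chunks are set, orders 4 and 9 enabled, and max_ptes_none=255, the walk rejects the full PMD (too many empty pages) and descends until it selects the single order-4 range at page offset 0.
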