Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> · Fri, 30 May 2025 17:52:03 +0800

On 2025/5/30 17:16, David Hildenbrand wrote:
On 30.05.25 11:10, David Hildenbrand wrote:
On 30.05.25 10:59, Ryan Roberts wrote:
On 30/05/2025 09:44, David Hildenbrand wrote:
On 30.05.25 10:04, Ryan Roberts wrote:
On 29/05/2025 09:23, Baolin Wang wrote:
As we discussed in the previous thread [1], MADV_COLLAPSE will ignore the
system-wide anon/shmem THP sysfs settings, which means that even though we
have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into an anon/shmem THP. This violates the rule we have
agreed upon: never means never. This patch set will address this issue.

This is a drive-by comment from me without having the previous context,
but...

Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a
deliberate, user-initiated, synchronous request to use huge pages for a
range of memory. There is nothing *transparent* about it, it just happens
to be implemented using the same logic that THP uses.

I always thought this was a deliberate design decision.

If the admin said "never", then why should a user be able to overwrite that?

Well my interpretation would be that the admin is saying never
*transparently* give anyone any hugepages; on balance it does more harm
than good for my workloads. The toggle is called
transparent_hugepage/enabled, after all.
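
For reference, the toggle reports the active choice as the bracketed word in
a line such as "always madvise [never]". A minimal userspace sketch for
reading it is below. The helper names thp_selected_mode() and thp_read_mode()
are made up for illustration, and the path is assumed to be
/sys/kernel/mm/transparent_hugepage/enabled.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token of a sysfs-style selector line such as
 * "always madvise [never]" into out. Returns 0 on success, -1 if no
 * well-formed "[token]" is present or out is too small.
 * (Hypothetical helper, not a kernel or libc API.)
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/*
 * Read the first line of path and extract its selected mode.
 * Assumed path: /sys/kernel/mm/transparent_hugepage/enabled
 */
int thp_read_mode(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        ret = thp_selected_mode(line, out, outlen);
    fclose(f);
    return ret;
}
```

A caller would pass the sysfs path to thp_read_mode() and compare the result
against "never" to see what the admin selected.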

I'd say it's "enabling transparent huge pages" not "transparently
enabling huge pages". After all, these things are ... transparent huge
pages.

But yeah, it's confusing.

Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
opportune moment in time, presumably because the user knows that the region
*will* benefit and because that point in the execution is not sensitive to
latency.
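
To make the "explicit, synchronous request" concrete, here is a rough sketch
of what such a call looks like from userspace. try_collapse() is a
hypothetical wrapper, not an existing API; the fallback value 25 for
MADV_COLLAPSE is the one in the generic Linux uapi headers, for toolchains
whose libc headers predate it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for madvise() under strict -std= modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* generic Linux uapi value; older headers lack it */
#endif

/*
 * Ask the kernel to synchronously collapse [addr, addr + len) into huge
 * pages. Returns 0 on success or a negative errno value on failure.
 * (Hypothetical wrapper for illustration.)
 */
int try_collapse(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return -errno;
}
```

The caller picks both the range and the moment, which is the distinction
being drawn here. Which errno comes back when collapse is refused depends on
the kernel, so the sketch only passes it through.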

Not sure if MADV_HUGEPAGE is really *that* different.

I see them as logically separate.

The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
that. Because that was set by the app itself (MADV_NOHUGEPAGE).

IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting: if we set
VM_NOHUGEPAGE, then MADV_COLLAPSE will not be allowed to collapse a THP.
See:
__thp_vma_allowable_orders() ---> vma_thp_disabled()
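
The two policies under discussion can be sketched as a toy userspace model.
This is not kernel code: every name below (thp_mode, collapse_request,
collapse_allowed, honor_sysfs) is hypothetical, and the real check lives in
the functions named above.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; none of these are kernel identifiers. */
enum thp_mode { THP_ALWAYS, THP_MADVISE, THP_NEVER };

struct collapse_request {
    bool vma_nohugepage;  /* region was marked with MADV_NOHUGEPAGE */
    enum thp_mode sysfs;  /* system-wide setting chosen by the admin */
};

/*
 * Toy model of "may MADV_COLLAPSE proceed?".
 * honor_sysfs == false models the behaviour the cover letter describes
 * (sysfs setting ignored); honor_sysfs == true models "never means never".
 */
bool collapse_allowed(const struct collapse_request *req, bool honor_sysfs)
{
    if (req->vma_nohugepage)
        return false; /* the per-VMA opt-out is respected in both models */
    if (honor_sysfs && req->sysfs == THP_NEVER)
        return false; /* the admin's "never" also blocks explicit collapse */
    return true;
}
```

In this model the only case where the two policies differ is sysfs == never
without VM_NOHUGEPAGE, which is the inconsistency being debated.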

Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
don't want the risk of latency spikes and memory bloat that THP can cause".
Not "ignore my explicit requests to MADV_COLLAPSE".

But if that decision was already taken and that's the current behavior then
I agree we have an inconsistency with respect to the sysfs control.

Perhaps we should be guided by real world usage - AIUI there is a cloud that
disables THP at system level today (Google?).
The use case I am aware of is disabling it for debugging purposes.
Saved us quite some headache in the past at customer sites for
troubleshooting + workarounds ...

Let's take a look at the man page:

MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
under /sys/kernel/mm/transparent_hugepage, both in terms of determining
THP eligibility, and allocation semantics.

I recall we discussed that it should ignore the max_ptes_none/swap/shared.

But "any" setting would include "enable" ...

It kind-of contradicts the linked
Documentation/admin-guide/mm/transhuge.rst, where we have this *beautiful*
comment

"Transparent Hugepage Support for anonymous memory can be entirely
disabled (mostly for debugging purposes)".

I mean, "entirely" is also pretty clear to me.

Yes, agree. We have encountered issues caused by THP in our Alibaba fleet.
The quickest way to stop the bleeding was to disable THP. In such a case,
we do not expect MADV_HUGEPAGE to still collapse a THP.