Re: [DISCUSSION] proposed mctl() API

Usama Arif <usamaarif642@xxxxxxxxx> · Thu, 29 May 2025 19:32:10 +0100

On 29/05/2025 19:13, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 10:54:34AM -0700, Shakeel Butt wrote:
>> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
>>> People should put more effort into allocating THPs automatically and
>>> monitoring where they're helping performance and where they're hurting
>>> performance,
>>
>> Can you please expand on people putting more effort? Is it about
>> workloads or maybe malloc implementations (tcmalloc, jemalloc) being
>> more intelligent in managing their allocations/frees to keep more used
>> memory in hugepage aligned regions? And conveying to kernel which
>> regions they prefer hugepage backed and which they do not? Or something
>> else?
> 
> We need infrastructure inside the kernel to monitor whether a task is
> making effective use of the THPs that it has, and if it's not then move
> those THPs over to where they will be better used.
> 

I think this is the really difficult part.

Say we have two workloads on the same server: for example, one is a
database where THPs just don't do well, and the other is an AI workload
where THPs do really well. How will the kernel detect that the database
workload is performing worse and the AI one isn't?

I added the THP shrinker to try to do this automatically, and it does
really help. Unfortunately, it is not a complete solution.
There are severely memory-bound workloads where even a tiny increase
in memory usage will lead to an OOM. If you colocate the container that's
running such a workload with one that benefits from THPs, we unfortunately
can't just rely on the system doing the right thing.

It would be awesome if THPs were truly transparent and didn't require
any input, but unfortunately I don't think that there is a solution
for this with just kernel monitoring.

What is being proposed here is just a big hint from the user. If the global
system policy is madvise and the workload owner has done their own
benchmarks and seen benefits with always, they set DEFAULT_MADV_HUGEPAGE
for the process to opt in to "always".
If the global system policy is always and the workload owner has done their
own benchmarks and seen worse results with always, they set
DEFAULT_MADV_NOHUGEPAGE for the process to opt in to "madvise".
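
To make the intended semantics concrete, here is a rough toy model of the
two cases above. It is not kernel code and not the proposed interface: the
function name effective_thp_policy() is hypothetical, and only the two flag
names come from this thread. It deliberately covers just the "always" and
"madvise" global policies, since those are the only ones described here.

```python
from typing import Optional

# Flag names taken from the discussion; modelled as plain strings here.
DEFAULT_MADV_HUGEPAGE = "DEFAULT_MADV_HUGEPAGE"
DEFAULT_MADV_NOHUGEPAGE = "DEFAULT_MADV_NOHUGEPAGE"


def effective_thp_policy(global_policy: str,
                         process_default: Optional[str] = None) -> str:
    """Hypothetical helper: THP policy a process would effectively see."""
    if global_policy not in ("always", "madvise"):
        # Other global policies are not covered by the description above.
        raise ValueError(f"global policy not modelled: {global_policy!r}")
    if process_default is None:
        # No per-process hint: the global system policy applies unchanged.
        return global_policy
    if process_default == DEFAULT_MADV_HUGEPAGE:
        # Workload owner benchmarked and wants THPs everywhere.
        return "always"
    if process_default == DEFAULT_MADV_NOHUGEPAGE:
        # Workload owner benchmarked and wants THPs only where madvise()d.
        return "madvise"
    raise ValueError(f"unknown per-process default: {process_default!r}")
```

So effective_thp_policy("madvise", DEFAULT_MADV_HUGEPAGE) gives "always" for
the AI-style workload, while effective_thp_policy("always",
DEFAULT_MADV_NOHUGEPAGE) gives "madvise" for the database-style one.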

> I don't necessarily object to userspace giving hints like "I think I'm
> going to use all of this 20MB region quite heavily", but the kernel should
> treat those hints with the appropriate skepticism, otherwise it's just
> a turbo button that nobody would ever _not_ press.
> 
>>> instead of coming up with these baroque reasons to blame
>>> the sysadmin for not having tweaked some magic knob.
>>
>> To me this is not about blaming sysadmin but more about sysadmin wanting
>> more fine grained control on THP allocation policies for different
>> workloads running in a multi-tenant environment.
> 
> That's the same thing.  Linux should be auto-tuning, not relying on some
> omniscient sysadmin to fix it up.