Re: [DISCUSSION] proposed mctl() API

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Tue, 10 Jun 2025 17:02:06 +0100

On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>
>
> On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> > On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> >>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
> >>>>
> >>>>
> >>>> My knowledge is security is limited, so please bare with me, but I actually
> >>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> >>>> doing VM_(NO)HUGEPAGE.
> >>>>
> >>>> A process can already madvise its own VMAs, and this is just doing that
> >>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> >>>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
> >>>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> >>>> for that process as well, which again doesnt seem like a security issue
> >>>> to me.
> >>>
> >>> W.R.T. the current process, the Issue is one Jann raised, in relation to
> >>> propagation of behaviour to privileged (e.g. setuid) processes.
> >>>
> >>
> >> But what is the actual security issue of having hugepages (or not having them) when
> >> the process is running with setuid?
> >
> > Speak to Jann about this. Security isn't my area. He gave feedback on this,
> > which is why I raised it, if you search through previous threads you can find
> > it.
> >
>
> Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
> here as it was mentioned here :)
>
> >>
> >> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
> >> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.
> >>
> >> We can chown the cgroup so that the property is set by unprivileged process.
> >>
> >> Having the process swap with setuid when the unprivileged process has swap disabled
> >> in the cgroup is not the right behaviour. What currently happens is that the process
> >> after obtaining the higher privilege level doesn't swap as well.
> >>
> >> Similarly for hugepages, if it was a cgroup level setting, having the process give
> >> hugepages always with setuid when the unprivileged user had it disabled it or vice versa
> >> would not be the right behaviour.
> >>
> >> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
> >> I can tell.
> >>
> >> So madlibs I dont see what the security issue is and why we would need to elevate privileges
> >> to do this.
> >>
> >>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
> >>> so.
> >>>
> >>
> >> I know that this needs to be future proof. But I don't actually know of a real world
> >> usecase where we want to do any of these things for remote processes.
> >> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
> >> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
> >> or Barrys proposal.
> >> All of them are for the process itself (and its children by fork+exec) and not for
> >> remote processes. As we try to make our changes usecase driven, I think we should
> >> not add support for remote processes (which is another reason why I think this might
> >> sit better in prctl).
> >
> > I'm extremely confused as to why you think this propoal is predicated upon
> > remote process manipulation? It was simply suggested as a possibility for
> > increased flexibility.
> >
> > We can just remove this parameter no?
> >
>
> Sure.
>
> > It is entirely orthogonal to the prctl() stuff.
> >
> > Overall at this point I share Matthew's point of view on this - we shouldn't be
> > doing any of this upstream.
>
> As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not
> how it works in the medium term and I dont think it will work even in the long term.
> I will paste my answer from [1] below as well:
>
> If we have 2 workloads on the same server, For e.g. one is database where THPs
> just dont do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isnt?
>
> I added THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container thats running
> that workload with one in which we will benefit with THPs, we unfortunately
> can't just rely on the system doing the right thing.
>
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
>
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and see benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to optin as "madvise".
>
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@xxxxxxxxx/
>
>

Yup I appreciate these points, and we have discussed them I feel quite a
bit :) I echo them.

Nobody says that the interface isn't sucky and THPs are not as transparent
as they should be, nor that we lack decent non-cgroup 'policy'
manipulation.

BUT.

We're talking about adding a permanent hack into the kernel that
force-sets a VMA flag for all VMAs across fork/exec.

I have simply been trying to flesh out the _least worst_ means of
doing this - _if we have to do it_.

That last bit being operative - I have come to think, based on Matthew's
feedback, that the RoI of permanently adding this hack is not a good one.

I think the case remains to be made for that.

> I havent seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If its mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid big realworld
> usecase that is a real blocker for deploying THPs in workloads in servers.

Please exercise patience, upstream moves at its own pace.

>
> Thanks!
> Usama