On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote: > > > On 10/06/2025 16:17, Lorenzo Stoakes wrote: > > On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote: > >> > >> > >> On 30/05/2025 14:10, Lorenzo Stoakes wrote: > >>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote: > >>>> > >>>> > >>>> My knowledge is security is limited, so please bare with me, but I actually > >>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for > >>>> doing VM_(NO)HUGEPAGE. > >>>> > >>>> A process can already madvise its own VMAs, and this is just doing that > >>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE > >>>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be > >>>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter > >>>> for that process as well, which again doesnt seem like a security issue > >>>> to me. > >>> > >>> W.R.T. the current process, the Issue is one Jann raised, in relation to > >>> propagation of behaviour to privileged (e.g. setuid) processes. > >>> > >> > >> But what is the actual security issue of having hugepages (or not having them) when > >> the process is running with setuid? > > > > Speak to Jann about this. Security isn't my area. He gave feedback on this, > > which is why I raised it, if you search through previous threads you can find > > it. > > > > Yes, he is in CC here as well. I have read it in the previous thread. Just raising it > here as it was mentioned here :) > > >> > >> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup > >> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}. > >> > >> We can chown the cgroup so that the property is set by unprivileged process. > >> > >> Having the process swap with setuid when the unprivileged process has swap disabled > >> in the cgroup is not the right behaviour. What currently happens is that the process > >> after obtaining the higher privilege level doesn't swap as well. > >> > >> Similarly for hugepages, if it was a cgroup level setting, having the process give > >> hugepages always with setuid when the unprivileged user had it disabled it or vice versa > >> would not be the right behaviour. > >> > >> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as > >> I can tell. > >> > >> So madlibs I dont see what the security issue is and why we would need to elevate privileges > >> to do this. > >> > >>> W.R.T. remote processes, obviously we want to make sure we are permitted to do > >>> so. > >>> > >> > >> I know that this needs to be future proof. But I don't actually know of a real world > >> usecase where we want to do any of these things for remote processes. > >> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and > >> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE > >> or Barrys proposal. > >> All of them are for the process itself (and its children by fork+exec) and not for > >> remote processes. As we try to make our changes usecase driven, I think we should > >> not add support for remote processes (which is another reason why I think this might > >> sit better in prctl). > > > > I'm extremely confused as to why you think this propoal is predicated upon > > remote process manipulation? It was simply suggested as a possibility for > > increased flexibility. > > > > We can just remove this parameter no? > > > > Sure. > > > It is entirely orthogonal to the prctl() stuff. > > > > Overall at this point I share Matthew's point of view on this - we shouldn't be > > doing any of this upstream. > > As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not > how it works in the medium term and I dont think it will work even in the long term. > I will paste my answer from [1] below as well: > > If we have 2 workloads on the same server, For e.g. one is database where THPs > just dont do well, but the other one is AI where THPs do really well. How > will the kernel monitor that the database workload is performing worse > and the AI one isnt? > > I added THP shrinker to hopefully try and do this automatically, and it does > really help. But unfortunately it is not a complete solution. > There are severely memory bound workloads where even a tiny increase > in memory will lead to an OOM. And if you colocate the container thats running > that workload with one in which we will benefit with THPs, we unfortunately > can't just rely on the system doing the right thing. > > It would be awesome if THPs are truly transparent and don't require > any input, but unfortunately I don't think that there is a solution > for this with just kernel monitoring. > > This is just a big hint from the user. If the global system policy is madvise > and the workload owner has done their own benchmarks and see benefits > with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always". > If the global system policy is always and the workload owner has done their own > benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for > the process to optin as "madvise". > > [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@xxxxxxxxx/ > > Yup I appreciate these points, and we have discussed them I feel quite a bit :) I echo them. Nobody says that the interface isn't sucky and THPs are not as transparent as they should be, nor that we lack decent non-cgroup 'policy' manipulation. BUT. We're talking about adding a permanent hack into the kernel that force-sets a VMA flag for all VMAs across fork/exec. I have simply been trying to flesh out the _least worst_ means of doing this - _if we have to do it_. That last bit being operative - I have come to think, based on Matthew's feedback, that the RoI of permanently adding this hack is not a good one. I think the case remains to be made for that. > I havent seen activity on this thread over the past week, but I was hoping > we can reach a consensus on which approach to use, prctl or mctl. > If its mctl and if you don't think this should be done, please let me know > if you would like me to work on this instead. This is a valid big realworld > usecase that is a real blocker for deploying THPs in workloads in servers. Please exercise patience, upstream moves at its own pace. > > Thanks! > Usama