Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process

David Hildenbrand <david@xxxxxxxxxx> · Thu, 15 May 2025 21:17:28 +0200

On 15.05.25 20:36, Lorenzo Stoakes wrote:
One thing I wanted to emphasise - shouldn't we invoke
khugepaged_enter_vma() for VMAs we set VM_HUGEPAGE for?

I note in __mmap_new_vma() we do so for non-anon VMAs _prior_ to invoking
vma_set_thp_policy().

I _really_ hate putting in conditional logic like this for specific
cases. Feels very hacky.

On Thu, May 15, 2025 at 05:47:34PM +0100, Usama Arif wrote:


On 15/05/2025 17:06, Lorenzo Stoakes wrote:

Its doing the same as KSM as suggested by David. Does KSM break these tests?
Is there some specific test you can point to that I can run that is breaking
with this patch and not without it?

They don't build, at all. I explain how you can attempt a build below.


Ah yes, I initially thought by break you meant they were failing. I saw that,
will fix it.

And no, KSM doesn't break the tests, because steps were taken to make them
not break the tests :) I mean it's really easy - it's just adding some
trivial stubs.


Yes, will do Thanks!

Thanks.


If you need help with it just ping me in whatever way helps and I can help!

It's understandable as it's not necessarily clear this is a thing (maybe we
need self tests to build it, but that might break CI setups so unclear).

The merging is much more important!

To re-emphasise ^ the merging is key - I will unfortunately have to NACK
any series that breaks it. So this is an absolute requirement here.




I really feel this series needs to be an RFC until we can get some
consensus on how to approach this.

There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@xxxxxxxxxx/

I disagree with this asssessment, that doesn't look like consensus at all,
I think at least this is a very contentious or at least _complicated_ topic
that we need to really dig into.

So in my view - it's this kind of situation that warrants an RFC until
there's some stabilisation and agreement on a way forward.


Sure will change next revision to RFC, unless hopefully maybe we can
get a consensus in this revision :)

Thanks!




On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
This is set via the new PR_SET_THP_POLICY prctl.

What is?

You're making very major changes here, including adding a new flag to
mm_struct (!!) and the explanation/justification for this is missing.


I have added the justification in your reply to the coverletter.

As stated there, you've not explained why alternatives are unworkable, I
think we need this!

Sort of:

1. Why not cgroups? blah blah blah
2. Why not process_madvise()? blah blah blah
3. Why not bpf? blah blah blah
4. Why not <something I've not thought of>? blah blah blah


I will add this in the next cover letter.

Thanks




This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
which changes the default of new VMAs to be VM_HUGEPAGE. The
call also modifies all existing VMAs that are not VM_NOHUGEPAGE
to be VM_HUGEPAGE. The policy is inherited during fork+exec.

So you can only set this flag?


??

This patch is only allowing the setting of this flag. I am asking 'so you
can only set this flag?'

To which it appears the answer is, yes I think :)

An improved cover letter could say something like:

"
Here we implement the first flag intended to allow the _overriding_ of huge
page policy to ensure that, when
/sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
to maintain fine-grained control of individual processes, including any
fork/exec'd, by setting this flag.

In subsequent commits, we intend to permit further such control.
"



This allows systems where the global policy is set to "madvise"
to effectively have THPs always for the process. In an environment
where different types of workloads are stacked on the same machine,
this will allow workloads that benefit from always having hugepages
to do so, without regressing those that don't.

Again, this explanation really makes no sense at all to me, I don't really
know what you mean, you're not going into what you're doing in this change,
this is just a very unclear commit message.


I hope this is answered in my reply to your coverletter.

You still need to improve the cover letter here I think, see above for a
suggestion!


Sure, will do in the next revision, Thanks!

Thanks



Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
---
  include/linux/huge_mm.h                       |  3 ++
  include/linux/mm_types.h                      | 11 +++++++
  include/uapi/linux/prctl.h                    |  4 +++
  kernel/fork.c                                 |  1 +
  kernel/sys.c                                  | 21 ++++++++++++
  mm/huge_memory.c                              | 32 +++++++++++++++++++
  mm/vma.c                                      |  2 ++
  tools/include/uapi/linux/prctl.h              |  4 +++
  .../trace/beauty/include/uapi/linux/prctl.h   |  4 +++
  9 files changed, 82 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..e652ad9ddbbd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
  	return orders;
  }

+void vma_set_thp_policy(struct vm_area_struct *vma);

This is a VMA-specific function but you're putting it in huge_mm.h? Why
can't
this be in vma.h or vma.c?


Sure can move it there.

+void process_vmas_thp_default_huge(struct mm_struct *mm);

'vmas' is redundant here.


Sure.
+
  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  					 unsigned long vm_flags,
  					 unsigned long tva_flags,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..2fe93965e761 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1066,6 +1066,7 @@ struct mm_struct {
  		mm_context_t context;

  		unsigned long flags; /* Must use atomic bitops to access */
+		unsigned long flags2;


Ugh, god really??

I really am not a fan of adding flags2 just to add a prctl() feature like
this. This is crazy.

Also this is a TERRIBLE name. I mean, no please PLEASE no.

Do we really have absolutely no choice but to add a new flags field here?

It again doesn't help that you don't mention nor even try to justify this
in the commit message or cover letter.


And again, I hope my reply to your email has given you the justification.

No :) I understood why you did this though of course.


If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
let's just go make this flags field 64-bit on 32-bit kernels.

I mean - I'm kind of insisting we do that to be honest. Because I really
don't like this.


If the maintainers want this, I will make it a 64 bit only feature. We
are only using it for 64 bit servers. But it will probably mean ifdef
config 64 bit in a lot of places.

I'm going to presume you are including me in this category rather than
implying that you are deferring only to others :)


Yes ofcourse! I mean all maintainers :)

And hopefully everyone else as well :)

So, there's another option:

Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
kernels, which solves this problem everywhere and avoids us wasting a bunch
of memory for a very specific usecase, splitting flag state across 2 fields
(which are no longer atomic as a whole of course), adding confusion,
possibly subtly breaking anywhere that assumes mm->flags completely
describes mm-granularity flag state etc.


This is probably a very basic question, but by make mm_struct->flags 64-bit on 32-bit
do you mean convert flags to unsigned long long when !CONIFG_64BIT?

Yes. This would be a worthwhile project in its own right.

I will be changing vma->vm_flags in the same way soon.


The RoI here is not looking good, otherwise.



Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
atomic bitops?...


Sure


  #ifdef CONFIG_AIO
  		spinlock_t			ioctx_lock;
@@ -1744,6 +1745,11 @@ enum {
  				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
  				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)

+#define MMF2_THP_VMA_DEFAULT_HUGE		0

I thought the whole idea was to move away from explicitly refrencing 'THP'
in a future where large folios are implicit and now we're saying 'THP'.

Anyway the 'VMA' is totally redundant here.


Sure, I can remove VMA.
I see THP everywhere in the kernel code.
Its mentioned 108 times in transhuge.rst alone :)
If you have any suggestion to rename this flag, happy to take it :)

Yeah I mean it's a mess man, and it's not your fault... Again naming is
hard, I put a suggestion in reply to cover letter anyway...


+#define MMF2_THP_VMA_DEFAULT_HUGE_MASK		(1 << MMF2_THP_VMA_DEFAULT_HUGE)

Do we really need explicit trivial mask declarations like this?


I have followed the convention that has existed in this file, please see below
links :)
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582

Ack, yuck but ack.



+
+#define MMF2_INIT_MASK		(MMF2_THP_VMA_DEFAULT_HUGE_MASK)

+
  static inline unsigned long mmf_init_flags(unsigned long flags)
  {
  	if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
@@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
  	return flags & MMF_INIT_MASK;
  }

+static inline unsigned long mmf2_init_flags(unsigned long flags)
+{
+	return flags & MMF2_INIT_MASK;
+}
+
  #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..325c72f40a93 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2

+#define PR_SET_THP_POLICY		78
+#define PR_GET_THP_POLICY		79
+#define PR_THP_POLICY_DEFAULT_HUGE	0
+
  #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e4616dacd82..6e5f4a8869dc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,

  	if (current->mm) {
  		mm->flags = mmf_init_flags(current->mm->flags);
+		mm->flags2 = mmf2_init_flags(current->mm->flags2);
  		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
  	} else {
  		mm->flags = default_dump_filter;
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..1115f258f253 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
  		mmap_write_unlock(me->mm);
  		break;
+	case PR_GET_THP_POLICY:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))

I really don't think we need the !!? Do we?

I have followed the convention that has existed in this file already,
please see:
https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644

OK, but please don't, I don't see why this is necessary. if (truthy) is
fine.

Unless somebody has a really good reason why this is necessary, it's just
ugly ceremony.


Agreed :)

Thanks




Shouldn't we lock the mm when we do this no? Can't somebody change this?


It wasn't locked in PR_GET_THP_DISABLE
https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644

I can acquire do mmap_write_lock_killable the same as PR_SET_THP_POLICY
in the next series.

I can also add the lock in PR_GET_THP_DISABLE.

Well, the issue I guess is... if the flags field is atomic, and we know
over this call maybe we can rely on mm sticking around, then we probalby
don't need an mmap lock actually.


+			error = PR_THP_POLICY_DEFAULT_HUGE;

Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
returning here? :)

I see a few of the PR_GET_.. setting the return value. I hope I didnt
misinterpret that.

Yeah I thought it might be the case. I reemphasise my dislike of prctl().



+		break;
+	case PR_SET_THP_POLICY:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(me->mm))
+			return -EINTR;
+		switch (arg2) {
+		case PR_THP_POLICY_DEFAULT_HUGE:
+			set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+			process_vmas_thp_default_huge(me->mm);
+			break;
+		default:
+			return -EINVAL;

Oh I just noticed - this is really broken - you're not unlocking the mmap()
here on error... :) you definitely need to fix this.


Ah yes, will do Thanks!

Thanks


+		}
+		mmap_write_unlock(me->mm);
+		break;
  	case PR_MPX_ENABLE_MANAGEMENT:
  	case PR_MPX_DISABLE_MANAGEMENT:
  		/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..64f66d5295e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
  	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
  }

+void vma_set_thp_policy(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
+		vm_flags_set(vma, VM_HUGEPAGE);
+}
+
+static void vmas_thp_default_huge(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	unsigned long vm_flags;
+
+	VMA_ITERATOR(vmi, mm, 0);

This is a declaration, it should be grouped with declarations...


Sure, will make the change in next version.

Unfortunately checkpatch didn't complain.

Checkpatch actually complains the other way :P it doesn't understand
macros.

So you'll start getting a warning here, which you can ignore. It sucks, but
there we go. Making checkpatch.pl understand that would be a pain, probs.


+	for_each_vma(vmi, vma) {
+		vm_flags = vma->vm_flags;
+		if (vm_flags & VM_NOHUGEPAGE)
+			continue;

Literally no point in you putting vm_flags as a separate variable here.


Sure, will make the change in next version.

Thanks!


So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
is to override global 'never'?


Again, I am not overriding never.

hugepage_global_always and hugepage_global_enabled will evaluate to false
and you will not get a hugepage.

Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
if the policy is never.

And we now get into realms of:

'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
says I've set that, but nothing is huge? BUG???'

Of course then you get into - if somebody sets it to never, do we go around
and remove VM_HUGEPAGE and this MMF_ flag?



I'm really concerned about this.

+		vm_flags_set(vma, VM_HUGEPAGE);
+	}
+}

Do we have an mmap write lock established here? Can you confirm that? Also
you should add an assert for that here.


Yes I do, its only called in PR_SET_THP_POLICY where mmap_write lock was taken.
I can add an assert if it helps.

It not only helps, it's utterly critical :)

'It's only called in xxx()' is famous last words for a programmer, because
later somebody (maybe even your good self) calls it from somewhere else
and... we've all been there...


Thanks! Will do.

Thanks.


+
+void process_vmas_thp_default_huge(struct mm_struct *mm)
+{
+	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
+		return;
+
+	set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
+	vmas_thp_default_huge(mm);
+}
+
+
  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  					 unsigned long vm_flags,
  					 unsigned long tva_flags,
diff --git a/mm/vma.c b/mm/vma.c
index 1f2634b29568..101b19c96803 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
  	if (!vma_is_anonymous(vma))
  		khugepaged_enter_vma(vma, map->flags);
  	ksm_add_vma(vma);
+	vma_set_thp_policy(vma);

You're breaking VMA merging completely by doing this here...

Now I can map one VMA with this policy set, then map another immediately
next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
set in the new VMA on merge attempt.

I realise KSM is just as broken (grr) but this doesn't justify us
completely breaking VMA merging here.

I think this answers it. Its doing the same as KSM.

Yes, but as I said there, it's not acceptable, at all.

You're making it so litearlly VMA merging _does not happen at all_. That's
unacceptable and might even break some workloads.

You'll certainly cause very big kernel metadata usage.

Consider:

|-----------------------------|..................|
| some VMA flags, VM_HUGEPAGE | proposed new VMA |
|-----------------------------|..................|

Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
_always_ be fragmented, forever.


So if __mmap_new_vma and do_brk_flags are called after merge attempt,
is it possible to vma_set_thp_policy (or do something similar) before
the merge attempt?

Actually I just read your reply to the next block, so I think its ok?
Added more to the next block.

I dont have any preference on where its put, so happy with putting this
earlier.

Yeah, you can just do it earlier. But you maybe should just set the flag in
the appropriate field rather than using the set flags helper.



That's just not... acceptable.

The fact KSM is broken this way doesn't make that OK.

Especially on brk(), which now will _always_ allocate new VMAs for every
brk() expansion which doesn't seem very efficient.

It may also majorly degrade performance.

That makes me think we need some perf testing for this ideally...



You need to set earlier than this. Then of course a driver might decide to
override this, so maybe then we need to override that.

But then we're getting into realms of changing fundamental VMA code _just
for this feature_.

Again I'm iffy about this. Very.

Also you've broken the VMA userland tests here:

$ cd tools/testing/vma
$ make
...
In file included from vma.c:33:
../../../mm/vma.c: In function ‘__mmap_new_vma’:
../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
  2486 |         vma_set_thp_policy(vma);
       |         ^~~~~~~~~~~~~~~~~~
       |         vma_dup_policy
make: *** [<builtin>: vma.o] Error 1

You need to create stubs accordingly.


Thanks will do.

Thanks!


  	*vmap = vma;
  	return 0;

@@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
  	mm->map_count++;
  	validate_mm(mm);
  	ksm_add_vma(vma);
+	vma_set_thp_policy(vma);

You're breaking merging again... This is quite a bad case too as now you'll
have totally fragmented brk VMAs no?


Again doing it the same as KSM.

That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
mean grr at KSM :) I'm going to look into that and see about
investigating/fixing that behaviour.

obviously I can't accept anything that will fundamentally break VMA
merging.


Ofcourse!

The answer really is to do this earlier, but you risk a driver overriding
it, but that's OK I think (I don't even think any in-tree ones do actually
_anywhere_ - and yes I was literally reading through _every single_ .mmap()
callback lately because I am quite obviously insane ;)

Again I can help with this.


Appreaciate it!

I am actually not familiar with the merge code. I will try and have a look,
but if you could give a pointer to the file:line after which its not acceptable
to have and I can move vma_set_thp_policy to before it or try and do something
similar to that.

Ack.

I wrote the latest merge and mmap() code so am well placed on this :>)

But I don't think we should use vma_set_thp_policy() in these places, we
should just set the flag, to avoid trying to do a write lock etc. etc.,
plus we want to set the flag in a place that's not a VMA yet in both cases.

So we'd need something like in do_mmap():

+	vm_flags |= mm_implied_vma_flags(mm);
	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);

Where mm_implied_vma_flags() reads the MMF flags and sees if any imply VMA
flags.

But we have something for that already don't we? mm->def_flags.

Can't we use that actually? That should work for mmap too?

Yeah, look at

commit 1860033237d4be09c5d7382585f0c7229367a534
Author: Michal Hocko <mhocko@xxxxxxxx>
Date:   Mon Jul 10 15:48:02 2017 -0700

    mm: make PR_SET_THP_DISABLE immediately active


Where we moved away from that. As raised, I am not sure I like what we 
did with PR_SET_THP_DISABLE.

And I don't want any new magical prctl like that that add new magical 
internal toggles.

OTOH, I am completely fine with a prctl that just changes the default 
for new VMAs (just like applying madvise imemdiately afterwards). I'm 
also fine with a prtctl that changes all existing VMAs, but maybe just 
issuing a madvise() is the better solution, to cleanly separate it.

All not too crazy and not too invasive -- piggybagging on VM_HUGEPAGE / 
VM_NOHUGEPAGE.

I can life with that if it solves a use case.

--
Cheers,

David / dhildenb