On 2025/9/11 22:02, Lorenzo Stoakes wrote:
On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote:
Hey Yafang,
On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
For example, workloads running in specific containers or cgroups.
- Allocation context
Whether the allocation occurs during a page fault, khugepaged, swap or
other paths.
- VMA's memory advice settings
MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
PSI system data or associated cgroup PSI metrics
The kernel API of this new BPF hook is as follows,
/**
* @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
* @vma: vm_area_struct associated with the THP allocation
* @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
* BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
* neither is set.
* @tva_type: TVA type for current @vma
* @orders: Bitmask of requested THP orders for this allocation
* - PMD-mapped allocation if PMD_ORDER is set
* - mTHP allocation otherwise
*
* Return: The suggested THP order from the BPF program for allocation. It will
* not exceed the highest requested order in @orders. Return -1 to
* indicate that the original requested @orders should remain unchanged.
*/
typedef int thp_order_fn_t(struct vm_area_struct *vma,
enum bpf_thp_vma_type vma_type,
enum tva_type tva_type,
unsigned long orders);
Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.
This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.
This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
enabled. Note that this capability is currently unstable and may undergo
significant changes—including potential removal—in future kernel versions.
Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
---
[...]
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..525ee22ab598
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@xxxxxxxxx>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+enum bpf_thp_vma_type {
+ BPF_THP_VM_NONE = 0,
+ BPF_THP_VM_HUGEPAGE, /* VM_HUGEPAGE */
+ BPF_THP_VM_NOHUGEPAGE, /* VM_NOHUGEPAGE */
+};
+
+/**
+ * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
+ * BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
+ * neither is set.
+ * @tva_type: TVA type for current @vma
+ * @orders: Bitmask of requested THP orders for this allocation
+ * - PMD-mapped allocation if PMD_ORDER is set
+ * - mTHP allocation otherwise
+ *
+ * Return: The suggested THP order from the BPF program for allocation. It will
+ * not exceed the highest requested order in @orders. Return -1 to
+ * indicate that the original requested @orders should remain unchanged.
A minor documentation nit: the comment says "Return -1 to indicate that the
original requested @orders should remain unchanged". It might be slightly
clearer to say "Return a negative value to fall back to the original
behavior". This would cover all error codes as well ;)
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+ enum bpf_thp_vma_type vma_type,
+ enum tva_type tva_type,
+ unsigned long orders);
Sorry if I'm missing some context here since I haven't tracked the whole
series closely.
Regarding the return value for thp_order_fn_t: right now it returns a
single int order. I was thinking, what if we let it return an unsigned
long bitmask of orders instead? This seems like it would be more flexible
down the road, especially if we get more mTHP sizes to choose from. It
would also make the API more consistent, as bpf_hook_thp_get_orders()
itself returns an unsigned long ;)
I think that adds confusion - as in how an order might be chosen from
those. Also we have _received_ a bitmap of available orders - and the intent
here is to select _which one we should use_.
Yep. Makes sense to me ;)
And this is an experimental feature, behind a flag explicitly labelled as
experimental (and thus subject to change) so if we found we needed to change
things in the future we can.
You're right, I didn't pay enough attention to the fact that this is
an experimental feature. So my suggestions were based on a lack of
context ...
Also, for future extensions, it might be a good idea to add a reserved
flags argument to the thp_order_fn_t signature.
We don't need to do anything like this, as we are behind an experimental flag
and in no way guarantee that this will be used this way going forwards.
For example thp_order_fn_t(..., unsigned long flags).
This would give us aforward-compatible way to add new semantics later
without breaking the ABI and needing a v2. We could just require it to be
0 for now.
There is no ABI.
I mean again to emphasise, this is an _experimental_ feature not to be relied
upon in production.
Thanks for the great work!
Lance
Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really
bring this home, as it's perhaps not all that clear :)
No need for a 'EXPERIMENTAL_' prefix, it was just me missing
the background. Appreciate you clarifying this!
Cheers,
Lance