On 19/08/2025 04:08, Yafang Shao wrote:
>> Hi Yafang,
>>
>> From the cover letter, one of the potential use cases you are trying to
>> solve for is when the global policy is "never" but the workload wants THPs
>> (either always or on a madvise basis). But over here, MMF_VM_HUGEPAGE will
>> never be set, so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
>> always evaluate to false and the get_suggested_order() call doesn't matter?
>
> See the reply in another thread.
>
>>
>>
>>
>>> 		__khugepaged_enter(mm);
>>>  }
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 4108bcd96784..d10089e3f181 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>>
>>>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>>>
>>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>>> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
>>> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>>> +
>>> +	help
>>> +	  Enable dynamic THP order selection using BPF programs. This
>>> +	  experimental feature allows custom BPF logic to determine optimal
>>> +	  transparent hugepage allocation sizes at runtime.
>>> +
>>> +	  Warning: This feature is unstable and may change in future kernel
>>> +	  versions.
>>> +
>>
>> I know there was a discussion on this earlier, but my opinion is that
>> putting all of this in as experimental, with warnings, is not great. No one
>> will be able to deploy this in production if it is going to be removed, and
>> I believe that's where the real usage is.
>
> See the reply in another thread.
>
>>
>>>  endif # TRANSPARENT_HUGEPAGE
>>>
>>>  # simple helper to make the code a bit easier to read
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index ef54aa615d9d..cb55d1509be1 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>>> new file mode 100644
>>> index 000000000000..2b03539452d1
>>> --- /dev/null
>>> +++ b/mm/bpf_thp.c
>>> @@ -0,0 +1,186 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +struct bpf_thp_ops {
>>> +	/**
>>> +	 * @get_suggested_order: Get the suggested THP orders for allocation
>>> +	 * @mm: mm_struct associated with the THP allocation
>>> +	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>>> +	 *                 When NULL, the decision should be based on @mm (i.e., when
>>> +	 *                 triggered from an mm-scope hook rather than a VMA-specific
>>> +	 *                 context).
>>> +	 *                 Must belong to @mm (guaranteed by the caller).
>>> +	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>>> +	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>>> +	 * @orders: Bitmask of requested THP orders for this allocation
>>> +	 *          - PMD-mapped allocation if PMD_ORDER is set
>>> +	 *          - mTHP allocation otherwise
>>> +	 *
>>> +	 * Return: Bitmask of suggested THP orders for allocation. The highest
>>> +	 *         suggested order will not exceed the highest requested order
>>> +	 *         in @orders.
>>
>> If we want to make this generic enough so that it doesn't change, should
>> we allow the suggested order to exceed the highest requested order?
>
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
>
> We must respect this upper bound to avoid undefined behavior.

Ack, makes sense.

>
>>
>>> +	 */
>>> +	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>>> +};
>>> +
>>> +static struct bpf_thp_ops bpf_thp;
>>> +static DEFINE_SPINLOCK(thp_ops_lock);
>>> +
>>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +			u64 vma_flags, enum tva_type tva_flags, int orders)
>>> +{
>>> +	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +				   u64 vma_flags, enum tva_type tva_flags, int orders);
>>> +	int suggested_orders = orders;
>>> +
>>> +	/* No BPF program is attached */
>>> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>> +		      &transparent_hugepage_flags))
>>> +		return suggested_orders;
>>> +
>>> +	rcu_read_lock();
>>> +	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>>> +	if (!bpf_suggested_order)
>>> +		goto out;
>>
>> My RCU API knowledge is not the best, but maybe we could do:
>>
>>	if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>>		return suggested_orders;
>>
>
> There might be a race here. The rcu_access_pointer() check would occur
> outside the RCU read-side critical section, meaning the protected pointer
> could be freed between the check and its use. Therefore, we must perform
> the NULL check within the RCU read-side critical section when
> dereferencing the pointer.

Ack, makes sense.