On 19/08/2025 04:08, Yafang Shao wrote:
>> Hi Yafang,
>>
>> From the cover letter, one of the potential use cases you are trying to
>> solve for is when the global policy is "never" but the workload wants THPs
>> (either always or on a madvise basis). But over here, MMF_VM_HUGEPAGE will
>> never be set, so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
>> always evaluate to false and the get_suggested_order() call doesn't matter?
>
> See the reply in another thread.
>
>>
>>
>>
>>> 		__khugepaged_enter(mm);
>>>  }
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 4108bcd96784..d10089e3f181 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>>
>>>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>>>
>>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>>> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
>>> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>>> +
>>> +	help
>>> +	  Enable dynamic THP order selection using BPF programs. This
>>> +	  experimental feature allows custom BPF logic to determine optimal
>>> +	  transparent hugepage allocation sizes at runtime.
>>> +
>>> +	  Warning: This feature is unstable and may change in future kernel
>>> +	  versions.
>>> +
>>
>> I know there was a discussion on this earlier, but my opinion is that
>> putting all of this in as experimental, with warnings, is not great. No one
>> will be able to deploy this in production if it is going to be removed, and
>> I believe that's where the real usage is.
>
> See the reply in another thread.
>
>>
>>>  endif # TRANSPARENT_HUGEPAGE
>>>
>>>  # simple helper to make the code a bit easier to read
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index ef54aa615d9d..cb55d1509be1 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>>> new file mode 100644
>>> index 000000000000..2b03539452d1
>>> --- /dev/null
>>> +++ b/mm/bpf_thp.c
>>> @@ -0,0 +1,186 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +struct bpf_thp_ops {
>>> +	/**
>>> +	 * @get_suggested_order: Get the suggested THP orders for allocation
>>> +	 * @mm: mm_struct associated with the THP allocation
>>> +	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>>> +	 *                 When NULL, the decision should be based on @mm (i.e., when
>>> +	 *                 triggered from an mm-scope hook rather than a VMA-specific
>>> +	 *                 context).
>>> +	 *                 Must belong to @mm (guaranteed by the caller).
>>> +	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>>> +	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>>> +	 * @orders: Bitmask of requested THP orders for this allocation
>>> +	 *          - PMD-mapped allocation if PMD_ORDER is set
>>> +	 *          - mTHP allocation otherwise
>>> +	 *
>>> +	 * Return: Bitmask of suggested THP orders for allocation. The highest
>>> +	 *         suggested order will not exceed the highest requested order
>>> +	 *         in @orders.
>>
>> If we want to make this generic enough so that it doesn't change, should
>> we allow the suggested order to exceed the highest requested order?
>
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
>
> We must respect this upper bound to avoid undefined behavior.

Ack, makes sense.

>
>>
>>> +	 */
>>> +	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>>> +};
>>> +
>>> +static struct bpf_thp_ops bpf_thp;
>>> +static DEFINE_SPINLOCK(thp_ops_lock);
>>> +
>>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +			u64 vma_flags, enum tva_type tva_flags, int orders)
>>> +{
>>> +	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +				   u64 vma_flags, enum tva_type tva_flags, int orders);
>>> +	int suggested_orders = orders;
>>> +
>>> +	/* No BPF program is attached */
>>> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>> +		      &transparent_hugepage_flags))
>>> +		return suggested_orders;
>>> +
>>> +	rcu_read_lock();
>>> +	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>>> +	if (!bpf_suggested_order)
>>> +		goto out;
>>
>> My RCU API knowledge is not the best, but maybe we could do:
>>
>>	if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>>		return suggested_orders;
>>
>
> There might be a race here. The rcu_access_pointer() check would occur
> outside the RCU read-side critical section, meaning the protected pointer
> could be freed between the check and its use. Therefore, we must perform
> the NULL check within the RCU read-side critical section when
> dereferencing the pointer.

Ack, makes sense.