On Tue, Jul 29, 2025 at 11:08 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > Background
> > ----------
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available memory
> >   for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction triggered
> >   by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for containerized
> >   environments. When multiple workloads share a host, enabling THP without
> >   per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes, unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
>
> The link to [0] is missing. :)

I forgot to add it:
https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/

>
> > policy prototyping tool (test policies via BPF before upstreaming them).
> >
> > Proposed Solution
> > -----------------
> >
> > As suggested by David [0], we introduce a new BPF interface:
> >
> > /**
> >  * @get_suggested_order: Get the suggested highest THP order for allocation
> >  * @mm: mm_struct associated with the THP allocation
> >  * @tva_flags: TVA flags for current context
> >  *             %TVA_IN_PF: Set when in page fault context
> >  *             Other flags: Reserved for future use
> >  * @order: The highest order being considered for this THP allocation.
> >  *         %PUD_ORDER for PUD-mapped allocations
>
> There is no PUD THP yet and the highest THP order is PMD_ORDER. It is better
> to remove the line above to avoid confusion.

Thanks for catching that. I'll remove it.
>
> >  *         %PMD_ORDER for PMD-mapped allocations
> >  *         %PMD_ORDER - 1 for mTHP allocations
> >  *
> >  * Return: Suggested highest THP order to use for allocation. The returned
> >  * order will never exceed the input @order value.
> >  */
> > int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
> >
> > This interface:
> > - Supports both use cases (per-workload tuning + policy prototyping).
> > - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> IIRC, your initial RFC works at VMA level, but this patch targets mm level.
> Is mm sufficient for your use case?

Yes, mm is sufficient for our use cases. We've already deployed a variant
of this patchset in our production environment, and it has been performing
well under our workloads.

> Are you planning to extend the
> BPF interface to VMA in the future? Just curious.

Our use cases don't currently require the VMA. We can add it later if a
clear need arises.

--
Regards
Yafang
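To illustrate the contract of the proposed `get_suggested_order` hook (take the highest order under consideration, return a suggestion that never exceeds it), here is a minimal sketch of a per-workload policy written as plain userspace C. It is not a loadable BPF program and not the patchset's code. Assumptions, all hypothetical: `PMD_ORDER` is 9 (x86-64 with 4 KiB pages), the `TVA_IN_PF` bit value is made up, `mm_struct` is a stub with an invented `latency_sensitive` tag, and returning 0 is taken to mean "fall back to base pages".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed: 2 MiB PMD with 4 KiB base pages, as on x86-64. */
#define PMD_ORDER 9
/* Hypothetical bit value; the real TVA_IN_PF definition may differ. */
#define TVA_IN_PF (1UL << 0)

/* Userspace stand-in for the kernel's mm_struct; the field is hypothetical. */
struct mm_struct {
    bool latency_sensitive; /* per-workload tag set by some external agent */
};

/* Enforce the documented contract: never exceed the input @order. */
static int clamp_order(int suggested, int order)
{
    return suggested < order ? suggested : order;
}

/*
 * Hypothetical policy:
 *  - latency-sensitive workloads get no THP in page-fault context,
 *    avoiding fault-time compaction stalls;
 *  - outside page faults, they are limited to mTHP (below PMD_ORDER);
 *  - all other workloads keep the highest order being considered.
 */
int example_get_suggested_order(struct mm_struct *mm,
                                unsigned long tva_flags, int order)
{
    /* Per the discussion, the highest THP order today is PMD_ORDER. */
    assert(order >= 0 && order <= PMD_ORDER);

    if (mm->latency_sensitive && (tva_flags & TVA_IN_PF))
        return 0;
    if (mm->latency_sensitive)
        return clamp_order(PMD_ORDER - 1, order);
    return order;
}
```

Keeping the per-workload decision keyed on `mm` matches the point made above that mm-level granularity is sufficient; a VMA-aware variant would need an extra argument the proposed signature does not have.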