On Tue, Aug 19, 2025 at 6:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> On 19/08/2025 03:41, Yafang Shao wrote:
> > On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >> On 18/08/2025 06:55, Yafang Shao wrote:
> >>> Background
> >>> ----------
> >>>
> >>> Our production servers consistently configure THP to "never" due to
> >>> historical incidents caused by its behavior. Key issues include:
> >>> - Increased Memory Consumption
> >>>   THP significantly raises overall memory usage, reducing available
> >>>   memory for workloads.
> >>>
> >>> - Latency Spikes
> >>>   Random latency spikes occur due to frequent memory compaction
> >>>   triggered by THP.
> >>>
> >>> - Lack of Fine-Grained Control
> >>>   THP tuning is globally configured, making it unsuitable for
> >>>   containerized environments. When multiple workloads share a host,
> >>>   enabling THP without per-workload control leads to unpredictable
> >>>   behavior.
> >>>
> >>> Due to these issues, administrators avoid switching to madvise or
> >>> always modes unless per-workload THP control is implemented.
> >>>
> >>> To address this, we propose a BPF-based THP policy for flexible
> >>> adjustment. Additionally, as David mentioned [0], this mechanism can
> >>> also serve as a policy prototyping tool (test policies via BPF before
> >>> upstreaming them).
> >>
> >> Hi Yafang,
> >>
> >> A few points:
> >>
> >> The link [0] is mentioned a couple of times in the cover letter, but it
> >> doesn't seem to be anywhere in the cover letter.
> >
> > Oops, my bad.
> >
> >> I am probably missing something here, but the current version won't
> >> accomplish the use case you have described at the start of the cover
> >> letter and are aiming for, right? i.e. THP global policy "never", but
> >> get hugepages on an madvise or always basis.
> >
> > In "never" mode, THP allocation is entirely disabled (except via
> > MADV_COLLAPSE). However, we can achieve the same behavior (and more)
> > using a BPF program, even in "madvise" or "always" mode. Instead of
> > introducing a new THP mode, we dynamically enforce policy via BPF.
> >
> > Deployment steps on our production servers:
> >
> > 1. Initial Setup:
> >    - Set THP mode to "never" (disabling THP by default).
> >    - Attach the BPF program and pin the BPF maps and links.
> >    - Pinning ensures persistence (like a kernel module), preventing
> >      disruption under system pressure.
> >    - A THP whitelist map tracks allowed cgroups (initially empty → no
> >      THP allocations).
> >
> > 2. Enable THP Control:
> >    - Switch THP mode to "always" or "madvise" (BPF now governs actual
> >      allocations).
>
> Ah ok, so I was missing this part. With this solution you will still have
> to change the system policy to madvise or always, and then basically
> disable THP for everyone apart from the cgroups that want it?

Right.

> >
> > 3. Dynamic Management:
> >    - To permit THP for a cgroup, add its ID to the whitelist map.
> >    - To revoke permission, remove the cgroup ID from the map.
> >    - The BPF program can be updated live (policy adjustments require no
> >      task interruption).
> >
> >> I think there was a new THP mode introduced in some earlier revision
> >> where you can switch to it from "never" and then you can use bpf
> >> programs with it, but it's not in this revision? It might be useful to
> >> add your specific use case as a selftest.
> >>
> >> Do we have some numbers on what the overhead of calling the bpf program
> >> is in the page fault path, as it's a critical path?
> >
> > In our current implementation, THP allocation occurs during the page
> > fault path. As such, I have not yet evaluated performance for this
> > specific case.
> > The overhead is expected to be workload-dependent, primarily influenced by:
> > - Memory availability: the presence (or absence) of higher-order free pages
> > - System pressure: contention for memory compaction, NUMA balancing,
> >   or direct reclaim
>
> Yes, I think it might be worth seeing if perf indicates that you are
> spending more time in __handle_mm_fault with this series + bpf program
> attached compared to without?

I will test it.

--
Regards
Yafang
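
For illustration, a minimal sketch of the cgroup-whitelist policy described
in the deployment steps quoted above could look roughly like the BPF program
below. The struct_ops type (bpf_thp_ops), the hook name and its arguments
(get_suggested_order, mm, requested_order), and the map layout are assumptions
made for the example only and may not match the interface in this revision of
the series.

  /*
   * Sketch only: allow THP solely for cgroups whose IDs have been added to
   * the pinned "thp_whitelist" map.  The struct_ops type and hook signature
   * below are assumed; the actual interface proposed in this series may
   * differ.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 4096);
          __type(key, __u64);                   /* cgroup ID */
          __type(value, __u32);                 /* presence means "THP allowed" */
          __uint(pinning, LIBBPF_PIN_BY_NAME);  /* policy survives loader exit */
  } thp_whitelist SEC(".maps");

  /* Hypothetical hook: return the suggested THP order, or 0 to deny THP. */
  SEC("struct_ops/get_suggested_order")
  int BPF_PROG(get_suggested_order, struct mm_struct *mm, int requested_order)
  {
          __u64 cgid = bpf_get_current_cgroup_id();

          /* Whitelisted cgroup: keep the order the kernel asked about. */
          if (bpf_map_lookup_elem(&thp_whitelist, &cgid))
                  return requested_order;

          /* Everyone else: no THP. */
          return 0;
  }

  SEC(".struct_ops.link")
  struct bpf_thp_ops thp_policy = {
          .get_suggested_order = (void *)get_suggested_order,
  };

  char LICENSE[] SEC("license") = "GPL";

Under those assumptions, steps 1-3 above would then amount to loading and
pinning this object (e.g. with libbpf or bpftool), attaching the struct_ops
link, and adding or removing cgroup IDs in the pinned thp_whitelist map
(e.g. bpftool map update / bpftool map delete) as workloads are enrolled or
withdrawn, with no change to the program itself.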