Shakeel Butt <shakeel.butt@xxxxxxxxx> writes:

> On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>>
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>>
>> It provides a generic interface which is called before the existing
>> OOM killer code and allows implementing any policy, e.g. picking a
>> victim task or memory cgroup, or potentially even releasing memory in
>> other ways, e.g. deleting tmpfs files (the last one might require
>> some additional but relatively simple changes).
>
> The releasing memory part is really interesting and useful. I can see
> much more reliable and targeted oom reaping with this approach.
>
>>
>> The past attempt to implement a memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is. As it's
>> highly workload-dependent and specific to a concrete way of
>> organizing workloads, the structure of the cgroup tree, etc.,
>
> and user space policies: e.g. Google has very clear priorities among
> concurrently running workloads while many other users do not.
>
>> a customizable
>> bpf-based implementation is preferable over an in-kernel
>> implementation with a dozen of sysctls.
>
> +1
>
>>
>> The second part is related to the fundamental question of when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills with the associated loss of work, and the risk
>> of infinite thrashing and effective soft lockups. In the last few
>> years several PSI-based userspace solutions were developed (e.g.
>> OOMd [3] or systemd-OOMd [4]
>
> and Android's LMKD (https://source.android.com/docs/core/perf/lmkd)
> uses PSI too.
>
>> ). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls. In this scenario the userspace daemon was supposed to handle
>> the majority of OOMs, while the in-kernel OOM killer worked as a
>> last-resort measure to guarantee that the system would never deadlock
>> on memory. But this approach creates additional infrastructure churn:
>> a userspace OOM daemon is a separate entity which needs to be
>> deployed, updated and monitored. A completely different pipeline
>> needs to be built to monitor both types of OOM events and collect the
>> associated logs. A userspace daemon is more restricted in terms of
>> what data is available to it. Implementing a daemon which can work
>> reliably under heavy memory pressure is also tricky.
>
> Thanks for raising this; it is really challenging on very aggressively
> overcommitted systems. The userspace oom-killer needs cpu (or
> scheduling) and memory guarantees, as it needs to run and collect
> stats to decide who to kill. Even with that, it can still get stuck on
> some global kernel lock (I remember at Google I have seen their
> userspace oom-killer, which was a thread in borglet, stuck on cgroup
> mutex or kernfs lock or something). Anyway, I see a lot of potential
> in this BPF based oom-killer.
>
> Orthogonally I am wondering if we can enable actions other than
> killing. For example some workloads might prefer to get frozen or
> migrated away instead of being killed.

Absolutely, handling PSI events in the kernel (via BPF) opens a broad
range of possibilities: e.g. we can tune cgroup knobs, freeze/unfreeze
tasks, remove tmpfs files, promote/demote memory to other tiers, etc.
I was also thinking about tuning readahead based on the memory
pressure. A rough sketch of what such a handler could look like is
below.

Thanks!
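
To illustrate the idea, here is a minimal, untested sketch of a PSI
handler written as a bpf struct_ops program. The struct_ops name
(bpf_psi_ops), the handle_psi_event() callback and its argument are
made up for this example rather than taken from the series, and the
vmlinux.h used to build it would need the corresponding kernel types;
only the libbpf struct_ops plumbing itself is existing API.

/*
 * Hypothetical sketch: react to a memory pressure (PSI) event from BPF.
 * bpf_psi_ops / handle_psi_event() and the struct cgroup argument are
 * illustrative assumptions, not the actual interface of this series.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("struct_ops/handle_psi_event")
int BPF_PROG(handle_psi_event, struct cgroup *cgrp)
{
	/*
	 * A real policy would look at memory stats here and act: pick
	 * an OOM victim, freeze or reconfigure the cgroup, drop tmpfs
	 * files, adjust readahead, etc. Those actions need kfuncs
	 * exposed by the series; this sketch only logs that the event
	 * fired.
	 */
	bpf_printk("memory pressure event received");
	return 0;
}

SEC(".struct_ops.link")
struct bpf_psi_ops psi_handler = {
	.handle_psi_event = (void *)handle_psi_event,
};

Loading it from userspace would be the usual struct_ops flow:
bpf_object__open()/load() followed by bpf_map__attach_struct_ops() on
the psi_handler map.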