On Mon, Jun 02, 2025 at 07:01:56PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 05:14:23PM -0400, Johannes Weiner wrote:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly
> > > regressing performance on some unknown workloads. Just try
> > > Barry's proposal, see if anyone actually complains or if we're
> > > just afraid of our own shadows.
> >
> > I actually explained why I think this is a terrible idea. But okay,
> > I tried the patch anyway.
>
> Sorry, I must've missed that one ;-( Apologies for my tone. The
> discussion is spread out over too many threads...

> > This is 'git log' on a hot kernel repo after a large IO stream:
> >
> >                                    VANILLA              BARRY
> > Real time                   49.93 (  +0.00%)      60.36 (  +20.48%)
> > User time                   32.10 (  +0.00%)      32.09 (   -0.04%)
> > System time                 14.41 (  +0.00%)      14.64 (   +1.50%)
> > pgmajfault                9227.00 (  +0.00%)   18390.00 (  +99.30%)
> > workingset_refault_file    184.00 (  +0.00%)  236899.00 (+127954.05%)
> >
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> >
> > The whole point is to cache across processes and their various
> > apertures into a common, long-lived filesystem space.
> >
> > Barry knows something about the relationship between certain
> > processes and certain files that he could exploit with
> > MADV_COLD-on-exit semantics. But that's not something the kernel
> > can safely assume. Not without defeating the page cache for an
> > entire class of file accesses.
>
> So what about distinguishing between exited-normally processes (ie
> git log) vs killed-by-oom processes (ie Barry's usecase)? Update the
> referenced bit in the first case and not the second?

In cloud environments, it's common to restart a workload immediately
after an OOM kill. The hosts tend to handle a fairly dynamic mix of
batch jobs and semi-predictable user request load, all while also
trying to target decent average host utilization. Adapting to
external load peaks is laggy (spawning new workers, rebalancing).

In such setups, OOM conditions are generally assumed to be highly
transient, and quick restarts are important to avoid cascading
failures in the worker pool during load peaks.

So I don't think OOM is a good universal signal that the workload is
gone and the memory is cold.
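
To be concrete, IIUC the proposal would amount to something like this
in the pte zap path, where the hardware young bits currently get
propagated into the folio on unmap (untested sketch, using the
helpers that exist in mm/memory.c today):

	if (pte_young(ptent) && likely(vma_has_recency(vma)) &&
	    !tsk_is_oom_victim(current))
		folio_mark_accessed(folio);

But per the above, being an OOM victim doesn't mean the cache is
cold: the restarted worker will often fault the same data right back
in.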
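
The userspace route mentioned above would look something like this
for a worker that knows its successor won't reuse the file (minimal
sketch, the path is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	void *map;
	int fd;

	fd = open("/path/to/dataset.bin", O_RDONLY); /* made-up path */
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* ... worker reads the mapping, heating up the cache ... */

	/*
	 * Exit path: this process knows the data won't be needed
	 * again, so deactivate the pages and make them preferred
	 * reclaim targets before unmapping.
	 */
	if (madvise(map, st.st_size, MADV_COLD))
		perror("madvise(MADV_COLD)");

	munmap(map, st.st_size);
	close(fd);
	return 0;
}

Note that MADV_COLD (5.4+) only deactivates the pages; MADV_PAGEOUT
would reclaim them right away. Either way, the decision stays with
the one party that actually knows the access pattern.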