On Mon, Jun 02, 2025 at 07:01:56PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 05:14:23PM -0400, Johannes Weiner wrote:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly
> > > regressing performance on some unknown workloads. Just try
> > > Barry's proposal, see if anyone actually complains or if we're
> > > just afraid of our own shadows.
> >
> > I actually explained why I think this is a terrible idea. But okay,
> > I tried the patch anyway.
>
> Sorry, I must've missed that one ;-( Apologies for my tone. The
> discussion is spread out over too many threads...

> > This is 'git log' on a hot kernel repo after a large IO stream:
> >
> >                                    VANILLA              BARRY
> > Real time                   49.93 (  +0.00%)      60.36 (  +20.48%)
> > User time                   32.10 (  +0.00%)      32.09 (   -0.04%)
> > System time                 14.41 (  +0.00%)      14.64 (   +1.50%)
> > pgmajfault                9227.00 (  +0.00%)   18390.00 (  +99.30%)
> > workingset_refault_file    184.00 (  +0.00%)  236899.00 (+127954.05%)
> >
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> >
> > The whole point is to cache across processes and their various
> > apertures into a common, long-lived filesystem space.
> >
> > Barry knows something about the relationship between certain
> > processes and certain files that he could exploit with
> > MADV_COLD-on-exit semantics. But that's not something the kernel
> > can safely assume. Not without defeating the page cache for an
> > entire class of file accesses.
>
> So what about distinguishing between exited-normally processes (ie
> git log) vs killed-by-oom processes (ie Barry's usecase)? Update the
> referenced bit in the first case and not the second?

In cloud environments, it's common to restart a workload immediately
after an OOM kill. The hosts tend to handle a fairly dynamic mix of
batch jobs and semi-predictable user request load, all while also
trying to target decent average host utilization. Adapting to
external load peaks is laggy (spawning new workers, rebalancing).

In such setups, OOM conditions are generally assumed to be highly
transient, and quick restarts are important to avoid cascading
failures in the worker pool during load peaks.

So I don't think OOM is a good universal signal that the workload is
gone and the memory is cold.
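
To be concrete, IIUC the proposal would amount to something like this
in the pte zap path, where the hardware young bits currently get
propagated into the folio on unmap (untested sketch, using the
helpers that exist in mm/memory.c today):

	if (pte_young(ptent) && likely(vma_has_recency(vma)) &&
	    !tsk_is_oom_victim(current))
		folio_mark_accessed(folio);

But per the above, being an OOM victim doesn't mean the cache is
cold: the restarted worker will often fault the same data right back
in.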
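
The userspace route mentioned above would look something like this
for a worker that knows its successor won't reuse the file (minimal
sketch, the path is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	void *map;
	int fd;

	fd = open("/path/to/dataset.bin", O_RDONLY); /* made-up path */
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* ... worker reads the mapping, heating up the cache ... */

	/*
	 * Exit path: this process knows the data won't be needed
	 * again, so deactivate the pages and make them preferred
	 * reclaim targets before unmapping.
	 */
	if (madvise(map, st.st_size, MADV_COLD))
		perror("madvise(MADV_COLD)");

	munmap(map, st.st_size);
	close(fd);
	return 0;
}

Note that MADV_COLD (5.4+) only deactivates the pages; MADV_PAGEOUT
would reclaim them right away. Either way, the decision stays with
the one party that actually knows the access pattern.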