On Sun 11-05-25 16:36:24, Yafang Shao wrote:
> On our HDFS servers with 12 HDDs per server, a HDFS datanode[0] startup
> involves scanning all files and caching their metadata (including dentries
> and inodes) in memory. Each HDD contains approximately 2 million files,
> resulting in a total of ~20 million cached dentries after initialization.
>
> To minimize dentry reclamation, we set vfs_cache_pressure to 1. Despite
> this configuration, memory pressure conditions can still trigger
> reclamation of up to 50% of cached dentries, reducing the cache from 20
> million to approximately 10 million entries. During the subsequent cache
> rebuild period, any HDFS datanode restart operation incurs substantial
> latency penalties until full cache recovery completes.
>
> To maintain service stability, we need to preserve more dentries during
> memory reclamation. The current minimum reclaim ratio (1/100 of total
> dentries) remains too aggressive for our workload. This patch introduces
> vfs_cache_pressure_denom for more granular cache pressure control. The
> configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000]
> effectively maintains the full 20 million dentry cache under memory
> pressure, preventing datanode restart performance degradation.
>
> Link: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes [0]
>
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>

Makes sense. The patch looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza
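For illustration, here is a minimal userspace sketch (not part of the patch) of the arithmetic the patched vfs_pressure_ratio() below performs, using the ~20 million cached dentries from the cover letter. The mult_frac() helper is a simplified local stand-in for the kernel macro so the example builds outside the kernel:

#include <stdio.h>

/* Simplified stand-in for the kernel's mult_frac(); no overflow care. */
static unsigned long mult_frac(unsigned long val, unsigned long num,
			       unsigned long denom)
{
	return val / denom * num + (val % denom) * num / denom;
}

/* Mirrors the patched helper: scaled count of freeable objects that
 * reclaim is told about = freeable * pressure / denom. */
static unsigned long vfs_pressure_ratio(unsigned long freeable,
					unsigned long pressure,
					unsigned long denom)
{
	return mult_frac(freeable, pressure, denom);
}

int main(void)
{
	unsigned long dentries = 20UL * 1000 * 1000;	/* ~20M cached dentries */

	/* Old minimum: vfs_cache_pressure=1 with the fixed denominator 100. */
	printf("pressure=1/100:   %lu objects exposed\n",
	       vfs_pressure_ratio(dentries, 1, 100));
	/* Proposed: vfs_cache_pressure=1, vfs_cache_pressure_denom=10000. */
	printf("pressure=1/10000: %lu objects exposed\n",
	       vfs_pressure_ratio(dentries, 1, 10000));
	return 0;
}

With the old fixed denominator, even vfs_cache_pressure=1 still exposes 200,000 of the 20 million dentries to the shrinker, whereas 1/10000 exposes only 2,000, which is the effect the cover letter is after.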
> ---
>  Documentation/admin-guide/sysctl/vm.rst | 32 ++++++++++++++++---------
>  fs/dcache.c                             | 11 ++++++++-
>  2 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 8290177b4f75..d385985b305f 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -75,6 +75,7 @@ Currently, these files are in /proc/sys/vm:
>  - unprivileged_userfaultfd
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- vfs_cache_pressure_denom
>  - watermark_boost_factor
>  - watermark_scale_factor
>  - zone_reclaim_mode
> @@ -1017,19 +1018,28 @@ vfs_cache_pressure
>  This percentage value controls the tendency of the kernel to reclaim
>  the memory which is used for caching of directory and inode objects.
>
> -At the default value of vfs_cache_pressure=100 the kernel will attempt to
> -reclaim dentries and inodes at a "fair" rate with respect to pagecache and
> -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
> -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
> -never reclaim dentries and inodes due to memory pressure and this can easily
> -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
> -causes the kernel to prefer to reclaim dentries and inodes.
> +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel
> +will attempt to reclaim dentries and inodes at a "fair" rate with respect to
> +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the
> +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0,
> +the kernel will never reclaim dentries and inodes due to memory pressure and
> +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure
> +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries
> +and inodes.
>
> -Increasing vfs_cache_pressure significantly beyond 100 may have negative
> -performance impact. Reclaim code needs to take various locks to find freeable
> -directory and inode objects. With vfs_cache_pressure=1000, it will look for
> -ten times more freeable objects than there are.
> +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may
> +have negative performance impact. Reclaim code needs to take various locks to
> +find freeable directory and inode objects. When vfs_cache_pressure equals
> +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable
> +objects than there are.
>
> +Note: This setting should always be used together with vfs_cache_pressure_denom.
> +
> +vfs_cache_pressure_denom
> +========================
> +
> +Defaults to 100 (minimum allowed value). Requires corresponding
> +vfs_cache_pressure setting to take effect.
>
>  watermark_boost_factor
>  ======================
> diff --git a/fs/dcache.c b/fs/dcache.c
> index bd5aa136153a..ed46818c151c 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -74,10 +74,11 @@
>   * arbitrary, since it's serialized on rename_lock
>   */
>  static int sysctl_vfs_cache_pressure __read_mostly = 100;
> +static int sysctl_vfs_cache_pressure_denom __read_mostly = 100;
>
>  unsigned long vfs_pressure_ratio(unsigned long val)
>  {
> -	return mult_frac(val, sysctl_vfs_cache_pressure, 100);
> +	return mult_frac(val, sysctl_vfs_cache_pressure, sysctl_vfs_cache_pressure_denom);
>  }
>  EXPORT_SYMBOL_GPL(vfs_pressure_ratio);
>
> @@ -225,6 +226,14 @@ static const struct ctl_table vm_dcache_sysctls[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ZERO,
>  	},
> +	{
> +		.procname	= "vfs_cache_pressure_denom",
> +		.data		= &sysctl_vfs_cache_pressure_denom,
> +		.maxlen		= sizeof(sysctl_vfs_cache_pressure_denom),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ONE_HUNDRED,
> +	},
>  };
>
>  static int __init init_fs_dcache_sysctls(void)
> --
> 2.43.5
>

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
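As a usage note, here is a hypothetical snippet that applies the configuration quoted in the cover letter, assuming the new file appears alongside the existing one under /proc/sys/vm as the documentation hunk indicates; the values are the cover letter's example, not a general recommendation:

#include <stdio.h>

/* Write a single value to a sysctl file; minimal error handling. */
static int write_sysctl(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", value);
	return fclose(f);
}

int main(void)
{
	/* Example values from the cover letter: 1/10000. */
	write_sysctl("/proc/sys/vm/vfs_cache_pressure_denom", "10000");
	write_sysctl("/proc/sys/vm/vfs_cache_pressure", "1");
	return 0;
}

In practice the same values would more commonly be set with sysctl(8) or an /etc/sysctl.d drop-in rather than open-coded like this.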