Re: [PATCH] vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations

On Sun 11-05-25 16:36:24, Yafang Shao wrote:
> On our HDFS servers with 12 HDDs per server, an HDFS datanode[0] startup
> involves scanning all files and caching their metadata (including dentries
> and inodes) in memory. Each HDD contains approximately 2 million files,
> resulting in a total of ~20 million cached dentries after initialization.
> 
> To minimize dentry reclamation, we set vfs_cache_pressure to 1. Despite
> this configuration, memory pressure conditions can still trigger
> reclamation of up to 50% of cached dentries, reducing the cache from 20
> million to approximately 10 million entries. During the subsequent cache
> rebuild period, any HDFS datanode restart operation incurs substantial
> latency penalties until full cache recovery completes.
> 
> To maintain service stability, we need to preserve more dentries during
> memory reclamation. The current minimum reclaim ratio (1/100 of total
> dentries) remains too aggressive for our workload. This patch introduces
> vfs_cache_pressure_denom for more granular cache pressure control. The
> configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000]
> effectively maintains the full 20 million dentry cache under memory
> pressure, preventing datanode restart performance degradation.
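> 
> For example, this pair of settings could be applied at runtime via
> sysctl; the new knob lands in /proc/sys/vm next to vfs_cache_pressure,
> per the documentation change below:
> 
>   sysctl -w vm.vfs_cache_pressure=1
>   sysctl -w vm.vfs_cache_pressure_denom=10000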
> 
> Link: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes [0]
> 
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>

Makes sense. The patch looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza

> ---
>  Documentation/admin-guide/sysctl/vm.rst | 32 ++++++++++++++++---------
>  fs/dcache.c                             | 11 ++++++++-
>  2 files changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 8290177b4f75..d385985b305f 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -75,6 +75,7 @@ Currently, these files are in /proc/sys/vm:
>  - unprivileged_userfaultfd
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- vfs_cache_pressure_denom
>  - watermark_boost_factor
>  - watermark_scale_factor
>  - zone_reclaim_mode
> @@ -1017,19 +1018,28 @@ vfs_cache_pressure
>  This percentage value controls the tendency of the kernel to reclaim
>  the memory which is used for caching of directory and inode objects.
>  
> -At the default value of vfs_cache_pressure=100 the kernel will attempt to
> -reclaim dentries and inodes at a "fair" rate with respect to pagecache and
> -swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
> -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
> -never reclaim dentries and inodes due to memory pressure and this can easily
> -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
> -causes the kernel to prefer to reclaim dentries and inodes.
> +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel
> +will attempt to reclaim dentries and inodes at a "fair" rate with respect to
> +pagecache and swapcache reclaim.  Decreasing vfs_cache_pressure causes the
> +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0,
> +the kernel will never reclaim dentries and inodes due to memory pressure and
> +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure
> +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries
> +and inodes.
>  
> -Increasing vfs_cache_pressure significantly beyond 100 may have negative
> -performance impact. Reclaim code needs to take various locks to find freeable
> -directory and inode objects. With vfs_cache_pressure=1000, it will look for
> -ten times more freeable objects than there are.
> +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may
> +have negative performance impact. Reclaim code needs to take various locks to
> +find freeable directory and inode objects. When vfs_cache_pressure equals
> +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable
> +objects than there are.
>  
> +Note: This setting should always be used together with vfs_cache_pressure_denom.
> +
> +vfs_cache_pressure_denom
> +========================
> +
> +Defaults to 100 (the minimum allowed value). It is the denominator
> +against which vfs_cache_pressure is interpreted.
>  
>  watermark_boost_factor
>  ======================
> diff --git a/fs/dcache.c b/fs/dcache.c
> index bd5aa136153a..ed46818c151c 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -74,10 +74,11 @@
>   * arbitrary, since it's serialized on rename_lock
>   */
>  static int sysctl_vfs_cache_pressure __read_mostly = 100;
> +static int sysctl_vfs_cache_pressure_denom __read_mostly = 100;
>  
>  unsigned long vfs_pressure_ratio(unsigned long val)
>  {
> -	return mult_frac(val, sysctl_vfs_cache_pressure, 100);
> +	return mult_frac(val, sysctl_vfs_cache_pressure, sysctl_vfs_cache_pressure_denom);
>  }
>  EXPORT_SYMBOL_GPL(vfs_pressure_ratio);
>  
> @@ -225,6 +226,14 @@ static const struct ctl_table vm_dcache_sysctls[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ZERO,
>  	},
> +	{
> +		.procname	= "vfs_cache_pressure_denom",
> +		.data		= &sysctl_vfs_cache_pressure_denom,
> +		.maxlen		= sizeof(sysctl_vfs_cache_pressure_denom),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ONE_HUNDRED,
> +	},
>  };
>  
>  static int __init init_fs_dcache_sysctls(void)
> -- 
> 2.43.5
> 
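
For reference, the effect of the new denominator on the object count the
dentry shrinker sees as freeable can be sketched with a small user-space
program. This is a simplified model of vfs_pressure_ratio() above, using a
hypothetical standalone pressure_ratio() helper; the kernel's mult_frac()
additionally guards against overflow in the intermediate product:

  #include <stdio.h>

  /*
   * Simplified model of vfs_pressure_ratio(): the freeable object
   * count is scaled by pressure/denom before it is reported to the
   * shrinker.
   */
  static unsigned long pressure_ratio(unsigned long val,
                                      unsigned long pressure,
                                      unsigned long denom)
  {
          return val * pressure / denom;
  }

  int main(void)
  {
          unsigned long dentries = 20000000UL;    /* ~20M cached dentries */

          /* default 100/100: full count exposed to reclaim ("fair" rate) */
          printf("default (100/100):  %lu\n",
                 pressure_ratio(dentries, 100, 100));
          /* old floor 1/100: still 200k objects offered per scan */
          printf("old floor (1/100):  %lu\n",
                 pressure_ratio(dentries, 1, 100));
          /* with this patch, 1/10000: only 2k objects, cache kept intact */
          printf("new knob (1/10000): %lu\n",
                 pressure_ratio(dentries, 1, 10000));
          return 0;
  }

Run, it prints 20000000, 200000 and 2000: the 1/100 floor the commit
message calls too aggressive, and the 1/10000 ratio it configures instead.
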
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



