On Mon, Jun 30, 2025 at 5:13 AM Sasha Levin <sashal@xxxxxxxxxx> wrote:
>
> On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
> >On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
> >> When sysctl_nr_open is set to a very high value (for example, 1073741816
> >> as set by systemd), processes attempting to use file descriptors near
> >> the limit can trigger massive memory allocation attempts that exceed
> >> INT_MAX, resulting in a WARNING in mm/slub.c:
> >>
> >>   WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
> >>
> >> This happens because kvmalloc_array() and kvmalloc() check if the
> >> requested size exceeds INT_MAX and emit a warning when the allocation is
> >> not flagged with __GFP_NOWARN.
> >>
> >> Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
> >> process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
> >> - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
> >> - Multiple bitmaps: ~400MB
> >> - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
> >>
> >> Reproducer:
> >> 1. Set /proc/sys/fs/nr_open to 1073741816:
> >>    # echo 1073741816 > /proc/sys/fs/nr_open
> >>
> >> 2. Run a program that uses a high file descriptor:
> >>    #include <unistd.h>
> >>    #include <sys/resource.h>
> >>
> >>    int main() {
> >>        struct rlimit rlim = {1073741824, 1073741824};
> >>        setrlimit(RLIMIT_NOFILE, &rlim);
> >>        dup2(2, 1073741880);  // Triggers the warning
> >>        return 0;
> >>    }
> >>
> >> 3. Observe WARNING in dmesg at mm/slub.c:5027
> >>
> >> systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
> >> maximum possible value. The rationale was that systems with memory
> >> control groups (memcg) no longer need separate file descriptor limits
> >> since memory is properly accounted. However, this change overlooked
> >> that:
> >>
> >> 1. The kernel's allocation functions still enforce INT_MAX as a maximum
> >>    size regardless of memcg accounting
> >> 2. Programs and tests that legitimately test file descriptor limits can
> >>    inadvertently trigger massive allocations
> >> 3. The resulting allocations (>8GB) are impractical and will always fail
> >>
> >
> >alloc_fdtable() seems like the wrong place to do it.
> >
> >If there is an explicit de facto limit, the machinery which alters
> >fs.nr_open should validate against it.
> >
> >I understand this might result in systemd setting a new value which
> >significantly lower than what it uses now which technically is a change
> >in behavior, but I don't think it's a big deal.
> >
> >I'm assuming the kernel can't just set the value to something very high
> >by default.
> >
> >But in that case perhaps it could expose the max settable value? Then
> >systemd would not have to guess.
>
> The patch is in alloc_fdtable() because it's addressing a memory
> allocator limitation, not a fundamental file descriptor limitation.
>
> The INT_MAX restriction comes from kvmalloc(), not from any inherent
> constraint on how many FDs a process can have. If we implemented sparse
> FD tables or if kvmalloc() later supports larger allocations, the same
> nr_open value could become usable without any changes to FD handling
> code.
>
> Putting the check at the sysctl layer would codify a temporary
> implementation detail of the memory allocator as if it were a
> fundamental FD limit. By keeping it at the allocation point, the check
> reflects what it actually is - a current limitation of how large a
> contiguous allocation we can make.
>
> This placement also means the limit naturally adjusts if the underlying
> implementation changes, rather than requiring coordinated updates
> between the sysctl validation and the allocator capabilities.
>
> I don't have a strong opinion either way...
>

Allowing privileged userspace to set a limit which the kernel knows it
cannot reach sounds like a bug to me.

Indeed the limitation is an artifact of the current implementation, I
don't understand the logic behind pretending it's not there.

Regardless, not my call :)

-- 
Mateusz Guzik <mjguzik gmail.com>