Re: [PATCH] fs: Prevent file descriptor table allocations exceeding INT_MAX

On Mon, Jun 30, 2025 at 5:13 AM Sasha Levin <sashal@xxxxxxxxxx> wrote:
>
> On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
> >On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
> >> When sysctl_nr_open is set to a very high value (for example, 1073741816
> >> as set by systemd), processes attempting to use file descriptors near
> >> the limit can trigger massive memory allocation attempts that exceed
> >> INT_MAX, resulting in a WARNING in mm/slub.c:
> >>
> >>   WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
> >>
> >> This happens because kvmalloc_array() and kvmalloc() check if the
> >> requested size exceeds INT_MAX and emit a warning when the allocation is
> >> not flagged with __GFP_NOWARN.
> >>
> >> Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
> >> process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
> >> - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
> >> - Multiple bitmaps: ~400MB
> >> - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
> >>
> >> Reproducer:
> >> 1. Set /proc/sys/fs/nr_open to 1073741816:
> >>    # echo 1073741816 > /proc/sys/fs/nr_open
> >>
> >> 2. Run a program that uses a high file descriptor:
> >>    #include <unistd.h>
> >>    #include <sys/resource.h>
> >>
> >>    int main() {
> >>        struct rlimit rlim = {1073741824, 1073741824};
> >>        setrlimit(RLIMIT_NOFILE, &rlim);
> >>        dup2(2, 1073741880);  // Triggers the warning
> >>        return 0;
> >>    }
> >>
> >> 3. Observe WARNING in dmesg at mm/slub.c:5027
> >>
> >> systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
> >> maximum possible value. The rationale was that systems with memory
> >> control groups (memcg) no longer need separate file descriptor limits
> >> since memory is properly accounted. However, this change overlooked
> >> that:
> >>
> >> 1. The kernel's allocation functions still enforce INT_MAX as a maximum
> >>    size regardless of memcg accounting
> >> 2. Programs and tests that legitimately test file descriptor limits can
> >>    inadvertently trigger massive allocations
> >> 3. The resulting allocations (>8GB) are impractical and will always fail
> >>
> >
> >alloc_fdtable() seems like the wrong place to do it.
> >
> >If there is an explicit de facto limit, the machinery which alters
> >fs.nr_open should validate against it.
> >
> >I understand this might result in systemd setting a new value which is
> >significantly lower than what it uses now, which technically is a change
> >in behavior, but I don't think it's a big deal.
> >
> >I'm assuming the kernel can't just set the value to something very high
> >by default.
> >
> >But in that case perhaps it could expose the max settable value? Then
> >systemd would not have to guess.
>
> The patch is in alloc_fdtable() because it's addressing a memory
> allocator limitation, not a fundamental file descriptor limitation.
>
> The INT_MAX restriction comes from kvmalloc(), not from any inherent
> constraint on how many FDs a process can have. If we implemented sparse
> FD tables or if kvmalloc() later supports larger allocations, the same
> nr_open value could become usable without any changes to FD handling
> code.
>
> Putting the check at the sysctl layer would codify a temporary
> implementation detail of the memory allocator as if it were a
> fundamental FD limit. By keeping it at the allocation point, the check
> reflects what it actually is - a current limitation of how large a
> contiguous allocation we can make.
>
> This placement also means the limit naturally adjusts if the underlying
> implementation changes, rather than requiring coordinated updates
> between the sysctl validation and the allocator capabilities.
>
> I don't have a strong opinion either way...
>

Allowing privileged userspace to set a limit which the kernel knows it
cannot reach sounds like a bug to me.

Indeed, the limitation is an artifact of the current implementation, but
I don't understand the logic behind pretending it's not there.

Regardless, not my call :)
-- 
Mateusz Guzik <mjguzik gmail.com>




