Re: [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock()

On 9/9/25 03:00, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
> 
> Overview:
> 
> This patch set introduces kmalloc_nolock(), the next logical step
> towards any-context allocation, which is necessary to remove bpf_mem_alloc
> and get rid of the preallocation requirement in the BPF infrastructure.
> In production, BPF maps have grown to gigabytes in size, and preallocation
> wastes memory. Allocating from any context addresses this issue for BPF
> and for other subsystems that are forced to preallocate as well.
> This long task started with the introduction of alloc_pages_nolock();
> then memcg and objcg were converted to operate from any context,
> including NMI. This set completes the task with kmalloc_nolock(),
> which builds on top of alloc_pages_nolock() and the memcg changes.
> After that the BPF subsystem will gradually adopt it everywhere.
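For illustration, a minimal sketch of how a caller in a restricted context
(IRQ, NMI, or with arbitrary locks held) might use the new API. The
three-argument kmalloc_nolock(size, gfp_flags, node) form and kfree_nolock()
are taken from patch 6, and __GFP_ACCOUNT from patch 2; the event-recording
helpers around them are hypothetical:

#include <linux/slab.h>
#include <linux/ktime.h>
#include <linux/smp.h>

/*
 * Hypothetical NMI/IRQ-safe event recording: plain kmalloc() cannot be
 * used here because it may take locks; kmalloc_nolock() either succeeds
 * without sleeping or spinning, or returns NULL.
 */
struct sample {
	u64 ts;
	u32 cpu;
};

static struct sample *record_sample(void)
{
	struct sample *s;

	/* Best-effort allocation; the caller must tolerate NULL. */
	s = kmalloc_nolock(sizeof(*s), __GFP_ACCOUNT, NUMA_NO_NODE);
	if (!s)
		return NULL;

	s->ts = ktime_get_mono_fast_ns();
	s->cpu = raw_smp_processor_id();
	return s;
}

/*
 * Objects obtained from kmalloc_nolock() are freed with kfree_nolock();
 * kfree_nolock() must not be used on objects from plain kmalloc().
 */
static void drop_sample(struct sample *s)
{
	kfree_nolock(s);
}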
> 
> The patch set is on top of slab/for-next, which already has the
> pre-patch "locking/local_lock: Expose dep_map in local_trylock_t." applied.
> I think the patch set should be routed via vbabka/slab.git.

Thanks, added to slab/for-next. There were no conflicts with mm-unstable
when I tried merging locally.

> v4->v5:
> - New patch "Reuse first bit for OBJEXTS_ALLOC_FAIL" to free up a bit
>   and use it to mark a slabobj_ext vector allocated with kmalloc_nolock(),
>   so that freeing of the vector can be done with kfree_nolock()
> - Call kasan_slab_free() directly from kfree_nolock() instead of deferring to
>   do_slab_free() to avoid double poisoning
> - Addressed other minor issues spotted by Harry
> 
> v4:
> https://lore.kernel.org/all/20250718021646.73353-1-alexei.starovoitov@xxxxxxxxx/
> 
> v3->v4:
> - Converted local_lock_cpu_slab() to a macro
> - Reordered patches 5 and 6
> - Emphasized that kfree_nolock() shouldn't be used on kmalloc()-ed objects
> - Addressed other comments and improved commit logs
> - Fixed build issues reported by bots
> 
> v3:
> https://lore.kernel.org/bpf/20250716022950.69330-1-alexei.starovoitov@xxxxxxxxx/
> 
> v2->v3:
> - Adopted Sebastian's local_lock_cpu_slab(), but dropped gfpflags
>   to avoid an extra branch for performance reasons,
>   and added local_unlock_cpu_slab() for symmetry.
> - Dropped the local_lock_lockdep_start/end() pair and switched to a
>   per-kmem_cache lockdep class on PREEMPT_RT to silence a false positive
>   when the same cpu/task acquires two local_locks.
> - Refactored defer_free per Sebastian's suggestion
> - Fixed slab leak when it needs to be deactivated via irq_work and llist
>   as Vlastimil proposed. Including defer_free_barrier().
> - Use kmem_cache->offset for the llist_node pointer when linking objects
>   instead of offset zero, since the whole object could be in use for slabs
>   with ctors and in other cases.
> - Fixed "cnt = 1; goto redo;" issue.
> - Fixed slab leak in alloc_single_from_new_slab().
> - Retested with slab_debug, RT, !RT, lockdep, kasan, slab_tiny
> - Added acks to patches 1-4 that should be good to go.
> 
> v2:
> https://lore.kernel.org/bpf/20250709015303.8107-1-alexei.starovoitov@xxxxxxxxx/
> 
> v1->v2:
> Added more comments for this non-trivial logic and addressed earlier comments.
> In particular:
> - Introduce alloc_frozen_pages_nolock() to avoid refcnt race
> - alloc_pages_nolock() defaults to GFP_COMP
> - Support SLUB_TINY
> - Added more variants to the stress tester and discovered that kfree_nolock()
>   can OOM, because the deferred per-slab llist won't be serviced if
>   kfree_nolock() stays unlucky for long enough. Scrapped the previous approach
>   and switched to a global per-cpu llist with immediate irq_work_queue() to
>   process all object sizes (see the sketch after this list).
> - Reentrant kmalloc cannot call deactivate_slab(). In v1 the node hint was
>   downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized that's not
>   good enough: there are odd cases that can still trigger deactivation.
>   Rewrote this part.
> - Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion:
>   https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/
>   which was adopted. So slab_debug works now.
> - In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch
>   of places in mm/slub.c to avoid lockdep false positives.
>   Came up with a much cleaner approach to silence invalid lockdep reports
>   without sacrificing lockdep coverage. See local_lock_lockdep_start/end().
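As a rough illustration of the deferred-free pattern mentioned in the list
above (a per-cpu llist drained by irq_work), here is a sketch of the general
mechanism; the structure layout and the names defer_free, free_deferred_objects
and free_one_object are illustrative assumptions, not the exact mm/slub.c code:

#include <linux/container_of.h>
#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/percpu.h>

struct defer_free {
	struct llist_head objects;
	struct irq_work work;
};

/*
 * irq_work callback: runs in a context where the required locks can be
 * taken, so the deferred objects can finally be freed.
 */
static void free_deferred_objects(struct irq_work *work)
{
	struct defer_free *df = container_of(work, struct defer_free, work);
	struct llist_node *llnode, *pos, *t;

	llnode = llist_del_all(&df->objects);
	llist_for_each_safe(pos, t, llnode)
		free_one_object(pos);	/* placeholder for the real free path */
}

static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
	.objects = LLIST_HEAD_INIT(defer_free_objects.objects),
	.work = IRQ_WORK_INIT(free_deferred_objects),
};

/*
 * Called when freeing cannot take locks in the current context: link the
 * object into the per-cpu llist and queue the irq_work immediately so the
 * list is serviced soon and cannot grow without bound.
 */
static void defer_free(struct llist_node *llnode)
{
	struct defer_free *df = this_cpu_ptr(&defer_free_objects);

	llist_add(llnode, &df->objects);
	irq_work_queue(&df->work);
}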
> 
> v1:
> https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@xxxxxxxxx/
> 
> Alexei Starovoitov (6):
>   locking/local_lock: Introduce local_lock_is_locked().
>   mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
>   mm: Introduce alloc_frozen_pages_nolock()
>   slab: Make slub local_(try)lock more precise for LOCKDEP
>   slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
>   slab: Introduce kmalloc_nolock() and kfree_nolock().
> 
>  include/linux/gfp.h                 |   2 +-
>  include/linux/kasan.h               |  13 +-
>  include/linux/local_lock.h          |   2 +
>  include/linux/local_lock_internal.h |   7 +
>  include/linux/memcontrol.h          |  12 +-
>  include/linux/rtmutex.h             |  10 +
>  include/linux/slab.h                |   4 +
>  kernel/bpf/stream.c                 |   2 +-
>  kernel/bpf/syscall.c                |   2 +-
>  kernel/locking/rtmutex_common.h     |   9 -
>  mm/Kconfig                          |   1 +
>  mm/internal.h                       |   4 +
>  mm/kasan/common.c                   |   5 +-
>  mm/page_alloc.c                     |  55 ++--
>  mm/slab.h                           |   7 +
>  mm/slab_common.c                    |   3 +
>  mm/slub.c                           | 495 +++++++++++++++++++++++++---
>  17 files changed, 541 insertions(+), 92 deletions(-)
> 




