On Tue, Apr 22, 2025 at 1:51 PM Christoph Hellwig <hch@xxxxxx> wrote: > > On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote: > > Now del_gendisk and put_disk are called asynchronously in workqueue work. > > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this > > array again and it flush the workqueue at the bigging of open process. So > > a deadlock happens. > > > > The asynchronous way also has a problem that the device node can still > > exist after mdadm --stop command returns in a short window. So udev rule > > can open this device node and create the struct mddev in kernel again. > > > > So put del_gendisk in ioctl path and still leave put_disk in > > md_kobj_release to avoid uaf. > > The md lifetime rules are complicated enough as-is. So while I won't > object to this change per-see I'd rather have it reviewed by the md > maintainers independently. > > In the meantime this should ensure devtmpfs doesn't call into > blkdev_get_no_open and thus put_disk: > > diff --git a/block/bdev.c b/block/bdev.c > index 6a34179192c9..97d4c0ab1670 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait) > */ > void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask) > { > - struct inode *backing_inode; > struct block_device *bdev; > > - backing_inode = d_backing_inode(path->dentry); > - > /* > - * Note that backing_inode is the inode of a block device node file, > - * not the block device's internal inode. Therefore it is *not* valid > - * to use I_BDEV() here; the block device has to be looked up by i_rdev > - * instead. > + * Note that d_backing_inode() returnsthe inode of a block device node > + * file, not the block device's internal inode. > + * > + * Therefore it is *not* valid to use I_BDEV() here; the block device > + * has to be looked up by i_rdev instead. > + * > + * Only do this lookup if actually needed to avoid the performance > + * overhead of the lookup, and to avoid injecting bdev lifetime issues > + * into devtmpfs. > */ > - bdev = blkdev_get_no_open(backing_inode->i_rdev); > + if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC))) > + return; > + > + bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev); > if (!bdev) > return; > > Hi Christoph Thanks for the point. I tried this patch and it's stuck on the device mapper device during shutdown. [ 848.520483] systemd-shutdow[-- MARK -- Tue Apr 22 08:20:00 2025] [ 989.175210] INFO: task systemd-shutdow:1 blocked for more than 122 seconds. [ 989.175667] Not tainted 6.15.0-rc3+ #3 [ 989.176327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 989.176728] task:systemd-shutdow state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x480100 flags:0x00004002 [ 989.177287] Call Trace: [ 989.177429] <TASK> [ 989.177910] __schedule+0x2c7/0x690 [ 989.178501] schedule+0x23/0x80 [ 989.179265] schedule_timeout+0xe8/0x100 [ 989.179488] ? sched_clock_cpu+0xb/0x190 [ 989.179703] ? __smp_call_single_queue+0xac/0x120 [ 989.180436] ? ttwu_queue_wakelist+0xd0/0xf0 [ 989.181065] __wait_for_common+0x99/0x1c0 [ 989.181287] ? __pfx_schedule_timeout+0x10/0x10 [ 989.181935] devtmpfs_submit_req+0x6e/0x90 [ 989.182154] devtmpfs_delete_node+0x60/0x90 [ 989.182374] ? kernfs_remove+0x5e/0x70 [ 989.182579] ? sysfs_remove_group+0x46/0x90 [ 989.182787] device_del+0x315/0x3d0 [ 989.183389] ? mutex_lock+0xe/0x30 [ 989.183963] del_gendisk+0x216/0x320 [ 989.184180] cleanup_mapped_device+0x13f/0x150 [dm_mod] [ 989.184502] __dm_destroy+0x12e/0x1f0 [dm_mod] [ 989.185162] dev_remove+0x119/0x1a0 [dm_mod] [ 989.185830] ctl_ioctl+0x21c/0x370 [dm_mod] [ 989.186090] dm_ctl_ioctl+0xa/0x20 [dm_mod] [ 989.186325] __x64_sys_ioctl+0x92/0xc0 [ 989.186533] do_syscall_64+0x79/0x160 [ 989.186738] ? blkg_alloc+0x10f/0x1c0 [ 989.186976] ? __memcg_slab_free_hook+0xf2/0x140 [ 989.187616] ? __x64_sys_close+0x39/0x80 [ 989.187836] ? kmem_cache_free+0x33c/0x450 [ 989.188057] ? syscall_exit_to_user_mode+0xc/0x1d0 [ 989.188696] ? do_syscall_64+0x85/0x160 [ 989.188904] ? __pfx_submit_bio_wait_endio+0x10/0x10 [ 989.189222] ? blkdev_fsync+0x41/0x60 [ 989.189468] ? syscall_exit_to_user_mode+0xc/0x1d0 [ 989.190096] ? do_syscall_64+0x85/0x160 [ 989.190316] ? do_sys_openat2+0x8c/0xd0 [ 989.190524] ? syscall_exit_to_user_mode+0xc/0x1d0 [ 989.191160] ? do_syscall_64+0x85/0x160 [ 989.191374] ? syscall_exit_to_user_mode+0xc/0x1d0 [ 989.192001] ? do_syscall_64+0x85/0x160 [ 989.192226] ? do_syscall_64+0x85/0x160 [ 989.192434] ? do_syscall_64+0x85/0x160 [ 989.192641] ? syscall_exit_to_user_mode+0xc/0x1d0 [ 989.193287] ? do_syscall_64+0x85/0x160 [ 989.193509] ? do_syscall_64+0x85/0x160 [ 989.193719] ? exc_page_fault+0x64/0x140 [ 989.2: 00000246 ORIG_RAX: 0000000000000010 [ 989.294592] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fef13903b0b [ 989.295388] RDX: 00007ffef60fbe50 RSI: 00000000c138fd04 RDI: 0000000000000005 [ 989.296153] RBP: 00007ffef60fbfc0 R08: 0000000000000000 R09: 00007ffef60fb190 [ 989.296936] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005 [ 989.297708] R13: 0000000000000000 R14: 00007ffef60fbe10 R15: 0000000000000000 [ 989.298512] </TASK> [ 989.298700] INFO: task kdevtmpfs:506 blocked for more than 123 seconds. [ 989.299139] Not tainted 6.15.0-rc3+ #3 [ 989.299795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 989.300279] task:kdevtmpfs state:D stack:0 pid:506 tgid:506 ppid:2 task_flags:0x208140 flags:0x00004000 [ 989.300837] Call Trace: [ 989.300994] <TASK> [ 68/0x290 [ 989.701395] ? __update_idle_core+0x23/0xb0 [ 989.701611] devtmpfs_work_loop+0x10d/0x2a0 [ 989.701979] ? __pfx_devtmpfsd+0x10/0x10 [ 989.700x10/0x10 [ 989.802290] ret_from_fork+0x30/0x50 [ 989.802498] ? __pfx_kthread+0x10/0x10 [ 989.802694] ret_from_fork_asm+0x1a/0x30 [ 989.802926] </TASK> [ 989.803103] INFO: task kworker/38:1:586 blocked for more than 123 seconds. [ 989.803453] Not tainted 6.15.0-rc3+ #3 There is another reason that I want to put del_gendisk in the ioctl path. Because sometimes device node can still exist after command mdadm --stop. del_gendisk removes inode first and then removes device node (e.g. /dev/md0). So there is a small window that device node can be open again. Then some strange things happen. Sometimes the array is created but device node can't be created (I guess it's removed by devtempfs?). Sometimes the kernel message prints "block device autoloading is deprecated and will be removed". If your patch can work well, md array can call flush_workqueue(md_misc_wq) in do_md_stop to avoid the deadlock that del_gendisk needs to wait all sysfs calls to return. Regards Xiao Xiao