On Tue, Apr 22, 2025 at 4:31 PM Xiao Ni <xni@xxxxxxxxxx> wrote: > > On Tue, Apr 22, 2025 at 1:51 PM Christoph Hellwig <hch@xxxxxx> wrote: > > > > On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote: > > > Now del_gendisk and put_disk are called asynchronously in workqueue work. > > > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this > > > array again and it flush the workqueue at the bigging of open process. So > > > a deadlock happens. > > > > > > The asynchronous way also has a problem that the device node can still > > > exist after mdadm --stop command returns in a short window. So udev rule > > > can open this device node and create the struct mddev in kernel again. > > > > > > So put del_gendisk in ioctl path and still leave put_disk in > > > md_kobj_release to avoid uaf. > > > > The md lifetime rules are complicated enough as-is. So while I won't > > object to this change per-see I'd rather have it reviewed by the md > > maintainers independently. > > > > In the meantime this should ensure devtmpfs doesn't call into > > blkdev_get_no_open and thus put_disk: > > > > diff --git a/block/bdev.c b/block/bdev.c > > index 6a34179192c9..97d4c0ab1670 100644 > > --- a/block/bdev.c > > +++ b/block/bdev.c > > @@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait) > > */ > > void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask) > > { > > - struct inode *backing_inode; > > struct block_device *bdev; > > > > - backing_inode = d_backing_inode(path->dentry); > > - > > /* > > - * Note that backing_inode is the inode of a block device node file, > > - * not the block device's internal inode. Therefore it is *not* valid > > - * to use I_BDEV() here; the block device has to be looked up by i_rdev > > - * instead. > > + * Note that d_backing_inode() returnsthe inode of a block device node > > + * file, not the block device's internal inode. > > + * > > + * Therefore it is *not* valid to use I_BDEV() here; the block device > > + * has to be looked up by i_rdev instead. > > + * > > + * Only do this lookup if actually needed to avoid the performance > > + * overhead of the lookup, and to avoid injecting bdev lifetime issues > > + * into devtmpfs. > > */ > > - bdev = blkdev_get_no_open(backing_inode->i_rdev); > > + if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC))) > > + return; > > + > > + bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev); > > if (!bdev) > > return; > > > > > > Hi Christoph > > Thanks for the point. I tried this patch and it's stuck on the device > mapper device during shutdown. Sorry, please ignore the following error. I made a mistake when building the kernel. Regards Xiao > > [ 848.520483] systemd-shutdow[-- MARK -- Tue Apr 22 08:20:00 2025] > [ 989.175210] INFO: task systemd-shutdow:1 blocked for more than 122 seconds. > [ 989.175667] Not tainted 6.15.0-rc3+ #3 > [ 989.176327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 989.176728] task:systemd-shutdow state:D stack:0 pid:1 > tgid:1 ppid:0 task_flags:0x480100 flags:0x00004002 > [ 989.177287] Call Trace: > [ 989.177429] <TASK> > [ 989.177910] __schedule+0x2c7/0x690 > [ 989.178501] schedule+0x23/0x80 > [ 989.179265] schedule_timeout+0xe8/0x100 > [ 989.179488] ? sched_clock_cpu+0xb/0x190 > [ 989.179703] ? __smp_call_single_queue+0xac/0x120 > [ 989.180436] ? ttwu_queue_wakelist+0xd0/0xf0 > [ 989.181065] __wait_for_common+0x99/0x1c0 > [ 989.181287] ? __pfx_schedule_timeout+0x10/0x10 > [ 989.181935] devtmpfs_submit_req+0x6e/0x90 > [ 989.182154] devtmpfs_delete_node+0x60/0x90 > [ 989.182374] ? kernfs_remove+0x5e/0x70 > [ 989.182579] ? sysfs_remove_group+0x46/0x90 > [ 989.182787] device_del+0x315/0x3d0 > [ 989.183389] ? mutex_lock+0xe/0x30 > [ 989.183963] del_gendisk+0x216/0x320 > [ 989.184180] cleanup_mapped_device+0x13f/0x150 [dm_mod] > [ 989.184502] __dm_destroy+0x12e/0x1f0 [dm_mod] > [ 989.185162] dev_remove+0x119/0x1a0 [dm_mod] > [ 989.185830] ctl_ioctl+0x21c/0x370 [dm_mod] > [ 989.186090] dm_ctl_ioctl+0xa/0x20 [dm_mod] > [ 989.186325] __x64_sys_ioctl+0x92/0xc0 > [ 989.186533] do_syscall_64+0x79/0x160 > [ 989.186738] ? blkg_alloc+0x10f/0x1c0 > [ 989.186976] ? __memcg_slab_free_hook+0xf2/0x140 > [ 989.187616] ? __x64_sys_close+0x39/0x80 > [ 989.187836] ? kmem_cache_free+0x33c/0x450 > [ 989.188057] ? syscall_exit_to_user_mode+0xc/0x1d0 > [ 989.188696] ? do_syscall_64+0x85/0x160 > [ 989.188904] ? __pfx_submit_bio_wait_endio+0x10/0x10 > [ 989.189222] ? blkdev_fsync+0x41/0x60 > [ 989.189468] ? syscall_exit_to_user_mode+0xc/0x1d0 > [ 989.190096] ? do_syscall_64+0x85/0x160 > [ 989.190316] ? do_sys_openat2+0x8c/0xd0 > [ 989.190524] ? syscall_exit_to_user_mode+0xc/0x1d0 > [ 989.191160] ? do_syscall_64+0x85/0x160 > [ 989.191374] ? syscall_exit_to_user_mode+0xc/0x1d0 > [ 989.192001] ? do_syscall_64+0x85/0x160 > [ 989.192226] ? do_syscall_64+0x85/0x160 > [ 989.192434] ? do_syscall_64+0x85/0x160 > [ 989.192641] ? syscall_exit_to_user_mode+0xc/0x1d0 > [ 989.193287] ? do_syscall_64+0x85/0x160 > [ 989.193509] ? do_syscall_64+0x85/0x160 > [ 989.193719] ? exc_page_fault+0x64/0x140 > [ 989.2: 00000246 ORIG_RAX: 0000000000000010 > [ 989.294592] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fef13903b0b > [ 989.295388] RDX: 00007ffef60fbe50 RSI: 00000000c138fd04 RDI: 0000000000000005 > [ 989.296153] RBP: 00007ffef60fbfc0 R08: 0000000000000000 R09: 00007ffef60fb190 > [ 989.296936] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005 > [ 989.297708] R13: 0000000000000000 R14: 00007ffef60fbe10 R15: 0000000000000000 > [ 989.298512] </TASK> > [ 989.298700] INFO: task kdevtmpfs:506 blocked for more than 123 seconds. > [ 989.299139] Not tainted 6.15.0-rc3+ #3 > [ 989.299795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 989.300279] task:kdevtmpfs state:D stack:0 pid:506 > tgid:506 ppid:2 task_flags:0x208140 flags:0x00004000 > [ 989.300837] Call Trace: > [ 989.300994] <TASK> > [ 68/0x290 > [ 989.701395] ? __update_idle_core+0x23/0xb0 > [ 989.701611] devtmpfs_work_loop+0x10d/0x2a0 > [ 989.701979] ? __pfx_devtmpfsd+0x10/0x10 > [ 989.700x10/0x10 > [ 989.802290] ret_from_fork+0x30/0x50 > [ 989.802498] ? __pfx_kthread+0x10/0x10 > [ 989.802694] ret_from_fork_asm+0x1a/0x30 > [ 989.802926] </TASK> > [ 989.803103] INFO: task kworker/38:1:586 blocked for more than 123 seconds. > [ 989.803453] Not tainted 6.15.0-rc3+ #3 > > > There is another reason that I want to put del_gendisk in the ioctl > path. Because sometimes device node can still exist after command > mdadm --stop. del_gendisk removes inode first and then removes device > node (e.g. /dev/md0). So there is a small window that device node can > be open again. Then some strange things happen. Sometimes the array is > created but device node can't be created (I guess it's removed by > devtempfs?). Sometimes the kernel message prints "block device > autoloading is deprecated and will be removed". > > If your patch can work well, md array can call > flush_workqueue(md_misc_wq) in do_md_stop to avoid the deadlock that > del_gendisk needs to wait all sysfs calls to return. > > > Regards > Xiao > > Xiao