On 8/8/25 2:04 PM, Yu Kuai wrote: > Hi, > > 在 2025/08/08 16:09, Nilay Shroff 写道: >> >> >> On 8/8/25 6:18 AM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/08/07 21:44, Zheng Qixing 写道: >>>> Hi, >>>> >>>> >>>> 在 2025/8/7 19:47, Nilay Shroff 写道: >>>>> >>>>> On 8/7/25 12:50 PM, Zheng Qixing wrote: >>>>>> From: Zheng Qixing <zhengqixing@xxxxxxxxxx> >>>>>> >>>>>> Device-mapper can call add_disk() multiple times for the same gendisk >>>>>> due to its two-phase creation process (dm create + dm load). This leads >>>>>> to kobject double initialization errors when the underlying iSCSI devices >>>>>> become temporarily unavailable and then reappear. >>>>>> >>>>>> However, if the first add_disk() call fails and is retried, the queue_kobj >>>>>> gets initialized twice, causing: >>>>>> >>>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object, >>>>>> something is seriously wrong. >>>>>> Call Trace: >>>>>> <TASK> >>>>>> dump_stack_lvl+0x5b/0x80 >>>>>> kobject_init.cold+0x43/0x51 >>>>>> blk_register_queue+0x46/0x280 >>>>>> add_disk_fwnode+0xb5/0x280 >>>>>> dm_setup_md_queue+0x194/0x1c0 >>>>>> table_load+0x297/0x2d0 >>>>>> ctl_ioctl+0x2a2/0x480 >>>>>> dm_ctl_ioctl+0xe/0x20 >>>>>> __x64_sys_ioctl+0xc7/0x110 >>>>>> do_syscall_64+0x72/0x390 >>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>>>> >>>>>> Fix this by separating kobject initialization from sysfs registration: >>>>>> - Initialize queue_kobj early during gendisk allocation >>>>>> - add_disk() only adds the already-initialized kobject to sysfs >>>>>> - del_gendisk() removes from sysfs but doesn't destroy the kobject >>>>>> - Final cleanup happens when the disk is released >>>>>> >>>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") >>>>>> Reported-by: Li Lingfeng <lilingfeng3@xxxxxxxxxx> >>>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@xxxxxxxxxx/ >>>>>> Signed-off-by: Zheng Qixing <zhengqixing@xxxxxxxxxx> >>>>>> --- >>>>>> block/blk-sysfs.c | 4 +--- >>>>>> block/blk.h | 1 + >>>>>> block/genhd.c | 2 ++ >>>>>> 3 files changed, 4 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>>> index 396cded255ea..37d8654faff9 100644 >>>>>> --- a/block/blk-sysfs.c >>>>>> +++ b/block/blk-sysfs.c >>>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj) >>>>>> /* nothing to do here, all data is associated with the parent gendisk */ >>>>>> } >>>>>> -static const struct kobj_type blk_queue_ktype = { >>>>>> +const struct kobj_type blk_queue_ktype = { >>>>>> .default_groups = blk_queue_attr_groups, >>>>>> .sysfs_ops = &queue_sysfs_ops, >>>>>> .release = blk_queue_release, >>>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk) >>>>>> struct request_queue *q = disk->queue; >>>>>> int ret; >>>>>> - kobject_init(&disk->queue_kobj, &blk_queue_ktype); >>>>>> ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue"); >>>>>> if (ret < 0) >>>>>> goto out_put_queue_kobj; >>>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj, >>>>> where we release/put disk->queue_kobj. That would decrement the kref of >>>>> disk->queue_kobj and possibly bring it to zero. >>>> >>>> >>>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here, >>>> >>>> it should return without kobject_del/put(). >>> >>> Yes, sorry I didn't noticed. >>>> >>>> >>>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback. >>>> >>>> >>>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic. >>>> >>>> >>>>> Next time, when we call add_disk() again without invoking kobject_init() >>>>> (because the initialization is now moved outside add_disk()), the refcount >>>>> of disk->queue_kobj — which was previously released — would now go for a >>>>> toss. Wouldn't that lead to use-after-free or inconsistent state? >>>>> >>>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk) >>>>>> elevator_set_none(q); >>>>>> blk_debugfs_remove(disk); >>>>>> - kobject_put(&disk->queue_kobj); >>>>>> } >>>>> I'm thinking a case where add_disk() fails after the queue is registered. >>>>> In that case, we call blk_unregister_queue() — which would ideally put() >>>>> the disk->queue_kobj. >>>>> But if we skip that put() in blk_unregister_queue() (and that's what we do >>>>> above), and then later retry add_disk(), wouldn’t kobject_add() from >>>>> blk_register_queue() complain loudly — since we’re trying to add a kobject >>>>> that was already added previously? >>> >>> This is exactly the problem reported orginally, now is the same >>> procedures before 2bd85221a625: >>> >>> 1) allocate memory: kobject_init >>> 2) register queue: kobject_add >>> 3) unregister queue: kobject_del >>> 4) free memory: kobject_put >>> >>> Noted that kobject_add is corresponding to kobject_del, and they don't >>> grab/release kobject reference. 2) and 3) can be executed multiple >>> times, the only thing that I noticed is the following uevent: >>> >>> kobject_uevent(&disk->queue_kobj, KOBJ_ADD); >>> >>> Looks like the uevent is only valid for the first one, if first add_disk >>> failed and then queue is registered again, there won't be uevent again, >>> see state_add_uevent_sent. >>> >> Why do you think so? We always send the "add" uevent when we register >> queue and then send "remove" uevent when the queue is unregistered. This >> behavior should remain intact with this change. The kobj->state_add_uevent_sent >> is only used for sending "remove" event during automatic cleanup of kobject. >> It shouldn't have any side effect while we re-register queue or retry >> add_disk(). Or am I missing something here? > > Yes, turns out I misread the code after I tested and found that user > didn't get the queue uevent. Just dig deeper and realize this is due > to dev_uevent_filter() from the top devices_kset stop the queue kobj > uevent. > > So, if I don't misread the code now, looks like the kobject_uevent() > call to queue_kobj is useless. > Yes that's correct. And so this patch doesn't alter the queue_kobj uevent behavior and so we're good here. Thanks, --Nilay