On 9/9/25 3:06 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/09/09 17:26, Nilay Shroff wrote:
>>
>>
>> On 9/9/25 12:46 PM, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>>
>>>>>>
>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>>>>>
>>>>>>> request_queue->nr_requests can be changed by:
>>>>>>>
>>>>>>> a) switching the elevator by updating nr_hw_queues
>>>>>>> b) switching the elevator via the elevator sysfs attribute
>>>>>>> c) configuring the queue sysfs attribute nr_requests
>>>>>>>
>>>>>>> The current lock order is:
>>>>>>>
>>>>>>> 1) update_nr_hwq_lock, cases a, b
>>>>>>> 2) freeze_queue
>>>>>>> 3) elevator_lock, cases a, b, c
>>>>>>>
>>>>>>> Updating nr_requests is already serialized by elevator_lock;
>>>>>>> however, in case c) we'll have to allocate new sched_tags if
>>>>>>> nr_requests grows, and doing this with elevator_lock held and the
>>>>>>> queue frozen has the risk of deadlock.
>>>>>>>
>>>>>>> Hence use update_nr_hwq_lock instead, making it possible to
>>>>>>> allocate memory if the tags grow, while also preventing
>>>>>>> nr_requests from being changed concurrently.
>>>>>>>
>>>>>>> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  block/blk-sysfs.c | 12 +++++++++---
>>>>>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>>> --- a/block/blk-sysfs.c
>>>>>>> +++ b/block/blk-sysfs.c
>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>>>  	int ret, err;
>>>>>>>  	unsigned int memflags;
>>>>>>>  	struct request_queue *q = disk->queue;
>>>>>>> +	struct blk_mq_tag_set *set = q->tag_set;
>>>>>>>  	ret = queue_var_store(&nr, page, count);
>>>>>>>  	if (ret < 0)
>>>>>>>  		return ret;
>>>>>>> -	memflags = blk_mq_freeze_queue(q);
>>>>>>> -	mutex_lock(&q->elevator_lock);
>>>>>>> +	/* serialize updating nr_requests with switching elevator */
>>>>>>> +	down_write(&set->update_nr_hwq_lock);
>>>>>>>
>>>>>> For serializing the nr_requests update with elevator switching, we
>>>>>> should use disable_elv_switch(). With that change we wouldn't need
>>>>>> to acquire ->update_nr_hwq_lock in writer context while running
>>>>>> blk_mq_update_nr_requests; instead it could run with
>>>>>> ->update_nr_hwq_lock held in reader context.
>>>>>>
>>>>>> So the code flow should be:
>>>>>>
>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> ...
>>>>>> down_read ->update_nr_hwq_lock
>>>>>> acquire ->freeze_lock
>>>>>> acquire ->elevator_lock
>>>>>> ...
>>>>>> ...
>>>>>> release ->elevator_lock
>>>>>> release ->freeze_lock
>>>>>>
>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> up_read ->update_nr_hwq_lock
>>>>>>
>>>>> Yes, this makes sense. However, there is also an implied condition
>>>>> that we should serialize queue_requests_store() with itself: what if
>>>>> a concurrent caller succeeds with disable_elv_switch() before the
>>>>> down_read(), in this way?
>>>>>
>>>>> t1:                     t2:
>>>>> disable_elv_switch
>>>>>                         disable_elv_switch
>>>>>
>>>>> down_read               down_read
>>>>>
>>>> I believe this is already protected by the kernfs internal mutex
>>>> locks. So you shouldn't be able to run two sysfs store operations
>>>> concurrently on the same attribute file.
>>>>
>>>
>>> There really is no such internal lock; the call stack is:
>>>
>>> kernfs_fop_write_iter
>>>  sysfs_kf_write
>>>   queue_attr_store
>>>
>>> There is only a file-level mutex, kernfs_open_file->lock, taken by the
>>> top function kernfs_fop_write_iter(); however, this lock is not the
>>> same one if we open the same attribute file from different contexts.
>>>
>> Oh yes, this lock only protects against the same fd being written
>> concurrently. However, if we open the same sysfs file from different
>> process contexts then the fds would be different, so this lock wouldn't
>> protect against simultaneous updates of the sysfs attribute. Having
>> said that, looking through the code again it seems that the
>> q->nr_requests update is protected by ->elevator_lock (in both the
>> elevator switch code and your proposed changes in this patchset). So my
>> question is: do we really need to synchronize the nr_requests update
>> code with the elevator switch code? In other words, what if we don't
>> use ->update_nr_hwq_lock in queue_requests_store at all?
>
> 1) lock update_nr_hwq_lock, then no one can change nr_requests
> 2) check the input nr_requests
> 3) if it grows, allocate memory
>
> The main idea here is that we can check whether nr_requests grows and
> then allocate memory, without worrying that nr_requests might change
> after the memory allocation.
>
If nr_requests changes after the memory allocation we're still fine,
because eventually we'd end up with a single consistent value of
nr_requests. For instance, if process A is updating nr_requests to 128
while a sched switch is updating nr_requests to 256, then in the end we'd
see nr_requests set to either 128 or 256, depending on who runs last. We
wouldn't end up with some inconsistent value of nr_requests, would we?

> BTW, I think this sysfs attr is really a slow path, and it's fine to
> grab the write lock.
>
Yep, you're right. But I think we should avoid locks where possible.
(For reference, a rough and untested sketch of the reader-lock flow I
suggested above is at [1].)

Thanks,
--Nilay
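
[1] A rough sketch of the flow I described above, untested and for
reference only. disable_elv_switch() is just the helper name we've been
using in this discussion (it may end up with a different name or
signature), and the usual clamping/validation of nr is elided:

static ssize_t
queue_requests_store(struct gendisk *disk, const char *page, size_t count)
{
	struct request_queue *q = disk->queue;
	struct blk_mq_tag_set *set = q->tag_set;
	unsigned int memflags;
	unsigned long nr;
	int ret, err;

	ret = queue_var_store(&nr, page, count);
	if (ret < 0)
		return ret;

	/*
	 * Block a concurrent elevator switch instead of taking
	 * update_nr_hwq_lock for write; this sets QUEUE_FLAG_NO_ELV_SWITCH.
	 */
	disable_elv_switch(q);
	down_read(&set->update_nr_hwq_lock);

	/* the nr_requests update itself still runs under freeze + elevator_lock */
	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->elevator_lock);

	err = blk_mq_update_nr_requests(q, nr);
	if (err)
		ret = err;

	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

	/* re-allow elevator switching before dropping the reader lock */
	blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q);
	up_read(&set->update_nr_hwq_lock);

	return ret;
}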