On 9/9/25 3:06 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/09/09 17:26, Nilay Shroff wrote:
>>
>>
>> On 9/9/25 12:46 PM, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>>
>>>>>>
>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>>>>>
>>>>>>> request_queue->nr_requests can be changed by:
>>>>>>>
>>>>>>> a) switching the elevator by updating nr_hw_queues
>>>>>>> b) switching the elevator via the elevator sysfs attribute
>>>>>>> c) configuring the queue sysfs attribute nr_requests
>>>>>>>
>>>>>>> The current lock order is:
>>>>>>>
>>>>>>> 1) update_nr_hwq_lock, cases a, b
>>>>>>> 2) freeze_queue
>>>>>>> 3) elevator_lock, cases a, b, c
>>>>>>>
>>>>>>> Updating nr_requests is already serialized by elevator_lock;
>>>>>>> however, in case c) we'll have to allocate new sched_tags if
>>>>>>> nr_requests grows, and doing this with elevator_lock held and the
>>>>>>> queue frozen has the risk of deadlock.
>>>>>>>
>>>>>>> Hence use update_nr_hwq_lock instead, making it possible to
>>>>>>> allocate memory if the tags grow, while also preventing
>>>>>>> nr_requests from being changed concurrently.
>>>>>>>
>>>>>>> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  block/blk-sysfs.c | 12 +++++++++---
>>>>>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>>> --- a/block/blk-sysfs.c
>>>>>>> +++ b/block/blk-sysfs.c
>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>>>  	int ret, err;
>>>>>>>  	unsigned int memflags;
>>>>>>>  	struct request_queue *q = disk->queue;
>>>>>>> +	struct blk_mq_tag_set *set = q->tag_set;
>>>>>>>  	ret = queue_var_store(&nr, page, count);
>>>>>>>  	if (ret < 0)
>>>>>>>  		return ret;
>>>>>>> -	memflags = blk_mq_freeze_queue(q);
>>>>>>> -	mutex_lock(&q->elevator_lock);
>>>>>>> +	/* serialize updating nr_requests with switching elevator */
>>>>>>> +	down_write(&set->update_nr_hwq_lock);
>>>>>>>
>>>>>> For serializing the nr_requests update with elevator switching, we
>>>>>> should use disable_elv_switch(). With that change we wouldn't need
>>>>>> to acquire ->update_nr_hwq_lock in writer context while running
>>>>>> blk_mq_update_nr_requests; instead it could run with
>>>>>> ->update_nr_hwq_lock held in reader context.
>>>>>>
>>>>>> So the code flow should be:
>>>>>>
>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> ...
>>>>>> down_read ->update_nr_hwq_lock
>>>>>> acquire ->freeze_lock
>>>>>> acquire ->elevator_lock
>>>>>> ...
>>>>>> ...
>>>>>> release ->elevator_lock
>>>>>> release ->freeze_lock
>>>>>>
>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> up_read ->update_nr_hwq_lock
>>>>>>
>>>>> Yes, this makes sense. However, there is also an implied condition
>>>>> that we should serialize queue_requests_store() with itself: what if
>>>>> a concurrent caller succeeds with disable_elv_switch() before the
>>>>> down_read(), in this way?
>>>>>
>>>>> t1:                     t2:
>>>>> disable_elv_switch
>>>>>                         disable_elv_switch
>>>>>
>>>>> down_read               down_read
>>>>>
>>>> I believe this is already protected by the kernfs internal mutex
>>>> locks. So you shouldn't be able to run two sysfs store operations
>>>> concurrently on the same attribute file.
>>>>
>>>
>>> There really is no such internal lock; the call stack is:
>>>
>>> kernfs_fop_write_iter
>>>  sysfs_kf_write
>>>   queue_attr_store
>>>
>>> There is only a file-level mutex, kernfs_open_file->lock, taken by the
>>> top function kernfs_fop_write_iter(); however, this lock is not the
>>> same one if we open the same attribute file from different contexts.
>>>
>> Oh yes, this lock only protects against the same fd being written
>> concurrently. However, if we open the same sysfs file from different
>> process contexts then the fds would be different, so this lock wouldn't
>> protect against simultaneous updates of the sysfs attribute. Having
>> said that, looking through the code again it seems that the
>> q->nr_requests update is protected by ->elevator_lock (in both the
>> elevator switch code and your proposed changes in this patchset). So my
>> question is: do we really need to synchronize the nr_requests update
>> code with the elevator switch code? In other words, what if we don't
>> use ->update_nr_hwq_lock in queue_requests_store at all?
>
> 1) lock update_nr_hwq_lock, then no one can change nr_requests
> 2) check the input nr_requests
> 3) if it grows, allocate memory
>
> The main idea here is that we can check whether nr_requests grows and
> then allocate memory, without worrying that nr_requests might change
> after the memory allocation.
>
If nr_requests changes after the memory allocation we're still fine,
because eventually we'd end up with a single consistent value of
nr_requests. For instance, if process A is updating nr_requests to 128
while a sched switch is updating nr_requests to 256, then in the end we'd
see nr_requests set to either 128 or 256, depending on who runs last. We
wouldn't end up with some inconsistent value of nr_requests, would we?

> BTW, I think this sysfs attr is really a slow path, and it's fine to
> grab the write lock.
>
Yep, you're right. But I think we should avoid locks where possible.
(For reference, a rough and untested sketch of the reader-lock flow I
suggested above is at [1].)

Thanks,
--Nilay
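
[1] A rough sketch of the flow I described above, untested and for
reference only. disable_elv_switch() is just the helper name we've been
using in this discussion (it may end up with a different name or
signature), and the usual clamping/validation of nr is elided:

static ssize_t
queue_requests_store(struct gendisk *disk, const char *page, size_t count)
{
	struct request_queue *q = disk->queue;
	struct blk_mq_tag_set *set = q->tag_set;
	unsigned int memflags;
	unsigned long nr;
	int ret, err;

	ret = queue_var_store(&nr, page, count);
	if (ret < 0)
		return ret;

	/*
	 * Block a concurrent elevator switch instead of taking
	 * update_nr_hwq_lock for write; this sets QUEUE_FLAG_NO_ELV_SWITCH.
	 */
	disable_elv_switch(q);
	down_read(&set->update_nr_hwq_lock);

	/* the nr_requests update itself still runs under freeze + elevator_lock */
	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->elevator_lock);

	err = blk_mq_update_nr_requests(q, nr);
	if (err)
		ret = err;

	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

	/* re-allow elevator switching before dropping the reader lock */
	blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q);
	up_read(&set->update_nr_hwq_lock);

	return ret;
}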