On 8/28/25 3:28 AM, Li Nan wrote:
>
> On 2025/8/27 16:10, Ming Lei wrote:
>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>
>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@xxxxxxxxxxxxxxx wrote:
>>>>>>> From: Li Nan <linan122@xxxxxxxxxx>
>>>>>>>
>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>
>>>>>> Looks like we should check its return value and handle the failure in both
>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>
>>>>> By the time __blk_mq_update_nr_hw_queues() runs, the old hctxs are already
>>>>> unregistered, and the function is void; we failed to register the new hctxs
>>>>> because of a memory allocation failure. I really don't know how to handle
>>>>> the failure here. Do you have any suggestions?
>>>>
>>>> It is out of memory, so I think it is fine to do whatever it takes to leave
>>>> the queue state intact instead of leaving it partially workable, such as:
>>>>
>>>> - try updating nr_hw_queues to 1
>>>>
>>>> - if that still fails, delete the disk and mark the queue as dead, if a
>>>>   disk is attached
>>>>
>>>
>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>> the error?
>>
>> It is more of a workaround, since it ignores the sysfs register failure. And
>> if the issue needs to be fixed this way, you have to document it.
>>
>> In case of OOM, it usually means the system isn't usable any more.
>> But this is a NOIO allocation, and the typical use case is error recovery in
>> nvme pci, so there may not be enough pages for the NOIO allocation only. Is
>> that the reason for ignoring the sysfs register failure in
>> blk_mq_update_nr_hw_queues()?
>>
>> But NVMe has been pretty fragile in this area, by using a non-owner queue
>> freeze and calling blk_mq_update_nr_hw_queues() on a frozen queue, so is it
>> really necessary to take this into account?
>
> I agree with your points about NOIO and NVMe.
>
> I hit this issue in null_blk during fuzz testing with memory-fault
> injection. Changing the number of hardware queues under OOM is
> extremely rare in real-world usage. So I think adding a workaround and
> documenting it is sufficient. What do you think?

Working around it is fine, as it isn't a situation we really need to
worry about. But let's please not do it by poking at kobject internals.

-- 
Jens Axboe