On 8/28/25 7:09 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/08/29 1:23, Jens Axboe wrote:
>> On 8/28/25 3:28 AM, Li Nan wrote:
>>>
>>>
>>> On 2025/8/27 16:10, Ming Lei wrote:
>>>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>>>
>>>>>
>>>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@xxxxxxxxxxxxxxx wrote:
>>>>>>>>> From: Li Nan <linan122@xxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>>>
>>>>>>>> Looks like we should check its return value and handle the failure in both
>>>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>>>
>>>>>>> From __blk_mq_update_nr_hw_queues(), the old hctxs are already
>>>>>>> unregistered, and this function is void; we failed to register new hctxs
>>>>>>> because of memory allocation failure. I really don't know how to handle
>>>>>>> the failure here, do you have any suggestions?
>>>>>>
>>>>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>>>>> intact instead of making it `partial workable`, such as:
>>>>>>
>>>>>> - try to update nr_hw_queues to 1
>>>>>>
>>>>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>>>>
>>>>>
>>>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>>>> the error?
>>>>
>>>> It is more like a workaround by ignoring the sysfs register failure. And if
>>>> the issue needs to be fixed in this way, you have to document it.
>>>>
>>>> In case of OOM, it usually means that the system isn't usable any more.
>>>> But it is NOIO allocation and the typical use case is for error recovery in
>>>> nvme pci, so there may not be enough pages for noio allocation only. That is
>>>> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
>>>>
>>>> But NVMe has been pretty fragile in this area by using non-owner queue
>>>> freeze, and calling blk_mq_update_nr_hw_queues() on frozen queue, so is it
>>>> really necessary to take it into account?
>>>
>>> I agree with your points about NOIO and NVMe.
>>>
>>> I hit this issue in null_blk during fuzz testing with memory-fault
>>> injection. Changing the number of hardware queues under OOM is
>>> extremely rare in real-world usage. So I think adding a workaround and
>>> documenting it is sufficient. What do you think?
>>
>> Working around it is fine, as it isn't a situation we really need to
>> worry about. But let's please not do it by poking at kobject internals.
>>
>
> This is already used in some places like sysfs_slab_unlink().
>
> Do we prefer to add a new hctx->state like BLK_MQ_S_REGISTERED?

If it's already used in a few spots, then I guess we should just be
using it as well rather than have a state around it. So I guess it's
fine. I'll just grab the patch.

-- 
Jens Axboe