On 2025/09/03 18:35, Yu Kuai wrote:
>On 2025/09/03 16:41, Xue He wrote:
>> On 2025/09/02 08:47, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>>
>> The information of my nvme is as follows:
>> number of CPUs: 16
>> memory: 16G
>> nvme nvme0: 16/0/16 default/read/poll queues
>> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
>> 1023
>>
>> More precisely, I think it is not that the tags are fully exhausted,
>> but rather that after scanning the bitmap for free bits, the remaining
>> contiguous bits are insufficient to meet the requirement (free bits
>> exist, but not enough of them). The specific function involved is
>> __sbitmap_queue_get_batch() in lib/sbitmap.c:
>>
>> 	get_mask = ((1UL << nr_tags) - 1) << nr;
>> 	if (nr_tags > 1) {
>> 		printk("before %ld\n", get_mask);
>> 	}
>> 	while (!atomic_long_try_cmpxchg(ptr, &val,
>> 					get_mask | val))
>> 		;
>> 	get_mask = (get_mask & ~val) >> nr;
>>
>> During the batch acquisition of contiguous free bits, an atomic
>> operation is performed, so the tag mask actually obtained can differ
>> from the one originally requested.
>
>Yes, so this function is likely to obtain fewer tags than nr_tags: the
>mask always starts from the first zero bit and spans nr_tags bits, and
>sbitmap_deferred_clear() is called unconditionally, so it's likely there
>are non-zero bits within this range.
>
>Just wondering, did you consider fixing this directly in
>__blk_mq_alloc_requests_batch()?
>
> - call sbitmap_deferred_clear() and retry on allocation failure, so
>that the whole word can be used even if previously allocated requests
>are done, especially for nvme with huge tag depths;
> - retry blk_mq_get_tags() until data->nr_tags is zero;
>

I haven't tried this yet, as I'm concerned that if it spins here, it
might introduce more latency. Anyway, I will try to implement this idea
and run some tests to observe the results.

Thanks.