On 2025/09/03 18:35 PM, Yu Kuai wrote:
>On 2025/09/03 16:41 PM, Xue He wrote:
>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>
>
>Yes, so this function is likely to obtain fewer tags than nr_tags: the
>mask always starts from the first zero bit and spans nr_tags bits, and
>since sbitmap_deferred_clear() is called unconditionally, there are
>likely non-zero bits within this range.
>
>Just wondering, did you consider fixing this directly in
>__blk_mq_alloc_requests_batch()?
>
> - call sbitmap_deferred_clear() and retry on allocation failure, so
>that the whole word can be used even if previously allocated requests
>are done, especially for nvme with huge tag depths;
> - retry blk_mq_get_tags() until data->nr_tags is zero;

Hi, Yu Kuai,

I'm not entirely sure I understand correctly, but during each tag
allocation sbitmap_deferred_clear() is typically called first, as seen
in __sbitmap_queue_get_batch():

	for (i = 0; i < sb->map_nr; i++) {
		struct sbitmap_word *map = &sb->map[index];
		unsigned long get_mask;
		unsigned int map_depth = __map_depth(sb, index);
		unsigned long val;

		sbitmap_deferred_clear(map, 0, 0, 0);
------------------------------------------------------------------------

So I tried retrying blk_mq_get_tags() until data->nr_tags is zero, like
this:

-	int i, nr = 0;
-	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
-	if (unlikely(!tag_mask))
-		return NULL;
-
-	tags = blk_mq_tags_from_data(data);
-	for (i = 0; tag_mask; i++) {
-		if (!(tag_mask & (1UL << i)))
-			continue;
-		tag = tag_offset + i;
-		prefetch(tags->static_rqs[tag]);
-		tag_mask &= ~(1UL << i);
-		rq = blk_mq_rq_ctx_init(data, tags, tag);
-		rq_list_add_head(data->cached_rqs, rq);
-		nr++;
-	}
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_add_active_requests(data->hctx, nr);
-	/* caller already holds a reference, add for remainder */
-	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
-	data->nr_tags -= nr;
+	do {
+		int i, nr = 0;
+		tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+		if (unlikely(!tag_mask))
+			return NULL;
+		tags = blk_mq_tags_from_data(data);
+		for (i = 0; tag_mask; i++) {
+			if (!(tag_mask & (1UL << i)))
+				continue;
+			tag = tag_offset + i;
+			prefetch(tags->static_rqs[tag]);
+			tag_mask &= ~(1UL << i);
+			rq = blk_mq_rq_ctx_init(data, tags, tag);
+			rq_list_add_head(data->cached_rqs, rq);
+			nr++;
+		}
+		if (!(data->rq_flags & RQF_SCHED_TAGS))
+			blk_mq_add_active_requests(data->hctx, nr);
+		/* caller already holds a reference, add for remainder */
+		percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
+		data->nr_tags -= nr;
+	} while (data->nr_tags);

I added this loop structure, and it still achieves good results, like
before. But I have a question: although the loop retries tag allocation
when the required number of tags has not been met, there is a risk of
an infinite loop, right? However, I couldn't think of a safer condition
to terminate the loop. Do you have any suggestions?

Thanks,
Xue
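
P.S. To make the termination question concrete, below is a rough
userspace-only sketch I put together (mock_get_tags() and
MAX_BATCH_RETRIES are stand-ins I made up for illustration; this is
not blk-mq code). It bounds the retry with two conditions: bail out as
soon as the allocator hands back an empty mask, and additionally cap
the number of attempts, so the loop cannot spin even if the allocator
keeps returning partial batches. I'm not sure whether such a cap is
the right condition for __blk_mq_alloc_requests_batch() itself, which
is exactly what I'd like your opinion on.

#include <stdio.h>
#include <stdlib.h>

/* Arbitrary cap on retries, analogous to bounding the do/while above. */
#define MAX_BATCH_RETRIES	4

/*
 * Stand-in for blk_mq_get_tags(): hand out a bitmask of at most 'want'
 * tags starting at *offset, possibly fewer than asked for, possibly none.
 */
static unsigned long mock_get_tags(unsigned int want, unsigned int *offset)
{
	static unsigned int next_tag;
	unsigned int got = rand() % (want + 1);

	if (!got)
		return 0;

	*offset = next_tag;
	next_tag += got;
	return (1UL << got) - 1;
}

int main(void)
{
	unsigned int nr_tags = 32;	/* how many requests the caller still wants */
	unsigned int tries = 0;

	while (nr_tags && tries++ < MAX_BATCH_RETRIES) {
		unsigned int offset, nr = 0;
		unsigned long mask = mock_get_tags(nr_tags, &offset);

		/* Allocator exhausted: stop retrying instead of looping forever. */
		if (!mask)
			break;

		for (int i = 0; mask; i++) {
			if (!(mask & (1UL << i)))
				continue;
			mask &= ~(1UL << i);
			/* real code would init a request for tag offset + i here */
			printf("got tag %u\n", offset + i);
			nr++;
		}
		nr_tags -= nr;
	}

	printf("still missing %u tags after %u attempts\n", nr_tags, tries);
	return 0;
}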