On 4/17/25 7:30 PM, Ming Lei wrote:
> On Wed, Apr 16, 2025 at 01:46:05PM -0600, Uday Shankar wrote:
>> Currently, ublk_drv associates to each hardware queue (hctx) a unique
>> task (called the queue's ubq_daemon) which is allowed to issue
>> COMMIT_AND_FETCH commands against the hctx. If any other task attempts
>> to do so, the command fails immediately with EINVAL. When considered
>> together with the block layer architecture, the result is that for each
>> CPU C on the system, there is a unique ublk server thread which is
>> allowed to handle I/O submitted on CPU C. This can lead to suboptimal
>> performance under imbalanced load generation. For an extreme example,
>> suppose all the load is generated on CPUs mapping to a single ublk
>> server thread. Then that thread may be fully utilized and become the
>> bottleneck in the system, while other ublk server threads are totally
>> idle.
>>
>> This issue can also be addressed directly in the ublk server without
>> kernel support by having threads dequeue I/Os and pass them around to
>> ensure even load. But this solution requires inter-thread communication
>> at least twice for each I/O (submission and completion), which is
>> generally a bad pattern for performance. The problem gets even worse
>> with zero copy, as more inter-thread communication would be required to
>> have the buffer register/unregister calls come from the correct
>> thread.
>>
>> Therefore, address this issue in ublk_drv by requiring a unique task per
>> I/O instead of per queue/hctx. Imbalanced load can then be balanced
>> across all ublk server threads by having threads issue FETCH_REQs in a
>> round-robin manner. As a small toy example, consider a system with a
>> single ublk device having 2 queues, each of queue depth 4.
>> A ublk server
>> having 4 threads could issue its FETCH_REQs against this device as
>> follows (where each entry is the qid,tag pair that the FETCH_REQ
>> targets):
>>
>> poller thread:  T0    T1    T2    T3
>>                 0,0   0,1   0,2   0,3
>>                 1,3   1,0   1,1   1,2
>>
>> Since tags appear to be allocated in sequential chunks, this setup
>> provides a rough approximation to distributing I/Os round-robin across
>> all ublk server threads, while letting I/Os stay fully thread-local.
>>
>> Signed-off-by: Uday Shankar <ushankar@xxxxxxxxxxxxxxx>
>> Reviewed-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
>> ---
>
> I guess this patch needs to be rebased against Jens's merge from
> yesterday.

Given the set of changes on both the io_uring and block side, I'm going
to rebase those trees on -rc3 once that is out. So yeah, I think
rebasing and reposting this series against block-6.15 now would
probably be a good idea, and it should then apply directly for the 6.16
tree.

> Given this change is big from the ublk server viewpoint, it should aim
> at v6.16

Agree.

-- 
Jens Axboe
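[Editor's note] The round-robin FETCH_REQ layout in the cover letter's toy
example can be sketched as follows. This is a hypothetical illustration,
not code from the ublk server or the patch; the assignment rule
`(qid + tag) % nthreads` is inferred from the table above (it reproduces
the staggered per-queue offsets shown there).

```python
def assign_fetch_reqs(nr_queues, depth, nthreads):
    """Map each (qid, tag) pair to a poller thread round-robin.

    Illustrative only: reproduces the staggered assignment from the
    cover letter's table, where queue qid's tags start at thread qid.
    """
    assignment = {t: [] for t in range(nthreads)}
    for qid in range(nr_queues):
        for tag in range(depth):
            # Stagger each queue's starting thread by its qid so that
            # consecutive tags (which tend to be allocated in sequential
            # chunks) spread evenly across all threads.
            assignment[(qid + tag) % nthreads].append((qid, tag))
    return assignment

# The toy example: 2 queues, depth 4, 4 ublk server threads.
for t, ios in assign_fetch_reqs(2, 4, 4).items():
    print(f"T{t}:", " ".join(f"{q},{tag}" for q, tag in ios))
# T0: 0,0 1,3
# T1: 0,1 1,0
# T2: 0,2 1,1
# T3: 0,3 1,2
```

Each thread then issues its FETCH_REQs only for the (qid, tag) pairs it
owns, so every I/O is handled start to finish on one thread with no
inter-thread handoff.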