On 4/17/25 7:30 PM, Ming Lei wrote:
> On Wed, Apr 16, 2025 at 01:46:05PM -0600, Uday Shankar wrote:
>> Currently, ublk_drv associates to each hardware queue (hctx) a unique
>> task (called the queue's ubq_daemon) which is allowed to issue
>> COMMIT_AND_FETCH commands against the hctx. If any other task attempts
>> to do so, the command fails immediately with EINVAL. When considered
>> together with the block layer architecture, the result is that for each
>> CPU C on the system, there is a unique ublk server thread which is
>> allowed to handle I/O submitted on CPU C. This can lead to suboptimal
>> performance under imbalanced load generation. For an extreme example,
>> suppose all the load is generated on CPUs mapping to a single ublk
>> server thread. Then that thread may be fully utilized and become the
>> bottleneck in the system, while other ublk server threads are totally
>> idle.
>>
>> This issue can also be addressed directly in the ublk server without
>> kernel support by having threads dequeue I/Os and pass them around to
>> ensure even load. But this solution requires inter-thread communication
>> at least twice for each I/O (submission and completion), which is
>> generally a bad pattern for performance. The problem gets even worse
>> with zero copy, as more inter-thread communication would be required to
>> have the buffer register/unregister calls come from the correct
>> thread.
>>
>> Therefore, address this issue in ublk_drv by requiring a unique task per
>> I/O instead of per queue/hctx. Imbalanced load can then be balanced
>> across all ublk server threads by having threads issue FETCH_REQs in a
>> round-robin manner. As a small toy example, consider a system with a
>> single ublk device having 2 queues, each of queue depth 4.
>> A ublk server
>> having 4 threads could issue its FETCH_REQs against this device as
>> follows (where each entry is the qid,tag pair that the FETCH_REQ
>> targets):
>>
>> poller thread:  T0    T1    T2    T3
>>                 0,0   0,1   0,2   0,3
>>                 1,3   1,0   1,1   1,2
>>
>> Since tags appear to be allocated in sequential chunks, this setup
>> provides a rough approximation to distributing I/Os round-robin across
>> all ublk server threads, while letting I/Os stay fully thread-local.
>>
>> Signed-off-by: Uday Shankar <ushankar@xxxxxxxxxxxxxxx>
>> Reviewed-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
>> ---
>
> I guess this patch needs to be rebased against Jens's merge from
> yesterday.

Given the set of changes on both the io_uring and block side, I'm going
to rebase those trees on -rc3 once that is out. So yeah, I think
rebasing and reposting this series against block-6.15 now would
probably be a good idea, and it should then apply directly for the 6.16
tree.

> Given this change is big from the ublk server viewpoint, it should aim
> at v6.16

Agree.

-- 
Jens Axboe
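[Editor's note] The round-robin FETCH_REQ layout in the cover letter's toy
example can be sketched as follows. This is a hypothetical illustration,
not code from the ublk server or the patch; the assignment rule
`(qid + tag) % nthreads` is inferred from the table above (it reproduces
the staggered per-queue offsets shown there).

```python
def assign_fetch_reqs(nr_queues, depth, nthreads):
    """Map each (qid, tag) pair to a poller thread round-robin.

    Illustrative only: reproduces the staggered assignment from the
    cover letter's table, where queue qid's tags start at thread qid.
    """
    assignment = {t: [] for t in range(nthreads)}
    for qid in range(nr_queues):
        for tag in range(depth):
            # Stagger each queue's starting thread by its qid so that
            # consecutive tags (which tend to be allocated in sequential
            # chunks) spread evenly across all threads.
            assignment[(qid + tag) % nthreads].append((qid, tag))
    return assignment

# The toy example: 2 queues, depth 4, 4 ublk server threads.
for t, ios in assign_fetch_reqs(2, 4, 4).items():
    print(f"T{t}:", " ".join(f"{q},{tag}" for q, tag in ios))
# T0: 0,0 1,3
# T1: 0,1 1,0
# T2: 0,2 1,1
# T3: 0,3 1,2
```

Each thread then issues its FETCH_REQs only for the (qid, tag) pairs it
owns, so every I/O is handled start to finish on one thread with no
inter-thread handoff.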