From: Yu Kuai <yukuai3@xxxxxxxxxx>

Changes from v2:
 - add elevator lock/unlock macros in patch 1;
 - improve coding style and commit messages;
 - retest with a new environment;
 - add tests for scsi HDD and nvme;

Changes from v1:
 - the ioc changes are sent separately;
 - change the patch 1-3 order as suggested by Damien;

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and bfq
can't scale very well.

For the dispatch_request method, the current behavior is to dispatch one
request at a time. With multiple dispatching contexts, this behavior, on
the one hand, introduces intense lock contention:

t1:                     t2:                     t3:
lock                    lock                    lock
// grab lock
ops.dispatch_request
unlock
                        // grab lock
                        ops.dispatch_request
                        unlock
                                                // grab lock
                                                ops.dispatch_request
                                                unlock

and, on the other hand, messes up the request dispatching order:

t1:                             t2:
lock
rq1 = ops.dispatch_request
unlock
                                lock
                                rq2 = ops.dispatch_request
                                unlock
lock
rq3 = ops.dispatch_request
unlock
                                lock
                                rq4 = ops.dispatch_request
                                unlock
// rq1, rq3 issued to disk
                                // rq2, rq4 issued to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4: the order of rq2 and rq3 is
reversed.

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be
called inside elevator methods, hence introducing a new method like
dispatch_requests is not possible.

In conclusion, this series factors the global lock out of the
dispatch_request method, and supports request batch dispatch by calling
the method multiple times while holding the lock.
Test Environment:
arm64 Kunpeng-920, with 4 nodes and 128 cores
nvme: HWE52P431T6M005N
scsi HDD: MG04ACA600E attached to hisi_sas_v3

null_blk set up:

modprobe null_blk nr_devices=0 &&
udevadm settle &&
cd /sys/kernel/config/nullb &&
mkdir nullb0 &&
cd nullb0 &&
echo 0 > completion_nsec &&
echo 512 > blocksize &&
echo 0 > home_node &&
echo 0 > irqmode &&
echo 128 > submit_queues &&
echo 1024 > hw_queue_depth &&
echo 1024 > size &&
echo 0 > memory_backed &&
echo 2 > queue_mode &&
echo 1 > power || exit $?

null_blk and nvme test script:

[global]
filename=/dev/{nullb0,nvme0n1}
rw=randwrite
bs=4k
iodepth=32
iodepth_batch_submit=8
iodepth_batch_complete=8
direct=1
ioengine=io_uring
time_based

[write]
numjobs=16
runtime=60

scsi HDD test script, note that this test aims to check whether batch
dispatch affects IO merging:

[global]
filename=/dev/sda
rw=write
bs=4k
iodepth=32
iodepth_batch_submit=1
direct=1
ioengine=libaio

[write]
offset_increment=1g
numjobs=128

Test Result:

1) null_blk: iops test with high IO pressure

|                 | deadline | bfq      |
| --------------- | -------- | -------- |
| before this set | 256k     | 153k     |
| after this set  | 594k     | 283k     |

2) nvme: iops test with high IO pressure

|                 | deadline | bfq      |
| --------------- | -------- | -------- |
| before this set | 258k     | 142k     |
| after this set  | 568k     | 214k     |

3) scsi HDD: io merge test, elevator is deadline

|                 | w/s   | %wrqm | wareq-sz | aqu-sz |
| --------------- | ----- | ----- | -------- | ------ |
| before this set | 92.25 | 96.88 | 128      | 129    |
| after this set  | 92.63 | 96.88 | 128      | 129    |

Yu Kuai (5):
  blk-mq-sched: introduce high level elevator lock
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   6 +-
 block/bfq-iosched.c  |  53 +++++-----
 block/bfq-iosched.h  |   2 -
 block/blk-mq-sched.c | 246 ++++++++++++++++++++++++++++++-------------
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |  14 ++-
 block/mq-deadline.c  |  60 +++++------
 8 files changed, 263 insertions(+), 140 deletions(-)

-- 
2.39.2