Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices

Bart Van Assche <bvanassche@xxxxxxx> · Fri, 18 Jul 2025 11:30:05 -0700

On 7/18/25 12:08 AM, Damien Le Moal wrote:
> How did you test this?

Hi Damien,

This patch series has been tested as follows:
- In an x86-64 VM:
  - By running blktests.
  - By running the attached two scripts. test-pipelining-zoned-writes
    submits small writes sequentially and has been used to compare IOPS
    with and without write pipelining. test-pipelining-and-requeuing
    submits sequential or random writes. This script has
    been used to verify that the HOST BUSY and UNALIGNED WRITE
    conditions are handled correctly for both I/O patterns.
- On an ARM development board with a ZUFS device, by running a multitude
  of I/O patterns on top of F2FS and a ZUFS device with data
  verification enabled.

> I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
> fine since the commands in the submission queues of a PCI controller are always
> handled in order. So I added:
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index cce4c5b55aa9..36d16b8d3f37 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
>   void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
>                  struct nvme_zone_info *zi)
>   {
> -       lim->features |= BLK_FEAT_ZONED;
> +       lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
>          lim->max_open_zones = zi->max_open_zones;
>          lim->max_active_zones = zi->max_active_zones;
>          lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
>
> And ran this:
>
> fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
> 	--direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
> 	--numjobs=1
>
> And I get unaligned write errors 100% of the time. Looking at your patches
> again, you are not handling REQ_NOWAIT case in blk_zone_wplug_handle_write(). If
> you get REQ_NOWAIT BIO, which io_uring will issue, the code goes directly to
> plugging the BIO, thus bypassing your from_cpu handling.

Didn't Jens recommend libaio instead of io_uring for zoned storage? See
also
https://lore.kernel.org/linux-block/8c0f9d28-d68f-4800-b94f-1905079d4007@xxxxxxxxx/T/#mb61b6d1294da76a9f1be38edf6dceaf703112335
I ran all my tests with libaio instead of io_uring.

> But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.

While this patch series addresses most potential causes of reordering by
the block layer, it does not address all possible causes of reordering.
An example of a potential cause of reordering that has not been
addressed by this patch series can be found in blk_mq_insert_requests().
That function inserts requests in either a software queue or a hardware
queue. Bypassing the software queue for some requests can cause
reordering.
Another example can be found in blk_mq_dispatch_rq_list(). If the block
driver responds with BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
requests that have not been accepted by the block driver are added to
the hctx->dispatch list. If these requests came from a software queue,
adding them to hctx->dispatch instead of putting them back in
their original position in the software queue can cause reordering.

Patches 8 and 9 work around this by retrying writes in the unlikely case
that reordering happens. I think this is a more pragmatic solution than
making more changes in the block layer to make it fully preserve the
request order. In the traces that I gathered and that I inspected, I
did not see any UNALIGNED WRITE errors being reported by ZUFS devices.
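
As a rough illustration of why retrying is sufficient, here is a toy bash
model of a sequential-write-required zone. It is only a sketch and not the
logic of patches 8 and 9: submit_write, wp, pending and retries are names
made up for this example, and the "device" is a single integer write
pointer. A write is accepted only if it starts at the write pointer;
rejected writes are resubmitted in ascending offset order.

```shell
#!/bin/bash
# Toy model only: this is NOT the kernel implementation.

wp=0            # write pointer of the simulated zone, in blocks
retries=0       # number of writes that had to be resubmitted

# submit_write OFFSET LENGTH: succeeds only if OFFSET equals the write pointer.
submit_write() {
    local offset=$1 len=$2
    if [ "$offset" -ne "$wp" ]; then
        return 1        # models an UNALIGNED WRITE error
    fi
    wp=$((wp + len))
    return 0
}

# Writes arrive reordered: offset 2 is submitted before offset 1.
pending=(0 2 1 3)
while [ "${#pending[@]}" -gt 0 ]; do
    failed=()
    for offset in "${pending[@]}"; do
        if ! submit_write "$offset" 1; then
            failed+=("$offset")
            retries=$((retries + 1))
        fi
    done
    # Resubmit the rejected writes in ascending offset order.
    if [ "${#failed[@]}" -gt 0 ]; then
        mapfile -t pending < <(printf '%s\n' "${failed[@]}" | sort -n)
    else
        pending=()
    fi
done
echo "write pointer=$wp retries=$retries"
```

In this model every write eventually lands at the write pointer, at the
cost of one extra submission per reordered write.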

Thanks,

Bart.
#!/bin/bash

set -eu

stop_tracing() {
    if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi
    echo 0 >/sys/kernel/tracing/tracing_on
}

tracing_active() {
    [ "$(cat /sys/kernel/tracing/tracing_on)" = 1 ]
}

start_tracing() {
    rm -f /tmp/block-trace.txt
    stop_tracing
    (
	cd /sys/kernel/tracing
	echo nop > current_tracer
	echo > trace
	echo 0 > events/enable
	echo 1 > events/block/enable
	echo 0 > events/block/block_dirty_buffer/enable
	echo 0 > events/block/block_touch_buffer/enable
	echo 1 > tracing_on
	cat trace_pipe >/tmp/block-trace.txt
    ) &
    tracing_pid=$!
    while ! tracing_active; do
	sleep .1
    done
}

end_tracing() {
    if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
    stop_tracing
}

qd=${1:-64}
# Log error recovery actions
echo 63 > /sys/module/scsi_mod/parameters/scsi_logging_level
if modprobe -r scsi_debug; then :; fi
params=(
	delay=0
	dev_size_mb=256
	every_nth=$((2 * qd))
	max_queue="${qd}"
	ndelay=100000           # 100 us
	opts=0x28000            # SDEBUG_OPT_UNALIGNED_WRITE | SDEBUG_OPT_HOST_BUSY
	preserves_write_order=1
	sector_size=4096
	zbc=host-managed
	zone_nr_conv=0
	zone_size_mb=4
)
modprobe scsi_debug "${params[@]}"
while true; do
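	# Tolerate a missing sysfs directory while the device is still appearing.
	bdev=$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block 2>/dev/null && echo *) || bdev=
	if [ -n "${bdev}" ] && [ -e /dev/"${bdev}" ]; then break; fi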
	sleep .1
done
dev=/dev/"${bdev}"
[ -b "${dev}" ]
for rw in write randwrite; do
    start_tracing
    params=(
	--direct=1
	--filename="${dev}"
	--iodepth="${qd}"
	--iodepth_batch=$(((qd + 3) / 4))
	--ioengine=libaio
	--ioscheduler=none
	--gtod_reduce=1
	--name="$(basename "${dev}")"
	--runtime=30
	--rw="$rw"
	--time_based=1
	--zonemode=zbd
    )
    # Record the exit status without letting set -e abort before end_tracing.
    rc=0
    fio "${params[@]}" || rc=$?
    end_tracing
    if grep -avH " ref 1 " "/sys/kernel/debug/block/${bdev}/zone_wplugs"; then
	echo
	echo "Detected one or more reference count leaks!"
	break
    fi
    echo ''
    [ $rc = 0 ] || break
done
echo 0 > /sys/module/scsi_mod/parameters/scsi_logging_level
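
A note on exit-status handling in scripts that run with set -e, as a
standalone sketch: a command that fails on a line of its own terminates the
script before a following rc=$? can run, so the status has to be captured
in the same list as the command. run_step below is a hypothetical stand-in
for a command that may fail, such as fio.

```shell
#!/bin/bash
set -e

# run_step is a hypothetical stand-in for a command that may fail.
run_step() { return 3; }

rc=0
run_step || rc=$?      # the || list keeps set -e from terminating the script
echo "cleanup still runs, rc=$rc"
```

With this idiom, cleanup steps placed after the command still run when the
command fails, and the saved status can be checked afterwards.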
#!/bin/bash

set -e

run_cmd() {
    if [ -z "$android" ]; then
	eval "$1"
    else
	adb shell "$1"
    fi
}

tracing_active() {
    [ "$(run_cmd "cat /sys/kernel/tracing/tracing_on")" = 1 ]
}

start_tracing() {
    rm -f "/tmp/block-trace-$1.txt"
    cmd="(if [ ! -e /sys/kernel/tracing/trace ]; then mount -t tracefs none /sys/kernel/tracing; fi &&
	cd /sys/kernel/tracing &&
	if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
	echo 0 > tracing_on &&
	echo nop > current_tracer &&
	echo > trace &&
	echo 0 > events/enable &&
	echo 1 > events/block/enable &&
	echo 0 > events/block/block_dirty_buffer/enable &&
	echo 0 > events/block/block_touch_buffer/enable &&
	if [ -e events/nullb ]; then echo 1 > events/nullb/enable; fi &&
	echo 1 > tracing_on &&
	cat trace_pipe)"
    run_cmd "$cmd" >"/tmp/block-trace-$1.txt" &
    tracing_pid=$!
    while ! tracing_active; do
	sleep .1
    done
}

end_tracing() {
    sleep 5
    if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
    run_cmd "cd /sys/kernel/tracing &&
	if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
	echo 0 >/sys/kernel/tracing/tracing_on"
}

android=
fastest_cpucore=
tracing=

while [ "${1#-}" != "$1" ]; do
    case "$1" in
	-a)
	    android=true; shift;;
	-t)
	    tracing=true; shift;;
	*)
	    echo "Usage: $0 [-a] [-t] (-a: run on an Android device via adb; -t: capture a block trace)" >&2; exit 1;;
    esac
done

set -u

if [ -n "${android}" ]; then
    adb root 1>&2
    adb push ~/software/fio/fio /tmp >&/dev/null
    adb push ~/software/util-linux/blkzone /tmp >&/dev/null
    fastest_cpucore=$(adb shell 'grep -aH . /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq 2>/dev/null' |
		      sed 's/:/ /' |
		      sort -rnk2 |
		      head -n1 |
		      sed -e 's|/sys/devices/system/cpu/cpu||;s|/cpufreq.*||')
    if [ -z "$fastest_cpucore" ]; then
	fastest_cpucore=$(($(adb shell nproc) - 1))
    fi
    [ -n "$fastest_cpucore" ]
fi

for mode in "none 0" "none 1" "mq-deadline 0" "mq-deadline 1"; do
    for d in /sys/kernel/config/nullb/*; do
	if [ -d "$d" ] && rmdir "$d"; then :; fi
    done
    read -r iosched preserves_write_order <<<"$mode"
    echo "==== iosched=$iosched preserves_write_order=$preserves_write_order"
    if [ -z "$android" ]; then
	if true; then
	    if modprobe -r scsi_debug; then :; fi
	    params=(
		ndelay=100000            # 100 us
		host_max_queue=64
		preserves_write_order="${preserves_write_order}"
		dev_size_mb=1024         # 1 GiB
		submit_queues="$(nproc)"
		zone_size_mb=1           # 1 MiB
		zone_nr_conv=0
		zbc=2
	    )
	    modprobe scsi_debug "${params[@]}"
	    udevadm settle
	    dev=/dev/$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block && echo *)
	    basename=$(basename "${dev}")
	else
	    if modprobe -r null_blk; then :; fi
	    modprobe null_blk nr_devices=0
	    (
		cd /sys/kernel/config/nullb
		mkdir nullb0
		cd nullb0
		params=(
		    completion_nsec=100000   # 100 us
		    hw_queue_depth=64
		    irqmode=2                # NULL_IRQ_TIMER
		    max_sectors=$((4096/512))
		    memory_backed=1
		    preserves_write_order="${preserves_write_order}"
		    size=1                   # 1 GiB
		    submit_queues="$(nproc)"
		    zone_size=1              # 1 MiB
		    zoned=1
		    power=1
		)
		for p in "${params[@]}"; do
		    if ! echo "${p//*=}" > "${p//=*}"; then
			echo "$p"
			exit 1
		    fi
		done
	    )
	    basename=nullb0
	    dev=/dev/${basename}
	    udevadm settle
	fi
	[ -b "${dev}" ]
    else
	# Retrieve the device name assigned to the zoned logical unit.
	# The command is quoted so that the glob is expanded on the device.
	basename=$(adb shell 'grep -lvw 0 /sys/class/block/sd*/queue/chunk_sectors 2>/dev/null' |
			     sed 's|/sys/class/block/||g;s|/queue/chunk_sectors||g')
	dev="/dev/block/${basename}"
    fi
    run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
    # 0: disable I/O statistics
    run_cmd "echo 0 > /sys/class/block/${basename}/queue/iostats"
    # 2: do not attempt any merges
    run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
    # 2: complete on the requesting CPU
    run_cmd "echo 2 > /sys/class/block/${basename}/queue/rq_affinity"
    if [ -n "${tracing}" ]; then
	start_tracing "${iosched}-${preserves_write_order}"
    fi
    params1=(
	--name=trim
	--filename="${dev}"
	--direct=1
	--end_fsync=1
	--ioengine=pvsync
	--gtod_reduce=1
	--rw=trim
	--size=100%
	--thread=1
	--zonemode=zbd
    )
    params2=(
	--name=measure-iops
	--filename="${dev}"
	--direct=1
	--ioscheduler="${iosched}"
	--gtod_reduce=1
	--runtime=30
	--rw=write
	--thread=1
	--time_based=1
	--zonemode=zbd
    )
    if [ -n "$fastest_cpucore" ]; then
	params2+=(--cpus_allowed="${fastest_cpucore}")
    fi
    if [ "$preserves_write_order" = 1 ]; then
	params2+=(
	    --ioengine=libaio
	    --iodepth=64
	    --iodepth_batch=16
	)
    else
	params2+=(
	    --ioengine=pvsync2
	)
    fi
    set +e
    echo "fio ${params2[*]}"
    # Finish all open zones so that the maximum number of open zones is not
    # exceeded. Next, trim all zones and measure IOPS.
    if [ -z "$android" ]; then
	blkzone finish "${dev}"
	fio "${params1[@]}" >"/tmp/fio-trim-${iosched}-${preserves_write_order}.txt"
	fio "${params2[@]}"
    else
	adb shell /tmp/blkzone finish "${dev}"
	adb shell /tmp/fio "${params1[@]}" >/dev/null
	adb shell /tmp/fio "${params2[@]}"
    fi
    ret=$?
    set -e
    if [ -n "${tracing}" ]; then
	end_tracing
    fi
    [ "$ret" = 0 ] || break
done
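
The null_blk branch of the script above splits each key=value entry with
the pattern substitutions ${p//=*} and ${p//*=}. The sketch below shows the
same split with the prefix and suffix removal forms of parameter
expansion, which give the same result for entries that contain a single
'='. The scratch directory is an assumption made for this example; it
stands in for the configfs directory so that the loop can be tried without
loading null_blk.

```shell
#!/bin/bash
set -eu

# Hypothetical scratch directory standing in for /sys/kernel/config/nullb/nullb0.
dir=$(mktemp -d)
params=(
    hw_queue_depth=64
    zone_size=1
    zoned=1
)
for p in "${params[@]}"; do
    key=${p%%=*}      # drop the first '=' and everything after it
    value=${p#*=}     # drop everything up to and including the first '='
    echo "$value" > "$dir/$key"
done
# Show each attribute file together with its contents.
grep -H . "$dir"/*
```

Each attribute ends up in a file named after its key, which mirrors how
configfs attributes are written one file at a time.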