On 7/18/25 12:08 AM, Damien Le Moal wrote:
> How did you test this ?

Hi Damien,

This patch series has been tested as follows:
- In an x86-64 VM:
  - By running blktests.
  - By running the attached two scripts. test-pipelining-zoned-writes
    submits small writes sequentially and has been used to compare IOPS
    with and without write pipelining. test-pipelining-and-requeuing
    submits sequential or random writes. This script has been used to
    verify that the HOST BUSY and UNALIGNED WRITE conditions are handled
    correctly for both I/O patterns.
- On an ARM development board with a ZUFS device, by running a multitude
  of I/O patterns on top of F2FS and a ZUFS device with data verification
  enabled.

> I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
> fine since the commands in the submission queues of a PCI controller are always
> handled in order. So I added:
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index cce4c5b55aa9..36d16b8d3f37 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
>  void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
>                 struct nvme_zone_info *zi)
>  {
> -       lim->features |= BLK_FEAT_ZONED;
> +       lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
>         lim->max_open_zones = zi->max_open_zones;
>         lim->max_active_zones = zi->max_active_zones;
>         lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
>
> And ran this:
>
> fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
>         --direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
>         --numjobs=1
>
> And I get unaligned write errors 100% of the time. Looking at your patches
> again, you are not handling REQ_NOWAIT case in blk_zone_wplug_handle_write(). If
> you get REQ_NOWAIT BIO, which io_uring will issue, the code goes directly to
> plugging the BIO, thus bypassing your from_cpu handling.

Didn't Jens recommend libaio instead of io_uring for zoned storage? See also
https://lore.kernel.org/linux-block/8c0f9d28-d68f-4800-b94f-1905079d4007@xxxxxxxxx/T/#mb61b6d1294da76a9f1be38edf6dceaf703112335.
I ran all my tests with libaio instead of io_uring.

> But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.

While this patch series addresses most potential causes of reordering by
the block layer, it does not address all of them. One cause that has not
been addressed can be found in blk_mq_insert_requests(). That function
inserts requests either in a software queue or in a hardware queue.
Bypassing the software queue for some requests can cause reordering.

Another example can be found in blk_mq_dispatch_rq_list(). If the block
driver responds with BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
requests that have not been accepted by the block driver are added to
the hctx->dispatch list. If these requests came from a software queue,
adding them to hctx->dispatch instead of putting them back in their
original position in the software queue can cause reordering.

Patches 8 and 9 work around this by retrying writes in the unlikely case
that reordering happens. I think this is a more pragmatic solution than
making more changes in the block layer to make it fully preserve the
request order. In the traces that I gathered and inspected, I did not
see any UNALIGNED WRITE errors being reported by ZUFS devices.

Thanks,

Bart.

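To illustrate the retry idea discussed above, here is a minimal toy model in bash. It is not kernel code and does not reflect how patches 8 and 9 are actually implemented; the names wp, rejected and submit_write are made up for this sketch. It assumes a single sequential-write zone that accepts a write only when the write offset equals the write pointer, and a host that resubmits rejected writes sorted by offset.

```bash
#!/bin/bash
# Toy model (assumption, not kernel code): a sequential-write zone accepts a
# write only if its offset equals the current write pointer (wp).
wp=0
rejected=()

# submit_write OFFSET: advance wp on success, remember the offset otherwise.
submit_write() {
    local offset=$1
    if [ "$offset" -eq "$wp" ]; then
        wp=$((wp + 1))
    else
        echo "UNALIGNED WRITE: offset=$offset wp=$wp"
        rejected+=("$offset")
    fi
}

# Writes 0..5 were issued in order, but 2 and 3 were swapped on the way down.
for offset in 0 1 3 2 4 5; do
    submit_write "$offset"
done

# Retry: resubmit the rejected writes sorted by offset until none are left.
while [ "${#rejected[@]}" -gt 0 ]; do
    mapfile -t pending < <(printf '%s\n' "${rejected[@]}" | sort -n)
    rejected=()
    for offset in "${pending[@]}"; do
        submit_write "$offset"
    done
done
echo "final wp=$wp"
```

With the submission order above, the writes at offsets 3, 4 and 5 are rejected on the first pass and accepted on the retry, so a single swap costs one extra round trip for every write that was queued behind it.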
#!/bin/bash

set -eu

stop_tracing() {
    if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi
    echo 0 >/sys/kernel/tracing/tracing_on
}

tracing_active() {
    [ "$(cat /sys/kernel/tracing/tracing_on)" = 1 ]
}

start_tracing() {
    rm -f /tmp/block-trace.txt
    stop_tracing
    (
        cd /sys/kernel/tracing
        echo nop > current_tracer
        echo > trace
        echo 0 > events/enable
        echo 1 > events/block/enable
        echo 0 > events/block/block_dirty_buffer/enable
        echo 0 > events/block/block_touch_buffer/enable
        echo 1 > tracing_on
        cat trace_pipe >/tmp/block-trace.txt
    ) &
    tracing_pid=$!
    while ! tracing_active; do
        sleep .1
    done
}

end_tracing() {
    if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
    stop_tracing
}

qd=${1:-64}
# Log error recovery actions
echo 63 > /sys/module/scsi_mod/parameters/scsi_logging_level
if modprobe -r scsi_debug; then :; fi
params=(
    delay=0
    dev_size_mb=256
    every_nth=$((2 * qd))
    max_queue="${qd}"
    ndelay=100000         # 100 us
    opts=0x28000          # SDEBUG_OPT_UNALIGNED_WRITE | SDEBUG_OPT_HOST_BUSY
    preserves_write_order=1
    sector_size=4096
    zbc=host-managed
    zone_nr_conv=0
    zone_size_mb=4
)
modprobe scsi_debug "${params[@]}"
while true; do
    bdev=$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block && echo *) 2>/dev/null
    if [ -e /dev/"${bdev}" ]; then break; fi
    sleep .1
done
dev=/dev/"${bdev}"
[ -b "${dev}" ]
for rw in write randwrite; do
    start_tracing
    params=(
        --direct=1
        --filename="${dev}"
        --iodepth="${qd}"
        --iodepth_batch=$(((qd + 3) / 4))
        --ioengine=libaio
        --ioscheduler=none
        --gtod_reduce=1
        --name="$(basename "${dev}")"
        --runtime=30
        --rw="$rw"
        --time_based=1
        --zonemode=zbd
    )
    fio "${params[@]}"
    rc=$?
    end_tracing
    if grep -avH " ref 1 " "/sys/kernel/debug/block/${bdev}/zone_wplugs"; then
        echo
        echo "Detected one or more reference count leaks!"
        break
    fi
    echo ''
    [ $rc = 0 ] || break
done
echo 0 > /sys/module/scsi_mod/parameters/scsi_logging_level

#!/bin/bash

set -e

run_cmd() {
    if [ -z "$android" ]; then
        eval "$1"
    else
        adb shell "$1"
    fi
}

tracing_active() {
    [ "$(run_cmd "cat /sys/kernel/tracing/tracing_on")" = 1 ]
}

start_tracing() {
    rm -f /tmp/block-trace.txt
    cmd="(if [ ! -e /sys/kernel/tracing/trace ]; then mount -t tracefs none /sys/kernel/tracing; fi &&
        cd /sys/kernel/tracing &&
        if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
        echo 0 > tracing_on &&
        echo nop > current_tracer &&
        echo > trace &&
        echo 0 > events/enable &&
        echo 1 > events/block/enable &&
        echo 0 > events/block/block_dirty_buffer/enable &&
        echo 0 > events/block/block_touch_buffer/enable &&
        if [ -e events/nullb ]; then echo 1 > events/nullb/enable; fi &&
        echo 1 > tracing_on &&
        cat trace_pipe)"
    run_cmd "$cmd" >"/tmp/block-trace-$1.txt" &
    tracing_pid=$!
    while ! tracing_active; do
        sleep .1
    done
}

end_tracing() {
    sleep 5
    if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
    run_cmd "cd /sys/kernel/tracing && if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi && echo 0 >/sys/kernel/tracing/tracing_on"
}

android=
fastest_cpucore=
tracing=
while [ "${1#-}" != "$1" ]; do
    case "$1" in
        -a) android=true; shift;;
        -t) tracing=true; shift;;
        *) usage;;
    esac
done

set -u
if [ -n "${android}" ]; then
    adb root 1>&2
    adb push ~/software/fio/fio /tmp >&/dev/null
    adb push ~/software/util-linux/blkzone /tmp >&/dev/null
    fastest_cpucore=$(adb shell 'grep -aH . /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq 2>/dev/null' |
                          sed 's/:/ /' |
                          sort -rnk2 |
                          head -n1 |
                          sed -e 's|/sys/devices/system/cpu/cpu||;s|/cpufreq.*||')
    if [ -z "$fastest_cpucore" ]; then
        fastest_cpucore=$(($(adb shell nproc) - 1))
    fi
    [ -n "$fastest_cpucore" ]
fi

for mode in "none 0" "none 1" "mq-deadline 0" "mq-deadline 1"; do
    for d in /sys/kernel/config/nullb/*; do
        if [ -d "$d" ] && rmdir "$d"; then :; fi
    done
    read -r iosched preserves_write_order <<<"$mode"
    echo "==== iosched=$iosched preserves_write_order=$preserves_write_order"
    if [ -z "$android" ]; then
        if true; then
            if modprobe -r scsi_debug; then :; fi
            params=(
                ndelay=100000         # 100 us
                host_max_queue=64
                preserves_write_order="${preserves_write_order}"
                dev_size_mb=1024      # 1 GiB
                submit_queues="$(nproc)"
                zone_size_mb=1        # 1 MiB
                zone_nr_conv=0
                zbc=2
            )
            modprobe scsi_debug "${params[@]}"
            udevadm settle
            dev=/dev/$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block && echo *)
            basename=$(basename "${dev}")
        else
            if modprobe -r null_blk; then :; fi
            modprobe null_blk nr_devices=0
            (
                cd /sys/kernel/config/nullb
                mkdir nullb0
                cd nullb0
                params=(
                    completion_nsec=100000   # 100 us
                    hw_queue_depth=64
                    irqmode=2                # NULL_IRQ_TIMER
                    max_sectors=$((4096/512))
                    memory_backed=1
                    preserves_write_order="${preserves_write_order}"
                    size=1                   # 1 GiB
                    submit_queues="$(nproc)"
                    zone_size=1              # 1 MiB
                    zoned=1
                    power=1
                )
                for p in "${params[@]}"; do
                    if ! echo "${p//*=}" > "${p//=*}"; then
                        echo "$p"
                        exit 1
                    fi
                done
            )
            basename=nullb0
            dev=/dev/${basename}
            udevadm settle
        fi
        [ -b "${dev}" ]
    else
        # Retrieve the device name assigned to the zoned logical unit.
        basename=$(adb shell grep -lvw 0 /sys/class/block/sd*/queue/chunk_sectors 2>/dev/null |
                       sed 's|/sys/class/block/||g;s|/queue/chunk_sectors||g')
        # Disable block layer request merging.
        dev="/dev/block/${basename}"
    fi
    run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
    # 0: disable I/O statistics
    run_cmd "echo 0 > /sys/class/block/${basename}/queue/iostats"
    # 2: do not attempt any merges
    run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
    # 2: complete on the requesting CPU
    run_cmd "echo 2 > /sys/class/block/${basename}/queue/rq_affinity"
    if [ -n "${tracing}" ]; then
        start_tracing "${iosched}-${preserves_write_order}"
    fi
    params1=(
        --name=trim
        --filename="${dev}"
        --direct=1
        --end_fsync=1
        --ioengine=pvsync
        --gtod_reduce=1
        --rw=trim
        --size=100%
        --thread=1
        --zonemode=zbd
    )
    params2=(
        --name=measure-iops
        --filename="${dev}"
        --direct=1
        --ioscheduler="${iosched}"
        --gtod_reduce=1
        --runtime=30
        --rw=write
        --thread=1
        --time_based=1
        --zonemode=zbd
    )
    if [ -n "$fastest_cpucore" ]; then
        fio_args+=(--cpus_allowed="${fastest_cpucore}")
    fi
    if [ "$preserves_write_order" = 1 ]; then
        params2+=(
            --ioengine=libaio
            --iodepth=64
            --iodepth_batch=16
        )
    else
        params2+=(
            --ioengine=pvsync2
        )
    fi
    set +e
    echo "fio ${params2[*]}"
    # Finish all open zones to prevent that the maximum number of open zones is
    # exceeded. Next, trim all zones and measure IOPS.
    if [ -z "$android" ]; then
        blkzone finish "${dev}"
        fio "${params1[@]}" >"/tmp/fio-trim-${iosched}-${preserves_write_order}.txt"
        fio "${params2[@]}"
    else
        adb shell /tmp/blkzone finish "${dev}"
        adb shell /tmp/fio "${params1[@]}" >/dev/null
        adb shell /tmp/fio "${params2[@]}"
    fi
    ret=$?
    set -e
    if [ -n "${tracing}" ]; then
        end_tracing
    fi
    [ "$ret" = 0 ] || break
done