On 7/9/25 7:07 AM, Bart Van Assche wrote:
> Support pipelining of zoned writes if the block driver preserves the write
> order per hardware queue. Track per zone to which software queue writes
> have been queued. If zoned writes are pipelined, submit new writes to the
> same software queue as the writes that are already in progress. This
> prevents reordering by submitting requests for the same zone to different
> software or hardware queues.
>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Damien Le Moal <dlemoal@xxxxxxxxxx>
> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>

[...]

> @@ -764,14 +767,18 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
>  static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
>  					      struct blk_zone_wplug *zwplug)
>  {
> +	int cpu;
> +
> +	lockdep_assert_held(&zwplug->lock);
> +
>  	/*
>  	 * Take a reference on the zone write plug and schedule the submission
>  	 * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
>  	 * reference we take here.
>  	 */
> -	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
>  	refcount_inc(&zwplug->ref);
> -	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
> +	cpu = zwplug->from_cpu >= 0 ? zwplug->from_cpu : WORK_CPU_UNBOUND;
> +	queue_work_on(cpu, disk->zone_wplugs_wq, &zwplug->bio_work);

Please change this to a more readable form with an expanded "if". The local
variable "cpu" should probably be named "work_cpu" for clarity too.

>  }
>
>  static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
> @@ -932,7 +939,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
>  	 * We know such BIO will fail, and that would potentially overflow our
>  	 * write pointer offset beyond the end of the zone.
>  	 */
> -	if (disk_zone_wplug_is_full(disk, zwplug))
> +	if (!disk->queue->limits.driver_preserves_write_order
> +	    && disk_zone_wplug_is_full(disk, zwplug))

Writing to a zone that is full is an error, pipelining or not. So why do you
change this?
This does not make sense.

>  		return false;
>
>  	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> @@ -956,7 +964,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
>  		 * with a start sector not unaligned to the zone write pointer
>  		 * will fail.
>  		 */
> -		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
> +		if (!disk->queue->limits.driver_preserves_write_order
> +		    && bio_offset_from_zone_start(bio) != zwplug->wp_offset)

Same here. This does not depend on pipelining: writes should have been
received in order and be aligned with the wp. So why change this condition?

>  			return false;
>  	}
>
> @@ -1033,8 +1044,23 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>  		return true;
>  	}
>
> -	/* Otherwise, plug and submit the BIO. */
> -	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> +	if (dpwo && zwplug->from_cpu < 0) {
> +		/* No zoned writes are in progress. Select the current CPU. */
> +		zwplug->from_cpu = raw_smp_processor_id();
> +		goto plug;
> +	} else if (dpwo) {
> +		/*
> +		 * The block driver preserves the write order. Submit the bio
> +		 * from zwplug->from_cpu.
> +		 */
> +		goto plug;

Can you change this to:

	if (dpwo) {
		/*
		 * The block driver preserves the write order: if we do not
		 * have any writes in progress already, use the current CPU
		 * to submit the BIO. Otherwise, keep using zwplug->from_cpu.
		 */
		if (zwplug->from_cpu < 0)
			zwplug->from_cpu = raw_smp_processor_id();
		goto plug;
	}

	/*
	 * The block driver does not preserve the write order. Plug and
	 * submit the BIO.
	 */
	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;

That is a lot simpler and easier to read.

> +	} else {
> +		/*
> +		 * The block driver does not preserve the write order. Plug and
> +		 * submit the BIO.
> +		 */
> +		zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> +	}
>
>  	spin_unlock_irqrestore(&zwplug->lock, flags);
>
> @@ -1298,6 +1329,9 @@ static void blk_zone_submit_one_bio(struct blk_zone_wplug *zwplug)
>  	} else {
>  		blk_mq_submit_bio(bio);
>  	}
> +
> +	return disk->queue->limits.driver_preserves_write_order &&
> +		!need_resched();

I think we really need a helper for that
"disk->queue->limits.driver_preserves_write_order". But if you make this a
feature, it will be better.

Also, here, the test using need_resched() really needs a comment explaining
why you look at that. I do not get it personally.

>  }
>
>  static void blk_zone_wplug_bio_work(struct work_struct *work)
> @@ -1305,7 +1339,8 @@
>  	struct blk_zone_wplug *zwplug =
>  		container_of(work, struct blk_zone_wplug, bio_work);
>
> -	blk_zone_submit_one_bio(zwplug);
> +	while (blk_zone_submit_one_bio(zwplug))
> +		;

So the split in patch 6 created that blk_zone_submit_one_bio() function, but
all you do here is this change. I really do not see how that is better. Let's
not do that and simply have the loop expanded here to something like:

	do {
		...
	} while (disk->queue->limits.driver_preserves_write_order &&
		 !need_resched());

>
>  	/* Drop the reference we took in disk_zone_wplug_schedule_bio_work().
>  	 */
>  	disk_put_zone_wplug(zwplug);
> @@ -1831,6 +1866,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
>  	unsigned int zwp_zone_no, zwp_ref;
>  	unsigned int zwp_bio_list_size;
>  	unsigned long flags;
> +	int from_cpu;
>
>  	spin_lock_irqsave(&zwplug->lock, flags);
>  	zwp_zone_no = zwplug->zone_no;
> @@ -1838,10 +1874,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
>  	zwp_ref = refcount_read(&zwplug->ref);
>  	zwp_wp_offset = zwplug->wp_offset;
>  	zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
> +	from_cpu = zwplug->from_cpu;
>  	spin_unlock_irqrestore(&zwplug->lock, flags);
>
> -	seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
> -		   zwp_wp_offset, zwp_bio_list_size);
> +	seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
> +		   zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
> +		   zwp_bio_list_size, from_cpu);
>  }
>
>  int queue_zone_wplugs_show(void *data, struct seq_file *m)

-- 
Damien Le Moal
Western Digital Research