Re: [PATCHv3 0/8] direct-io: even more flexible io vectors

Jan Kara <jack@xxxxxxx> · Wed, 27 Aug 2025 17:20:53 +0200

On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> Keith Busch <kbusch@xxxxxxxxxx> writes:
> 
> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> >> > Keith Busch <kbusch@xxxxxxxx> writes:
> >> > >
> >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> >> > 
> >> > ++linux-ext4 to get any historical context behind why the difference of
> >> > behaviour in reads v/s writes for EXT4 DIO. 
> >> 
> >> Hum, how did you test? Because in the basic testing I did (with vanilla
> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> >> falling back to buffered IO only if the underlying file itself does not
> >> support any kind of direct IO.
> >
> > Simple test case (dio-offset-test.c) below.
> >
> > I also ran this on vanilla kernel and got these results:
> >
> >   # mkfs.ext4 /dev/vda
> >   # mount /dev/vda /mnt/ext4/
> >   # make dio-offset-test
> >   # ./dio-offset-test /mnt/ext4/foobar
> >   write: Success
> >   read: Invalid argument
> >
> > I tracked the "write: Success" down to ext4's handling for the "special"
> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> >
> 
> Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> 
> buffered
> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> {
> 	/* must be a directio to fall back to buffered */
> 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> 		    (IOMAP_WRITE | IOMAP_DIRECT))
> 		return false;
> 
>     ...
> }
> 
> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> 
> 
> 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> 		return -EINVAL;
> 
> EXT4 then fallsback to buffered-io only for writes, but not for reads. 

Right. And the fallback for writes was actually inadvertedly "added" by
commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
changed the error handling logic. Previously if iomap_dio_bio_iter()
returned EINVAL, it got propagated to userspace regardless of what
->iomap_end() returned. After this commit if ->iomap_end() returns error
(which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
the error returned by iomap_dio_bio_iter().

Now both the old and new behavior make some sense so I won't argue that the
new iomap_iter() behavior is wrong. But I think we should change ext4 back
to the old behavior of failing unaligned dio writes instead of them falling
back to buffered IO. I think something like the attached patch should do
the trick - it makes unaligned dio writes fail again while writes to holes
of indirect-block mapped files still correctly fall back to buffered IO.
Once fstests run completes, I'll do a proper submission...


								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
>From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Wed, 27 Aug 2025 14:55:19 +0200
Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL

Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
changed the error handling logic in iomap_iter(). Previously any error
from iomap_dio_bio_iter() got propagated to userspace, after this commit
if ->iomap_end returns error, it gets propagated to userspace instead of
an error from iomap_dio_bio_iter(). This results in unaligned writes to
ext4 to silently fallback to buffered IO instead of erroring out.

Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
unnecessary these days. It is enough to return ENOTBLK from
ext4_iomap_begin() when we don't support DIO write for that particular
file offset (due to hole).

Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/ext4/file.c  |  2 --
 fs/ext4/inode.c | 35 -----------------------------------
 2 files changed, 37 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 93240e35ee36..cf39f57d21e9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
 			   dio_flags, NULL, 0);
-	if (ret == -ENOTBLK)
-		ret = 0;
 	if (extend) {
 		/*
 		 * We always perform extending DIO write synchronously so by
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b7a15db4953..c3b23c90fd11 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
-{
-	/* must be a directio to fall back to buffered */
-	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
-		    (IOMAP_WRITE | IOMAP_DIRECT))
-		return false;
-
-	/* atomic writes are all-or-nothing */
-	if (flags & IOMAP_ATOMIC)
-		return false;
-
-	/* can only try again if we wrote nothing */
-	return written == 0;
-}
-
-static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
-			  ssize_t written, unsigned flags, struct iomap *iomap)
-{
-	/*
-	 * Check to see whether an error occurred while writing out the data to
-	 * the allocated blocks. If so, return the magic error code for
-	 * non-atomic write so that we fallback to buffered I/O and attempt to
-	 * complete the remainder of the I/O.
-	 * For non-atomic writes, any blocks that may have been
-	 * allocated in preparation for the direct I/O will be reused during
-	 * buffered I/O. For atomic write, we never fallback to buffered-io.
-	 */
-	if (ext4_want_directio_fallback(flags, written))
-		return -ENOTBLK;
-
-	return 0;
-}
-
 const struct iomap_ops ext4_iomap_ops = {
 	.iomap_begin		= ext4_iomap_begin,
-	.iomap_end		= ext4_iomap_end,
 };
 
 const struct iomap_ops ext4_iomap_overwrite_ops = {
 	.iomap_begin		= ext4_iomap_overwrite_begin,
-	.iomap_end		= ext4_iomap_end,
 };
 
 static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
-- 
2.43.0