On Tue 26-08-25 10:29:58, Ritesh Harjani wrote: > Keith Busch <kbusch@xxxxxxxxxx> writes: > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote: > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote: > >> > Keith Busch <kbusch@xxxxxxxx> writes: > >> > > > >> > > - EXT4 falls back to buffered io for writes but not for reads. > >> > > >> > ++linux-ext4 to get any historical context behind why the difference of > >> > behaviour in reads v/s writes for EXT4 DIO. > >> > >> Hum, how did you test? Because in the basic testing I did (with vanilla > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be > >> falling back to buffered IO only if the underlying file itself does not > >> support any kind of direct IO. > > > > Simple test case (dio-offset-test.c) below. > > > > I also ran this on vanilla kernel and got these results: > > > > # mkfs.ext4 /dev/vda > > # mount /dev/vda /mnt/ext4/ > > # make dio-offset-test > > # ./dio-offset-test /mnt/ext4/foobar > > write: Success > > read: Invalid argument > > > > I tracked the "write: Success" down to ext4's handling for the "special" > > -ENOTBLK error after ext4_want_directio_fallback() returns "true". > > > > Right. Ext4 has fallback only for dio writes but not for DIO reads... > > buffered > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written) > { > /* must be a directio to fall back to buffered */ > if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != > (IOMAP_WRITE | IOMAP_DIRECT)) > return false; > > ... > } > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw > -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from... > > > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) || > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter)) > return -EINVAL; > > EXT4 then fallsback to buffered-io only for writes, but not for reads. Right. And the fallback for writes was actually inadvertedly "added" by commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That changed the error handling logic. Previously if iomap_dio_bio_iter() returned EINVAL, it got propagated to userspace regardless of what ->iomap_end() returned. After this commit if ->iomap_end() returns error (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of the error returned by iomap_dio_bio_iter(). Now both the old and new behavior make some sense so I won't argue that the new iomap_iter() behavior is wrong. But I think we should change ext4 back to the old behavior of failing unaligned dio writes instead of them falling back to buffered IO. I think something like the attached patch should do the trick - it makes unaligned dio writes fail again while writes to holes of indirect-block mapped files still correctly fall back to buffered IO. Once fstests run completes, I'll do a proper submission... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR
>From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@xxxxxxx> Date: Wed, 27 Aug 2025 14:55:19 +0200 Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances") changed the error handling logic in iomap_iter(). Previously any error from iomap_dio_bio_iter() got propagated to userspace, after this commit if ->iomap_end returns error, it gets propagated to userspace instead of an error from iomap_dio_bio_iter(). This results in unaligned writes to ext4 to silently fallback to buffered IO instead of erroring out. Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems unnecessary these days. It is enough to return ENOTBLK from ext4_iomap_begin() when we don't support DIO write for that particular file offset (due to hole). Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances") Signed-off-by: Jan Kara <jack@xxxxxxx> --- fs/ext4/file.c | 2 -- fs/ext4/inode.c | 35 ----------------------------------- 2 files changed, 37 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 93240e35ee36..cf39f57d21e9 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) iomap_ops = &ext4_iomap_overwrite_ops; ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, dio_flags, NULL, 0); - if (ret == -ENOTBLK) - ret = 0; if (extend) { /* * We always perform extending DIO write synchronously so by diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5b7a15db4953..c3b23c90fd11 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset, return ret; } -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written) -{ - /* must be a directio to fall back to buffered */ - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != - (IOMAP_WRITE | IOMAP_DIRECT)) - return false; - - /* atomic writes are all-or-nothing */ - if (flags & IOMAP_ATOMIC) - return false; - - /* can only try again if we wrote nothing */ - return written == 0; -} - -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length, - ssize_t written, unsigned flags, struct iomap *iomap) -{ - /* - * Check to see whether an error occurred while writing out the data to - * the allocated blocks. If so, return the magic error code for - * non-atomic write so that we fallback to buffered I/O and attempt to - * complete the remainder of the I/O. - * For non-atomic writes, any blocks that may have been - * allocated in preparation for the direct I/O will be reused during - * buffered I/O. For atomic write, we never fallback to buffered-io. - */ - if (ext4_want_directio_fallback(flags, written)) - return -ENOTBLK; - - return 0; -} - const struct iomap_ops ext4_iomap_ops = { .iomap_begin = ext4_iomap_begin, - .iomap_end = ext4_iomap_end, }; const struct iomap_ops ext4_iomap_overwrite_ops = { .iomap_begin = ext4_iomap_overwrite_begin, - .iomap_end = ext4_iomap_end, }; static int ext4_iomap_begin_report(struct inode *inode, loff_t offset, -- 2.43.0