On Mon, Jun 16, 2025 at 05:29:32AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 13, 2025 at 05:23:48AM -0400, Mike Snitzer wrote:
> > Which in practice has proven a hard requirement for O_DIRECT in my
> > testing
>
> What fails if you don't page align the memory?
>
> > But if you look at patch 5 in this series:
> > https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@xxxxxxxxxx/
> >
> > I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
> > ditto of fs/btrfs/direct-io.c:check_direct_IO():
>
> No idea why btrfs still has this, but it's not a general requirement
> from the block layer or other file systems.  You just need to be
> aligned to the dma alignment in the queue limits, which for most NVMe,
> SCSI or ATA devices reports a dword alignment.  Some of the more
> obscure drivers might require more alignment, or just report it due to
> copy and paste.

Yeah, should probably be fixed and the rest of the filesystems audited.

> > What I found is that unless SUNRPC TCP stored the WRITE payload in a
> > page-aligned boundary then iov_iter_alignment() would fail.
>
> iov_iter_alignment would fail, or your check based on it?  The latter
> will fail, but it doesn't check anything that matters :)
>

The latter, the check based on iov_iter_alignment() failed.

I understand your point.  Thankfully I can confirm that dword alignment
is all that is needed on modern hardware, just showing my work:

I retested: a 512K write payload that is aligned to the XFS bdev's
logical_block_size (512b) fails when I skip the iov_iter_alignment()
check at a high level.
Because it fails in fs/iomap/direct-io.c:iomap_dio_bio_iter() with
this check:

        if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
            !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
                return -EINVAL;

Because:

static inline bool bdev_iter_is_aligned(struct block_device *bdev,
                                        struct iov_iter *iter)
{
        return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
                                   bdev_logical_block_size(bdev) - 1);
}

and because bdev_dma_alignment for my particular test bdev is 511 :(

But that's OK... my test bdev is a bad example (archaic VMware vSphere
provided SCSI device): it doesn't reflect expected modern hardware.

But I just slapped together a test pmem block device (memory backed,
using memmap=6G!18G) and it too has dma_alignment=511.

I do have access to a KVM guest with a virtio_scsi root bdev that has
dma_alignment=3.

I also just confirmed that modern NVMe devices on another testbed also
have dma_alignment=3, whew...

I'd like NFSD to be able to know if its bvec is dma-aligned, before
issuing DIO writes to underlying XFS.  AFAIK I can do that simply by
checking the STATX_DIOALIGN provided dio_mem_align...

Thanks,
Mike
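
For reference, the rejection described above can be illustrated with a
small userspace model.  This is only a sketch under stated assumptions,
not the kernel implementation: the model_* helper names are
hypothetical, struct iovec stands in for the kernel's iov_iter, and the
two queue limits (dma_alignment as a mask, logical_block_size in bytes)
are passed in as plain integers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical userspace model (not kernel code): true only if every
 * segment's start address clears addr_mask and its length clears
 * len_mask.
 */
static bool model_iter_is_aligned(const struct iovec *iov, int nr_segs,
                                  unsigned int addr_mask,
                                  unsigned int len_mask)
{
        for (int i = 0; i < nr_segs; i++) {
                if (iov[i].iov_len & len_mask)
                        return false;
                if ((uintptr_t)iov[i].iov_base & addr_mask)
                        return false;
        }
        return true;
}

/*
 * Model of the bdev_iter_is_aligned() check quoted above: the address
 * mask is the device's dma_alignment, the length mask is
 * logical_block_size - 1.
 */
static bool model_bdev_iter_is_aligned(const struct iovec *iov, int nr_segs,
                                       unsigned int dma_alignment,
                                       unsigned int logical_block_size)
{
        return model_iter_is_aligned(iov, nr_segs, dma_alignment,
                                     logical_block_size - 1);
}
```

With a 512K segment that starts on a dword boundary but not on a
512-byte boundary, the model accepts the buffer for dma_alignment=3 and
rejects it for dma_alignment=511, which matches the behaviour reported
for the two classes of test devices.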