[Cc'ing Al and Andrew] On Tue, Jun 17, 2025 at 04:26:42PM -0400, Mike Snitzer wrote: > On Mon, Jun 16, 2025 at 09:37:01PM -0700, Christoph Hellwig wrote: > > On Mon, Jun 16, 2025 at 12:07:42PM -0400, Mike Snitzer wrote: > > > But that's OK... my test bdev is a bad example (archaic VMware vSphere > > > provided SCSI device): it doesn't reflect expected modern hardware. > > > > > > But I just slapped together a test pmem blockdevice (memory backed, > > > using memmap=6G!18G) and it too has dma_alignment=511 > > > > That's the block layer default when not overriden by the driver, I guess > > pmem folks didn't care enough. I suspect it should not have any > > alignment requirements at all. > > Yeah, I hacked it with this just to quickly simulate NVMe's dma_alignment: > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 210fb77f51ba..0ab2826073f9 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -457,6 +457,7 @@ static int pmem_attach_disk(struct device *dev, > .max_hw_sectors = UINT_MAX, > .features = BLK_FEAT_WRITE_CACHE | > BLK_FEAT_SYNCHRONOUS, > + .dma_alignment = 3, > }; > int nid = dev_to_node(dev), fua; > struct resource *res = &nsio->res; > > > > I'd like NFSD to be able to know if its bvec is dma-aligned, before > > > issuing DIO writes to underlying XFS. AFAIK I can do that simply by > > > checking the STATX_DIOALIGN provided dio_mem_align... > > > > Exactly. > > I'm finding that even with dma_alignment=3 the bvec, that > nfsd_vfs_write()'s call to xdr_buf_to_bvec() produces from NFS's WRITE > payload, still causes iov_iter_aligned_bvec() to return false. > > The reason is that iov_iter_aligned_bvec() inspects each member of the > bio_vec in isolation (in its while() loop). So even though NFS WRITE > payload's overall size is aligned on-disk (e.g. offset=0 len=512K) its > first and last bvec members are _not_ aligned (due to 512K NFS WRITE > payload being offset 148 bytes into the first page of the pages > allocated for it by SUNRPC). So iov_iter_aligned_bvec() fails at this > check: > > if (len & len_mask) > return false; > > with tracing I added: > > nfsd-14027 [001] ..... 3734.668780: nfsd_vfs_write: iov_iter_aligned_bvec: addr_mask=3 len_mask=511 > nfsd-14027 [001] ..... 3734.668781: nfsd_vfs_write: iov_iter_aligned_bvec: len=3948 & len_mask=511 failed > > Is this another case of the checks being too strict? The bvec does > describe a contiguous 512K extent of on-disk LBA, just not if > inspected piece-wise. > > BTW, XFS's directio code _will_ also check with > iov_iter_aligned_bvec() via iov_iter_is_aligned(). This works, I just don't know what (if any) breakage it exposes us to: Author: Mike Snitzer <snitzer@xxxxxxxxxx> Date: Tue Jun 17 22:04:44 2025 +0000 Subject: lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec iov_iter_aligned_bvec() is strictly checking alignment of each element of the bvec to arrive at whether the bvec is aligned relative to dma_alignment and on-disk alignment. Checking each element individually results in disallowing a bvec that in aggregate is perfectly aligned relative to the provided @len_mask. Relax the on-disk alignment checking such that it is done on the full extent described by the bvec but still do piecewise checking of the dma_alignment for each bvec's bv_offset. This allows for NFS's WRITE payload to be issued using O_DIRECT as long as the bvec created with xdr_buf_to_bvec() is composed of pages that respect the underlying device's dma_alignment (@addr_mask) and the overall contiguous on-disk extent is aligned relative to the logical_block_size (@len_mask). Signed-off-by: Mike Snitzer <snitzer@xxxxxxxxxx> diff --git a/lib/iov_iter.c b/lib/iov_iter.c index bdb37d572e97..b2ae482b8a1d 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask, unsigned skip = i->iov_offset; size_t size = i->count; + if (size & len_mask) + return false; + do { size_t len = bvec->bv_len; if (len > size) len = size; - if (len & len_mask) - return false; if ((unsigned long)(bvec->bv_offset + skip) & addr_mask) return false;