Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned

On 8/28/25 12:22 PM, Jeff Layton wrote:
> On Wed, 2025-08-27 at 15:41 -0400, Mike Snitzer wrote:
>> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
>>> On 8/26/25 2:57 PM, Mike Snitzer wrote:
>>>> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
>>>> DIO-aligned block (on either end of the READ). The expanded READ is
>>>> verified to have proper offset/len (logical_block_size) and
>>>> dma_alignment checking.
>>>>
>>>> A bounce-buffer page (called 'start_extra_page') must be allocated
>>>> and used if/when expanding the misaligned READ requires reading an
>>>> extra partial page at the start of the READ so that it is DIO-aligned.
>>>> Otherwise that extra page at the start will make its way back to the
>>>> NFS client and corruption will occur. The bug was found, and this fix
>>>> of using an extra page verified, using the 'dt' utility:
>>>>   dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
>>>>      iotype=sequential pattern=iot onerr=abort oncerr=abort
>>>> see: https://github.com/RobinTMiller/dt.git
>>>>
>>>> Any misaligned READ that is less than 32K won't be expanded to be
>>>> DIO-aligned (this heuristic just avoids excess work, like allocating
>>>> start_extra_page, for smaller IO that can generally already perform
>>>> well using buffered IO).
>>>>
>>>> Suggested-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>> Suggested-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Signed-off-by: Mike Snitzer <snitzer@xxxxxxxxxx>
>>>> Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>> ---
>>>>  fs/nfsd/vfs.c              | 200 +++++++++++++++++++++++++++++++++++--
>>>>  include/linux/sunrpc/svc.h |   5 +-
>>>>  2 files changed, 194 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>> index c340708fbab4d..64732dc8985d6 100644
>>>> --- a/fs/nfsd/vfs.c
>>>> +++ b/fs/nfsd/vfs.c
>>>> @@ -19,6 +19,7 @@
>>>>  #include <linux/splice.h>
>>>>  #include <linux/falloc.h>
>>>>  #include <linux/fcntl.h>
>>>> +#include <linux/math.h>
>>>>  #include <linux/namei.h>
>>>>  #include <linux/delay.h>
>>>>  #include <linux/fsnotify.h>
>>>> @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>>>  }
>>>>  
>>>> +struct nfsd_read_dio {
>>>> +	loff_t start;
>>>> +	loff_t end;
>>>> +	unsigned long start_extra;
>>>> +	unsigned long end_extra;
>>>> +	struct page *start_extra_page;
>>>> +};
>>>> +
>>>> +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
>>>> +{
>>>> +	memset(read_dio, 0, sizeof(*read_dio));
>>>> +	read_dio->start_extra_page = NULL;
>>>> +}
>>>> +
>>>> +#define NFSD_READ_DIO_MIN_KB (32 << 10)
>>>> +
>>>> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>> +				  struct nfsd_file *nf, loff_t offset,
>>>> +				  unsigned long len, unsigned int base,
>>>> +				  struct nfsd_read_dio *read_dio)
>>>> +{
>>>> +	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
>>>> +	loff_t middle_end, orig_end = offset + len;
>>>> +
>>>> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
>>>> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
>>>> +		      __func__))
>>>> +		return false;
>>>> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
>>>> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
>>>> +		      __func__, dio_blocksize, PAGE_SIZE))
>>>> +		return false;
>>>
>>> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>>
>> I won't die on this hill, I just don't see the risk of these given
>> they are highly unlikely ("famous last words").
>>
>> But if they trigger we should surely be made aware immediately.  Not
>> only if someone happens to have a trace event enabled (which would
>> only happen with further support and engineering involvement to chase
>> "why isn't O_DIRECT being used like NFSD was optionally configured
>> to!?").
>>
> 
> 
> A kernel log message in this case makes sense to me, since it is a
> (minor) administrative issue. WARN_ONCE() is going to throw a big,
> scary stack trace, but that won't be terribly useful. We'll get hit
> with bug reports from it for years.

Agreed, the stack trace isn't very useful information.


> Maybe pr_notice_once() for this? Or, maybe a pr_notice_once, but do it
> on a per-export basis?

Right, I think warning once and then suppressing the warning for all
subsequent occurrences is going to cause a lot of missed problems.
Warning once per export sounds like a step in the right direction.


>>>> +	/* Return early if IO is irreparably misaligned (len smaller than
>>>> +	 * the DIO block size, or base not memory-aligned).
>>>> +	 * Ondisk alignment is implied by the following code, which expands
>>>> +	 * misaligned IO to have a DIO-aligned offset and len.
>>>> +	 */
>>>> +	if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
>>>> +		return false;
>>>> +
>>>> +	init_nfsd_read_dio(read_dio);
>>>> +
>>>> +	read_dio->start = round_down(offset, dio_blocksize);
>>>> +	read_dio->end = round_up(orig_end, dio_blocksize);
>>>> +	read_dio->start_extra = offset - read_dio->start;
>>>> +	read_dio->end_extra = read_dio->end - orig_end;
>>>> +
>>>> +	/*
>>>> +	 * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
>>>> +	 * to be DIO-aligned (this heuristic avoids excess work, like allocating
>>>> +	 * start_extra_page, for smaller IO that can generally already perform
>>>> +	 * well using buffered IO).
>>>> +	 */
>>>> +	if ((read_dio->start_extra || read_dio->end_extra) &&
>>>> +	    (len < NFSD_READ_DIO_MIN_KB)) {
>>>> +		init_nfsd_read_dio(read_dio);
>>>> +		return false;
>>>> +	}
>>>> +
>>>> +	if (read_dio->start_extra) {
>>>> +		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
>>>
>>> This introduces a page allocation where there weren't any before. For
>>> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
>>> like this is not needed on an I/O path.
>>
>> NFSD never supported DIO before. Yes, with this patch there is
>> a single page allocation in the misaligned DIO READ path (if it
>> requires reading extra before the client requested data starts).
>>
>> I tried to succinctly explain the need for the extra page allocation
>> for misaligned DIO READ in this patch's header (in 2nd paragraph
>> of the above header).
>>
>> I cannot see how to read extra data, not requested by the client, into
>> the head of rq_pages without causing serious problems. So that cannot
>> be what you're saying is needed.
>>
>>> So I think the answer to this is that I want you to implement reading
>>> an entire aligned range from the file and then forming the NFS READ
>>> response with only the range of bytes that the client requested, as we
>>> discussed before.
>>
>> That is what I'm doing. But you're taking issue with my implementation
>> that uses a single extra page.
>>
>>> The use of xdr_buf and bvec should make that quite
>>> straightforward.
>>
>> Is your suggestion to, rather than allocate a disjoint single page,
>> borrow the extra page from the end of rq_pages? Just map it into the
>> bvec instead of my extra page?
>>
>>> This should make the aligned and unaligned cases nearly identical and
>>> much less fraught.
>>
>> Regardless of which memory used to read the extra data, I don't see
>> how the care I've taken to read extra but hide that fact from the
>> client can be avoided. So the pre/post misaligned DIO READ code won't
>> change a whole lot. But once I understand your suggestion better
>> (after a clarifying response to this message) hopefully I'll see what
>> you're saying.
>>
>> All said, this patchset is very important to me; I don't want it to
>> miss v6.18. I'm still "in it to win it", but it feels like I do need
>> your or others' help to pull this off.
>>
>> And/or is it possible to accept this initial implementation with
>> mutual understanding that we must revisit your concern about my
>> allocating an extra page for the misaligned DIO READ path?
>>
>>>> +		if (WARN_ONCE(read_dio->start_extra_page == NULL,
>>>> +			      "%s: Unable to allocate start_extra_page\n", __func__)) {
>>>> +			init_nfsd_read_dio(read_dio);
>>>> +			return false;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
>>>> +						 struct nfsd_read_dio *read_dio,
>>>> +						 ssize_t bytes_read,
>>>> +						 unsigned long bytes_expected,
>>>> +						 loff_t *offset,
>>>> +						 unsigned long *rq_bvec_numpages)
>>>> +{
>>>> +	ssize_t host_err = bytes_read;
>>>> +	loff_t v;
>>>> +
>>>> +	if (!read_dio->start_extra && !read_dio->end_extra)
>>>> +		return host_err;
>>>> +
>>>> +	/* If nfsd_analyze_read_dio() allocated a start_extra_page it must
>>>> +	 * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
>>>> +	 */
>>>> +	if (read_dio->start_extra_page) {
>>>> +		__free_page(read_dio->start_extra_page);
>>>> +		*rq_bvec_numpages -= 1;
>>>> +		v = *rq_bvec_numpages;
>>>> +		memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
>>>> +			v * sizeof(struct bio_vec));
>>>> +	}
>>>> +	/* Eliminate any end_extra bytes from the last page */
>>>> +	v = *rq_bvec_numpages;
>>>> +	rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
>>>> +
>>>> +	if (host_err < 0) {
>>>> +		/* The underlying FS will return -EINVAL if a misaligned
>>>> +		 * DIO is attempted, which should not happen here.
>>>> +		 */
>>>> +		WARN_ON_ONCE(host_err == -EINVAL);
>>>> +		return host_err;
>>>> +	}
>>>> +
>>>> +	/* nfsd_analyze_read_dio() may have expanded the start and end,
>>>> +	 * if so adjust returned read size to reflect original extent.
>>>> +	 */
>>>> +	*offset += read_dio->start_extra;
>>>> +	if (likely(host_err >= read_dio->start_extra)) {
>>>> +		host_err -= read_dio->start_extra;
>>>> +		if (host_err > bytes_expected)
>>>> +			host_err = bytes_expected;
>>>> +	} else {
>>>> +		/* Short read that didn't read any of requested data */
>>>> +		host_err = 0;
>>>> +	}
>>>> +
>>>> +	return host_err;
>>>> +}
>>>> +
>>>> +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
>>>> +		unsigned addr_mask, unsigned len_mask)
>>>> +{
>>>> +	const struct bio_vec *bvec = i->bvec;
>>>> +	unsigned skip = i->iov_offset;
>>>> +	size_t size = i->count;
>>>
>>> checkpatch.pl is complaining about the use of "unsigned" rather than
>>> "unsigned int".
>>
>> OK.
>>
>>>> +
>>>> +	if (size & len_mask)
>>>> +		return false;
>>>> +	do {
>>>> +		size_t len = bvec->bv_len;
>>>> +
>>>> +		if (len > size)
>>>> +			len = size;
>>>> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
>>>> +			return false;
>>>> +		bvec++;
>>>> +		size -= len;
>>>> +		skip = 0;
>>>> +	} while (size);
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>>  /**
>>>>   * nfsd_iter_read - Perform a VFS read using an iterator
>>>>   * @rqstp: RPC transaction context
>>>> @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>  		      unsigned int base, u32 *eof)
>>>>  {
>>>>  	struct file *file = nf->nf_file;
>>>> -	unsigned long v, total;
>>>> +	unsigned long v, total, in_count = *count;
>>>> +	struct nfsd_read_dio read_dio;
>>>>  	struct iov_iter iter;
>>>>  	struct kiocb kiocb;
>>>>  	ssize_t host_err;
>>>> @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>  
>>>>  	init_sync_kiocb(&kiocb, file);
>>>>  
>>>> +	v = 0;
>>>> +	total = in_count;
>>>> +
>>>>  	switch (nfsd_io_cache_read) {
>>>>  	case NFSD_IO_DIRECT:
>>>> -		/* Verify ondisk and memory DIO alignment */
>>>> -		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
>>>> -		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
>>>> -		    (base & (nf->nf_dio_mem_align - 1)) == 0)
>>>> -			kiocb.ki_flags = IOCB_DIRECT;
>>>> +		/*
>>>> +		 * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
>>>> +		 * the next DIO-aligned block (on either end of the READ).
>>>> +		 */
>>>> +		if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
>>>> +					  in_count, base, &read_dio)) {
>>>> +			/* trace_nfsd_read_vector() will reflect larger
>>>> +			 * DIO-aligned READ.
>>>> +			 */
>>>> +			offset = read_dio.start;
>>>> +			in_count = read_dio.end - offset;
>>>> +			total = in_count;
>>>> +
>>>> +			kiocb.ki_flags |= IOCB_DIRECT;
>>>> +			if (read_dio.start_extra) {
>>>> +				len = read_dio.start_extra;
>>>> +				bvec_set_page(&rqstp->rq_bvec[v],
>>>> +					      read_dio.start_extra_page,
>>>> +					      len, PAGE_SIZE - len);
>>>> +				total -= len;
>>>> +				++v;
>>>> +			}
>>>> +		}
>>>>  		break;
>>>>  	case NFSD_IO_DONTCACHE:
>>>>  		kiocb.ki_flags = IOCB_DONTCACHE;
>>>> @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>  
>>>>  	kiocb.ki_pos = offset;
>>>>  
>>>> -	v = 0;
>>>> -	total = *count;
>>>>  	while (total) {
>>>>  		len = min_t(size_t, total, PAGE_SIZE - base);
>>>>  		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
>>>> @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>  	}
>>>>  	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>>>>  
>>>> -	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>>>> -	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
>>>> +	trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
>>>> +	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
>>>> +
>>>> +	if ((kiocb.ki_flags & IOCB_DIRECT) &&
>>>> +	    !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
>>>> +					nf->nf_dio_read_offset_align-1))
>>>> +		kiocb.ki_flags &= ~IOCB_DIRECT;
>>>> +
>>>>  	host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
>>>> +
>>>> +	if (in_count != *count) {
>>>> +		/* misaligned DIO expanded read to be DIO-aligned */
>>>> +		host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
>>>> +					host_err, *count, &offset, &v);
>>>> +	}
>>>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>>>  }
>>>>  
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index e64ab444e0a7f..190c2667500e2 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>>   * pages, one for the request, and one for the reply.
>>>>   * nfsd_splice_actor() might need an extra page when a READ payload
>>>>   * is not page-aligned.
>>>> + * nfsd_iter_read() might need two extra pages when a READ payload
>>>> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
>>>> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
>>>>   */
>>>>  static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
>>>>  {
>>>> -	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
>>>> +	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>>>>  }
>>>>  
>>>>  /*
>>>
>>> To properly evaluate the impact of using direct I/O for reads with real
>>> world user workloads, we will want to identify (or construct) some
>>> metrics (and this is future work, but near-term future).
>>>
>>> Seems like allocating memory becomes difficult only when too many pages
>>> are dirty. I am skeptical that the issue is due to read caching, since
>>> clean pages in the page cache are pretty easy to evict quickly, AIUI. If
>>> that's incorrect, I'd like to understand why.
>>
>> The much more problematic case is heavy WRITE workload with a working
>> set that far exceeds system memory.
>>
>> But I agree it doesn't make a whole lot of sense that clean pages in
>> the page cache would be getting in the way.  All I can tell you is
>> that in my experience MM seems to _not_ evict them quickly (but more
>> focused read-only testing is warranted to further understand the
>> dynamics and heuristics in MM and beyond -- especially if/when
>> READ-only then a pivot to a mix of heavy READ and WRITE or
>> WRITE-only).
>>
>> NFSD using DIO is optional. I thought the point was to get it as an
>> available option so that _others_ could experiment and help categorize
>> the benefits/pitfalls further?
>>
>> I cannot be a one man show on all this. I welcome more help from
>> anyone interested.
> 


-- 
Chuck Lever