On Tue, 2025-07-22 at 22:01 +0300, Anton Gavriliuk wrote:
> > The only way you can avoid memory copies here is to use RDMA to allow
> > the server to write its replies directly into the correct client read
> > buffers.
>
> I remounted with rdma
>
> [root@23-127-77-6 ~]# mount -t nfs -o proto=rdma,nconnect=16,rsize=4194304,wsize=4194304 192.168.0.7:/mnt /mnt
> [root@23-127-77-6 ~]# mount -v|grep -i rdma
> 192.168.0.7:/mnt on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=rdma,nconnect=16,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.8,local_lock=none,addr=192.168.0.7)
> [root@23-127-77-6 ~]#
>
> and repeated the sequential read.
>
> According to perf top, memcpy is gone:
>
> Samples: 64K of event 'cycles:P', 4000 Hz, Event count (approx.): 22510217633 lost: 0/0 drop: 0/0
>   Overhead  Shared Object  Symbol
>    13,12%   [nfs]          [k] nfs_generic_pg_test
>    11,32%   [nfs]          [k] nfs_page_group_lock
>    10,42%   [nfs]          [k] nfs_clear_request
>     5,41%   [kernel]       [k] gup_fast_pte_range
>     4,11%   [nfs]          [k] nfs_page_group_sync_on_bit
>     3,36%   [nfs]          [k] nfs_page_create
>     3,13%   [nfs]          [k] __nfs_pageio_add_request
>     2,10%   [nfs]          [k] __nfs_find_lock_context
>
> but it didn't improve read bandwidth at all. It is even slightly worse
> compared to proto=tcp.

So that more or less proves that those memcpys were never the root cause
of your performance problem.

I suspect you'll want to look at the server performance. Maybe also look
at the client tunables that limit concurrency, such as the
sunrpc.rdma_slot_table_entries sysctl, or the nfs.max_session_slots
module parameter, etc.

>
> Anton
>
> On Tue, 22 Jul 2025 at 21:43, Trond Myklebust <trondmy@xxxxxxxxxx> wrote:
> >
> > On Tue, 2025-07-22 at 21:10 +0300, Anton Gavriliuk wrote:
> > > Hi
> > >
> > > I am trying to exceed 20 GB/s doing sequential reads from a single
> > > file on the NFS client.
> > >
> > > perf top shows excessive memcpy usage:
> > >
> > > Samples: 237K of event 'cycles:P', 4000 Hz, Event count (approx.): 120872739112 lost: 0/0 drop: 0/0
> > >   Overhead  Shared Object  Symbol
> > >    20,54%   [kernel]       [k] memcpy
> > >     6,52%   [nfs]          [k] nfs_generic_pg_test
> > >     5,12%   [nfs]          [k] nfs_page_group_lock
> > >     4,92%   [kernel]       [k] _copy_to_iter
> > >     4,79%   [kernel]       [k] gro_list_prepare
> > >     2,77%   [nfs]          [k] nfs_clear_request
> > >     2,10%   [nfs]          [k] __nfs_pageio_add_request
> > >     2,07%   [kernel]       [k] check_heap_object
> > >     2,00%   [kernel]       [k] __slab_free
> > >
> > > Can the NFS client be adapted to use zero copy? For example, by
> > > using io_uring zero-copy rx.
> > >
> >
> > The client has no idea in which order the server will return replies
> > to the RPC calls it sends. So no, it can't queue up those reply
> > buffers in advance.
> >
> > The only way you can avoid memory copies here is to use RDMA to allow
> > the server to write its replies directly into the correct client read
> > buffers.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trondmy@xxxxxxxxxx, trond.myklebust@xxxxxxxxxxxxxxx
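
A minimal sketch of how the two concurrency tunables mentioned above can be inspected and raised on the client. The paths assume a typical distro kernel with the xprtrdma and nfs modules loaded, the values 256 and 180 are purely illustrative (defaults and maximums vary by kernel version), and both settings only apply to transports/sessions created after the change, so the export has to be remounted:

  # Current limits on the client
  cat /proc/sys/sunrpc/rdma_slot_table_entries
  cat /sys/module/nfs/parameters/max_session_slots

  # Raise the RPC-over-RDMA slot table for subsequently created transports
  sysctl -w sunrpc.rdma_slot_table_entries=256

  # max_session_slots is an nfs module parameter; persist it, then remount
  echo "options nfs max_session_slots=180" > /etc/modprobe.d/nfs-session-slots.conf

Whether raising either limit helps depends on whether the transfer is actually slot-limited rather than server-limited; the per-transport xprt: counters in /proc/self/mountstats on the client should show whether the existing slot limit is being reached before another full benchmark run is attempted.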