On 8/19/2025 1:16 PM, Chuck Lever wrote:
On 8/19/25 12:58 PM, Tom Talpey wrote:
On 8/19/2025 12:08 PM, Chuck Lever wrote:
On 8/19/25 12:03 PM, Tom Talpey wrote:
On 8/11/2025 4:35 PM, Chuck Lever wrote:
From: Chuck Lever <chuck.lever@xxxxxxxxxx>
Reduce the per-connection footprint in the host's and RNIC's memory
management TLBs by combining groups of a connection's Receive
buffers into fewer IOVAs.
This is an interesting and potentially useful approach. Keeping
the iova count (==1) reduces the size of work requests and greatly
simplifies processing.
But how large are the IOVAs currently? RPCRDMA_DEF_INLINE_THRESH
is just 4096, which would mean typically <= 2 IOVAs. The max is
arbitrarily but consistently 64KB; is this complexity worth it?
The pool's shard size is RPCRDMA_MAX_INLINE_THRESH, or 64KB. That's the
largest inline threshold this implementation allows.
The default inline threshold is 4KB, so one shard can hold up to
sixteen 4KB Receive buffers. The default credit limit is 64, plus 8
for batch overflow, so 72 Receive buffers total per connection.
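To spell the arithmetic out (only the two *_INLINE_THRESH values come
from the implementation; the derived macros and their names below are
just illustrative):

#include <linux/kernel.h>	/* DIV_ROUND_UP */

#define RPCRDMA_DEF_INLINE_THRESH	4096	/* default Receive buffer size */
#define RPCRDMA_MAX_INLINE_THRESH	65536	/* one pool shard */

/* 65536 / 4096 = 16 Receive buffers fit in one shard */
#define RPCRDMA_BUFS_PER_SHARD \
	(RPCRDMA_MAX_INLINE_THRESH / RPCRDMA_DEF_INLINE_THRESH)

/* 64 credits + 8 batch overflow = 72 Receives per connection, so
 * DIV_ROUND_UP(72, 16) = 5 shards, i.e. on the order of 5 IOVAs
 * instead of 72.
 */
#define RPCRDMA_SHARDS_PER_CONN \
	DIV_ROUND_UP(64 + 8, RPCRDMA_BUFS_PER_SHARD)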
And, allocating large contiguous buffers would seem to shift the
burden to kmalloc and/or the IOMMU, so it's not free, right?
Can you elaborate on what you mean by "burden" ?
Sure, it's that somebody has to manage the iova scatter/gather
segments.
Using kmalloc or its moral equivalent offers a contract that the
memory returned is physically contiguous, 1 segment. That's
gonna scale badly.
I'm still not sure what's not going to scale. We're already using
kmalloc today, one per Receive buffer. I'm making it one kmalloc per
shard (which can contain more than a dozen Receive buffers).
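Roughly the shape of it, as a sketch only (the struct and helper names
here are made up for illustration, not what the patch actually uses):

#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* Illustrative: one allocation and one DMA mapping per shard, carved
 * into inline-threshold-sized Receive buffers.
 */
struct rcv_shard {
	void		*base;	/* one kmalloc per shard */
	dma_addr_t	iova;	/* one IOVA covering every buffer in it */
	size_t		size;	/* e.g. RPCRDMA_MAX_INLINE_THRESH */
};

static int rcv_shard_init(struct ib_device *dev, struct rcv_shard *shard,
			  size_t size)
{
	shard->base = kmalloc(size, GFP_KERNEL);
	if (!shard->base)
		return -ENOMEM;

	shard->iova = ib_dma_map_single(dev, shard->base, size,
					DMA_FROM_DEVICE);
	if (ib_dma_mapping_error(dev, shard->iova)) {
		kfree(shard->base);
		return -EIO;
	}
	shard->size = size;
	return 0;
}

/* Each Receive SGE then points at an offset inside the shard's single
 * IOVA, e.g. for buffer n of inline_size bytes:
 *
 *	sge.addr   = shard->iova + n * inline_size;
 *	sge.length = inline_size;
 *	sge.lkey   = pd->local_dma_lkey;
 */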
Sorry - availability of free contiguous memory (pages). Over
time, fragmentation and demand may limit this or at least make
it more costly/precious.
Using the IOMMU, when available, stuffs the s/g list into its
hardware. Simple at the verb layer (again 1 segment) but uses
the shared hardware resource to provide it.
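To illustrate the mechanism (not the patch itself): the DMA API can
already collapse a page list into one bus address when an IOMMU sits
behind it, but every page still consumes IOMMU mapping resources.

#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* Illustrative only; the function name is invented */
static int map_recv_pages(struct ib_device *dev, struct scatterlist *sgl,
			  int nents)
{
	int mapped = ib_dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);

	if (!mapped)
		return -EIO;
	/* mapped may be < nents when the IOMMU merges adjacent entries;
	 * sg_dma_address(sgl) is then the single IOVA the verb layer
	 * sees, but the IOMMU's tables still hold an entry per page.
	 */
	return mapped;
}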
Another approach might be to use fast-register for the receive
buffers, instead of ib_register_mr on the privileged lmr. This
would be a page list with first-byte-offset and length, which
would put it in the adapter's TPT instead of the PCI-facing IOMMU.
The FMRs would be registered only once, unlike the FMRs used for
remote transfers, so the cost would remain low. And FMRs typically
support 16 segments minimum, so no restriction there.
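For reference, the once-per-connection registration could look
something like this with the FRWR verbs (a sketch only; the function
name and surrounding plumbing are invented, error handling minimal):

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* Illustrative: register a Receive buffer's page list once, so the
 * adapter's TPT, rather than the PCI-facing IOMMU, provides the single
 * virtually-contiguous handle.
 */
static struct ib_mr *recv_buf_register(struct ib_pd *pd, struct ib_qp *qp,
				       struct scatterlist *sgl, int nents)
{
	struct ib_reg_wr reg_wr = { };
	struct ib_mr *mr;

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, nents);
	if (IS_ERR(mr))
		return mr;

	if (ib_map_mr_sg(mr, sgl, nents, NULL, PAGE_SIZE) != nents)
		goto out_dereg;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->lkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE;
	if (ib_post_send(qp, &reg_wr.wr, NULL))
		goto out_dereg;

	/* Receive SGEs then carry mr->lkey and offsets within mr->iova,
	 * and the registration stays put for the connection's lifetime.
	 */
	return mr;

out_dereg:
	ib_dereg_mr(mr);
	return ERR_PTR(-EIO);
}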
I can experiment with fast registration. The goal of this work is to
reduce the per-connection hardware footprint.
My point is that it seems unnecessary somehow in the RPCRDMA
layer.
Well, if this effort is intriguing to others, it can certainly be moved
into the RDMA core. I already intend to convert the RPC/RDMA client
Receive code to use it too.
Not sure. The SMBdirect code doesn't need it, because its receive
buffers are much smaller than 4KB. Block protocols probably don't
need it either.
I sort of think this is a special requirement of the RPC/RDMA
protocol, which was architected in part to preserve older XDR code
by hiding RDMA from the RPC consumer, and therefore segments almost
everything. But there may be other reasons to consider this; I need
to ponder that a bit.
But, that's just my intuition. Finding some way to measure
any benefit (performance, setup overhead, scalability, ...) would
certainly be useful.
That is a primary purpose of my posting this RFC. As stated in the patch
description, I would like some help quantifying the improvement (if
there is any).
I'll ponder that too. There are potential benefits at several
layers, this makes things tricky.
Tom.