Re: [RFC PATCH v2] svcrdma: Introduce Receive buffer arenas

On 8/19/25 12:58 PM, Tom Talpey wrote:
> On 8/19/2025 12:08 PM, Chuck Lever wrote:
>> On 8/19/25 12:03 PM, Tom Talpey wrote:
>>> On 8/11/2025 4:35 PM, Chuck Lever wrote:
>>>> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>
>>>> Reduce the per-connection footprint in the host's and RNIC's memory
>>>> management TLBs by combining groups of a connection's Receive
>>>> buffers into fewer IOVAs.
>>>
>>> This is an interesting and potentially useful approach. Keeping
>>> the iova count (==1) reduces the size of work requests and greatly
>>> simplifies processing.
>>>
>>> But how large are the iova's currently? RPCRDMA_DEF_INLINE_THRESH
>>> is just 4096, which would mean typically <= 2 iova's. The max is
>>> arbitrarily but consistently 64KB, is this complexity worth it?
>>
>> The pool's shard size is RPCRDMA_MAX_INLINE_THRESH, or 64KB. That's the
>> largest inline threshold this implementation allows.
>>
>> The default inline threshold is 4KB, so one shard can hold up to
>> sixteen 4KB Receive buffers. The default credit limit is 64, plus 8
>> for batch overflow, so 72 Receive buffers total per connection.
>>
>>
>>> And, allocating large contiguous buffers would seem to shift the
>>> burden to kmalloc and/or the IOMMU, so it's not free, right?
>>
>> Can you elaborate on what you mean by "burden" ?
> 
> Sure, it's that somebody has to manage the iova scatter/gather
> segments.
> 
> Using kmalloc or its moral equivalent offers a contract that the
> memory returned is physically contiguous, 1 segment. That's
> gonna scale badly.

I'm still not sure what's not going to scale. We're already using
kmalloc today, one per Receive buffer. I'm making it one kmalloc per
shard (which can contain more than a dozen Receive buffers).
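
For illustration, here is a minimal sketch of that arrangement: one
kmalloc and one DMA mapping per shard, with every Receive SGE sharing
the same IOVA and lkey and differing only by its offset. The names
below (rq_shard, RQ_SHARD_SIZE, rq_shard_post_recvs) are illustrative,
not the patch's actual ones, and error unwinding is elided:

/*
 * Sketch only.  Arithmetic for the defaults above: a 64KB shard
 * divided into 4KB Receive buffers holds 64KB / 4KB = 16 buffers;
 * 64 credits + 8 for batch overflow = 72 buffers, which fits in
 * 5 shards per connection.
 */
#include <linux/slab.h>
#include <linux/sizes.h>
#include <rdma/ib_verbs.h>

#define RQ_SHARD_SIZE	SZ_64K	/* cf. RPCRDMA_MAX_INLINE_THRESH */
#define RQ_BUF_SIZE	SZ_4K	/* default inline threshold */
#define RQ_BUFS_PER_SHARD	(RQ_SHARD_SIZE / RQ_BUF_SIZE)

struct rq_shard {
	void	*rs_cpu_addr;	/* one contiguous kmalloc'd region */
	u64	rs_iova;	/* one IOVA covering the whole shard */
};

static int rq_shard_post_recvs(struct ib_qp *qp, struct ib_pd *pd,
			       struct rq_shard *shard)
{
	struct ib_device *dev = pd->device;
	unsigned int i;
	int ret;

	shard->rs_cpu_addr = kmalloc(RQ_SHARD_SIZE, GFP_KERNEL);
	if (!shard->rs_cpu_addr)
		return -ENOMEM;

	/* A single mapping: one entry in the host and RNIC TLBs */
	shard->rs_iova = ib_dma_map_single(dev, shard->rs_cpu_addr,
					   RQ_SHARD_SIZE, DMA_FROM_DEVICE);
	if (ib_dma_mapping_error(dev, shard->rs_iova)) {
		kfree(shard->rs_cpu_addr);
		return -EIO;
	}

	for (i = 0; i < RQ_BUFS_PER_SHARD; i++) {
		struct ib_sge sge = {
			.addr	= shard->rs_iova + i * RQ_BUF_SIZE,
			.length	= RQ_BUF_SIZE,
			.lkey	= pd->local_dma_lkey,
		};
		struct ib_recv_wr wr = {
			.sg_list	= &sge,
			.num_sge	= 1,
			/* wr_cqe / completion handling elided */
		};

		ret = ib_post_recv(qp, &wr, NULL);
		if (ret)
			return ret;	/* unwinding elided for brevity */
	}
	return 0;
}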


> Using the IOMMU, when available, stuffs the s/g list into its
> hardware. Simple at the verb layer (again 1 segment) but uses
> the shared hardware resource to provide it.
> 
> Another approach might be to use fast-register for the receive
> buffers, instead of ib_register_mr on the privileged lmr. This
> would be a page list with first-byte-offset and length, which
> would put it in the adapter's TPT instead of the PCI-facing IOMMU.
> The fmr's would be registered only once, unlike the fmr's used for
> remote transfers, so the cost would remain low. And fmr's typically
> support 16 segments minimum, so no restriction there.

I can experiment with fast registration. The goal of this work is to
reduce the per-connection hardware footprint.
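
For reference, here is a rough, untested sketch of that fast-register
alternative as I understand it: each shard's page list is registered
once with IB_WR_REG_MR, so the translation lives in the adapter's TPT,
and Receive SGEs then use the MR's lkey with offsets from mr->iova.
All names below are illustrative:

#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

static struct ib_mr *rq_fastreg_pages(struct ib_qp *qp, struct ib_pd *pd,
				      struct page **pages,
				      unsigned int npages)
{
	struct scatterlist sgl[16];	/* the 16-segment minimum */
	struct ib_reg_wr reg_wr = { };
	struct ib_mr *mr;
	unsigned int i;
	int n, ret;

	if (npages > ARRAY_SIZE(sgl))
		return ERR_PTR(-EINVAL);

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, npages);
	if (IS_ERR(mr))
		return mr;

	sg_init_table(sgl, npages);
	for (i = 0; i < npages; i++)
		sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

	n = ib_dma_map_sg(pd->device, sgl, npages, DMA_FROM_DEVICE);
	if (!n)
		goto out_free;

	/* Load the DMA-mapped page list into the MR's page table */
	ret = ib_map_mr_sg(mr, sgl, n, NULL, PAGE_SIZE);
	if (ret < n)
		goto out_unmap;

	/* One-time registration; unlike MRs for remote transfers,
	 * this one is never invalidated and re-registered. */
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.wr.send_flags = IB_SEND_SIGNALED;
	reg_wr.mr = mr;
	reg_wr.key = mr->lkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE;
	ret = ib_post_send(qp, &reg_wr.wr, NULL);
	if (ret)
		goto out_unmap;

	return mr;	/* caller unmaps and calls ib_dereg_mr at teardown */

out_unmap:
	ib_dma_unmap_sg(pd->device, sgl, npages, DMA_FROM_DEVICE);
out_free:
	ib_dereg_mr(mr);
	return ERR_PTR(-EIO);
}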


> My point is that it seems unnecessary somehow in the RPCRDMA
> layer.

Well, if this effort is intriguing to others, it can certainly be moved
into the RDMA core. I already intend to convert the RPC/RDMA client
Receive code to use it too.


> But, that's just my intuition. Finding some way to measure
> any benefit (performance, setup overhead, scalability, ...) would
> certainly be useful.

That is a primary purpose of my posting this RFC. As stated in the patch
description, I would like some help quantifying the improvement (if
there is any).


-- 
Chuck Lever



