On 4/1/25 02:22, Joanne Koong wrote:
> On Tue, Mar 18, 2025 at 4:16 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>
>> On Tue, Mar 18, 2025 at 3:33 AM Bernd Schubert
>> <bernd.schubert@xxxxxxxxxxx> wrote:
>>>
>>> On 3/18/25 01:55, Joanne Koong wrote:
>>>> Hi Bernd,
>>>>
>>>> Thanks for the quick turnaround on the review!
>>>>
>>>> On Fri, Mar 14, 2025 at 4:11 PM Bernd Schubert
>>>> <bernd.schubert@xxxxxxxxxxx> wrote:
>>>>>
>>>>> Thanks Joanne! That is rather close to what I wanted to add,
>>>>> just a few comments.
>>>>>
>>>>> On 3/14/25 21:44, Joanne Koong wrote:
>>>>>> In the current uring design, the number of queues is equal to the
>>>>>> number of cores on the system. However, on high-scale machines with
>>>>>> hundreds of cores, such a high number of queues is often overkill
>>>>>> and resource-intensive. Additionally, in the current design, where
>>>>>> the queue for a request is set to the cpu the task is currently
>>>>>> executing on (see fuse_uring_task_to_queue()), there is no guarantee
>>>>>> that requests for the same file will be sent to the same queue (eg
>>>>>> if a task is preempted and moved to a different cpu), which may be
>>>>>> problematic for some servers (eg if the server is append-only and
>>>>>> does not support unordered writes).
>>>>>>
>>>>>> In this commit, the server can configure the number of uring queues
>>>>>> (passed to the kernel through the init reply). The number of queues
>>>>>> must be a power of two, in order to make queue assignment for a
>>>>>> request efficient. If the server specifies a non-power of two, it
>>>>>> will be automatically rounded down to the nearest power of two. If
>>>>>> the server does not specify the number of queues, this defaults to
>>>>>> the current behavior, where the number of queues equals the number
>>>>>> of cores, with core and numa affinity. The queue id is computed as a
>>>>>> hash of the nodeid, which ensures that requests for the same file
>>>>>> will be forwarded to the same queue.
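
As an aside for readers following the thread: the rounding-down
described above is not visible in the hunks quoted below (presumably it
sits in the inode.c init-reply handling this patch also touches). A
minimal, untested sketch of what that clamping could look like -
fuse_uring_sanitize_nr_queues() is an invented name:

/* needs <linux/cpumask.h> and <linux/log2.h> */
static unsigned int fuse_uring_sanitize_nr_queues(unsigned int requested)
{
	unsigned int nr_cpus = num_possible_cpus();

	/* 0 (unset) or too many queues: fall back to one queue per core */
	if (!requested || requested > nr_cpus)
		return nr_cpus;

	/*
	 * Round non-powers of two down so that ilog2(nr_queues) hash bits
	 * always map to a valid qid.
	 */
	return rounddown_pow_of_two(requested);
}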
>>>>>>
>>>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
>>>>>> ---
>>>>>>  fs/fuse/dev_uring.c       | 48 +++++++++++++++++++++++++++++++++++----
>>>>>>  fs/fuse/dev_uring_i.h     | 11 +++++++++
>>>>>>  fs/fuse/fuse_i.h          |  1 +
>>>>>>  fs/fuse/inode.c           |  4 +++-
>>>>>>  include/uapi/linux/fuse.h |  6 ++++-
>>>>>>  5 files changed, 63 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>>>> index 64f1ae308dc4..f173f9e451ac 100644
>>>>>> --- a/fs/fuse/dev_uring.c
>>>>>> +++ b/fs/fuse/dev_uring.c
>>>>>> @@ -209,9 +209,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
>>>>>>  static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>>>>>>  {
>>>>>>  	struct fuse_ring *ring;
>>>>>> -	size_t nr_queues = num_possible_cpus();
>>>>>> +	size_t nr_queues = fc->uring_nr_queues;
>>>>>>  	struct fuse_ring *res = NULL;
>>>>>>  	size_t max_payload_size;
>>>>>> +	unsigned int nr_cpus = num_possible_cpus();
>>>>>>
>>>>>>  	ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL_ACCOUNT);
>>>>>>  	if (!ring)
>>>>>> @@ -237,6 +238,13 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>>>>>>
>>>>>>  	fc->ring = ring;
>>>>>>  	ring->nr_queues = nr_queues;
>>>>>> +	if (nr_queues == nr_cpus) {
>>>>>> +		ring->core_affinity = 1;
>>>>>> +	} else {
>>>>>> +		WARN_ON(!nr_queues || nr_queues > nr_cpus ||
>>>>>> +			!is_power_of_2(nr_queues));
>>>>>> +		ring->qid_hash_bits = ilog2(nr_queues);
>>>>>> +	}
>>>>>>  	ring->fc = fc;
>>>>>>  	ring->max_payload_sz = max_payload_size;
>>>>>>  	atomic_set(&ring->queue_refs, 0);
>>>>>> @@ -1217,12 +1225,24 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>>>  	fuse_uring_send(ent, cmd, err, issue_flags);
>>>>>>  }
>>>>>>
>>>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>>>> +static unsigned int hash_qid(struct fuse_ring *ring, u64 nodeid)
>>>>>> +{
>>>>>> +	if (ring->nr_queues == 1)
>>>>>> +		return 0;
>>>>>> +
>>>>>> +	return hash_long(nodeid, ring->qid_hash_bits);
>>>>>> +}
>>>>>> +
>>>>>> +static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring,
>>>>>> +							struct fuse_req *req)
>>>>>>  {
>>>>>>  	unsigned int qid;
>>>>>>  	struct fuse_ring_queue *queue;
>>>>>>
>>>>>> -	qid = task_cpu(current);
>>>>>> +	if (ring->core_affinity)
>>>>>> +		qid = task_cpu(current);
>>>>>> +	else
>>>>>> +		qid = hash_qid(ring, req->in.h.nodeid);
>>>>>
>>>>> I think we need to handle numa affinity.
>>>>>
>>>>
>>>> Could you elaborate more on this? I'm not too familiar with how to
>>>> enforce this in practice. As I understand it, the main goal of numa
>>>> affinity is to make sure processes access memory that's physically
>>>> closer to the CPU they're executing on. How does this usually get
>>>> enforced at the kernel level?
>>>
>>> The request comes in on a specific core, and that core is on a numa
>>> node - we should try to avoid switching nodes. If there is no queue
>>> for the current core, we should try to stay on the same numa node.
>>> And we should probably also consider the number of waiting requests
>>> per queue and distribute based on that, although that is a bit of an
>>> independent concern.
>>>
>>
>> In that case, there's no guarantee that requests on the same file
>> will get sent to the same queue. But thinking more about this, maybe
>> it doesn't matter after all if they're sent to different queues; I
>> need to think some more about it. But I agree: if we don't care
>> about requests for the same inode getting routed to the same queue,
>> then we should aim for numa affinity. I'll look more into this.
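
For what it's worth, the core-then-node fallback described above could
look roughly like the untested sketch below. fuse_uring_numa_queue() is
an invented name, and it assumes the current one-queue-per-core layout
(qid == cpu) with a ring->queues[] pointer array:

/* needs <linux/sched.h>, <linux/cpumask.h>, <linux/topology.h> */
static struct fuse_ring_queue *fuse_uring_numa_queue(struct fuse_ring *ring)
{
	unsigned int cpu = task_cpu(current);
	struct fuse_ring_queue *queue;
	int qid;

	/* Best case: the current core has its own queue */
	queue = READ_ONCE(ring->queues[cpu]);
	if (queue)
		return queue;

	/* Otherwise, prefer a queue on a core in the same numa node */
	for_each_cpu(qid, cpumask_of_node(cpu_to_node(cpu))) {
		queue = READ_ONCE(ring->queues[qid]);
		if (queue)
			return queue;
	}

	return NULL;	/* caller falls back to any available queue */
}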
>>
>
> Thought about this some more... is this even worth doing? AFAICT,
> there's no guarantee that CPUs are distributed evenly across numa
> nodes. For example, one numa node may have CPUs 0 to 5 on it, while
> another numa node might have only CPUs 6 and 7. If there are two
> queues, each associated with a numa node, then requests will be
> allocated disproportionately: most of the workload will be queued on
> the numa node with CPUs 0 to 5. Moreover, I don't think there's a good
> way to enforce this when the number of queues is less than the number
> of numa nodes - for example, 3 numa nodes with, say, 3 CPUs each and
> only 2 queues. The logic for which cpu gets sent to which queue gets a
> little messy there.
>
> imo, this is an optimization that could be added in the future if the
> need for it comes up. WDYT?

I will eventually get to this this week. My plan is to use queue
lengths for distribution - we should do that for background requests
anyway. I.e. first try the local core; if its queue length is too large
or there is no queue, try queues within the same numa domain; if all of
those are busy, check whether foreign queues are more suitable (rough
sketch below).

Thanks,
Bernd
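
P.S. A very rough, untested sketch of that selection order. queue_len()
and FUSE_QUEUE_BUSY_LEN are invented names for the per-queue count of
waiting requests and the busy threshold, and this again assumes the
one-queue-per-core layout:

static struct fuse_ring_queue *fuse_uring_best_queue(struct fuse_ring *ring)
{
	unsigned int cpu = task_cpu(current);
	struct fuse_ring_queue *queue, *best = NULL;
	int qid;

	/* 1. The local core, unless its queue is missing or too long */
	queue = READ_ONCE(ring->queues[cpu]);
	if (queue && queue_len(queue) < FUSE_QUEUE_BUSY_LEN)
		return queue;

	/* 2. The shortest queue within the same numa domain */
	for_each_cpu(qid, cpumask_of_node(cpu_to_node(cpu))) {
		queue = READ_ONCE(ring->queues[qid]);
		if (queue && (!best || queue_len(queue) < queue_len(best)))
			best = queue;
	}
	if (best && queue_len(best) < FUSE_QUEUE_BUSY_LEN)
		return best;

	/* 3. Everything local is busy - consider foreign queues too */
	for (qid = 0; qid < ring->nr_queues; qid++) {
		queue = READ_ONCE(ring->queues[qid]);
		if (queue && (!best || queue_len(queue) < queue_len(best)))
			best = queue;
	}
	return best;
}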