On 8/18/25 03:39, Gang He wrote:
> Hi Bernd,
>
> Bernd Schubert <bernd@xxxxxxxxxxx> wrote on Sat, 16 Aug 2025 at 04:56:
>>
>> On August 15, 2025 9:45:34 AM GMT+02:00, Gang He <dchg2000@xxxxxxxxx> wrote:
>>> Hi Bernd,
>>>
>>> Sorry for the interruption.
>>> I tested your fuse-over-io_uring patch set with the libfuse null
>>> example; the fuse-over-io_uring mode has better performance than the
>>> default mode with the following fio command:
>>>
>>> fio --direct=1 --filename=/mnt/singfile --rw=read --iodepth=1
>>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
>>> --name=test_fuse1
>>>
>>> But if I increase the fio iodepth option, the fuse-over-io_uring mode
>>> performs worse than the default mode, e.g. with this fio command:
>>>
>>> fio --direct=1 --filename=/mnt/singfile --rw=read --iodepth=4
>>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
>>> --name=test_fuse2
>>>
>>> The test results show that the fuse-over-io_uring mode cannot handle
>>> this case properly. Could you take a look at this issue? Or is this a
>>> design issue?
>>>
>>> I went through the related source code, but I do not understand: does
>>> each fuse_ring_queue thread have only one available ring entry? Would
>>> this design cause the above issue?
>>> The related code is as follows (dev_uring.c):
>>>
>>> 1099
>>> 1100         queue = ring->queues[qid];
>>> 1101         if (!queue) {
>>> 1102                 queue = fuse_uring_create_queue(ring, qid);
>>> 1103                 if (!queue)
>>> 1104                         return err;
>>> 1105         }
>>> 1106
>>> 1107         /*
>>> 1108          * The created queue above does not need to be destructed in
>>> 1109          * case of entry errors below, will be done at ring destruction time.
>>> 1110          */
>>> 1111
>>> 1112         ent = fuse_uring_create_ring_ent(cmd, queue);
>>> 1113         if (IS_ERR(ent))
>>> 1114                 return PTR_ERR(ent);
>>> 1115
>>> 1116         fuse_uring_do_register(ent, cmd, issue_flags);
>>> 1117
>>> 1118         return 0;
>>> 1119 }
>>>
>>>
>>> Thanks
>>> Gang
>>
>>
>> Hi Gang,
>>
>> we are just slowly traveling back with my family from Germany to
>> France - sorry for the delayed responses.
>>
>> Each queue can have up to N ring entries - I think I put in a maximum
>> of 65535.
>>
>> The code you are looking at just adds new entries to per-queue lists.
>>
>> I don't know why a higher fio io-depth results in lower performance. A
>> possible reason is that /dev/fuse requests get distributed to multiple
>> threads, while fuse-io-uring requests might all go to the same
>> thread/ring. I recently posted patches that add request balancing
>> between queues.
> Iodepth > 1 means asynchronous IO, but judging from the code in
> the fuse_uring_commit_fetch() function, this function completes one IO
> request and then fetches the next one. This logic blocks the handling
> of further IO requests while the last request is still being processed
> in this thread. Can each thread accept more IO requests before the
> last request in the thread has been processed? Maybe this is the root
> cause for the fio (iodepth > 1) test case.

Well, there is a missing io-uring kernel feature - io_uring_cmd_done()
can only complete one SQE at a time. There is currently no way to batch
multiple "struct io_uring_cmd" completions. Although I personally doubt
that this is the limit you are running into.
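
To make the per-queue design discussed above concrete, here is a minimal
conceptual sketch of what "each queue can have up to N ring entries,
added to per-queue lists" means. This is not the exact kernel
definition from dev_uring_i.h; the type layout and field names are
simplified assumptions for illustration only.

    /*
     * Conceptual sketch (simplified, not the exact kernel structs):
     * every FUSE_IO_URING_CMD_REGISTER SQE creates one more ring entry
     * on its queue's list, so a queue's depth is not limited to one.
     */
    struct fuse_ring_ent {
            struct list_head        list;  /* linked into a queue list below */
            struct io_uring_cmd     *cmd;  /* the registered SQE */
    };

    struct fuse_ring_queue {
            spinlock_t              lock;
            /* entries waiting to be assigned a fuse request */
            struct list_head        ent_avail_queue;
            /* entries currently handed out to the daemon */
            struct list_head        ent_in_userspace;
    };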
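
The one-completion-per-call constraint mentioned in the last paragraph
can be sketched as below. fuse_uring_complete_ready() is a hypothetical
helper, not a function from dev_uring.c, and the entry fields are the
assumed ones from the sketch above; only io_uring_cmd_done() is the
real kernel API. The point is simply that N committed requests require
N separate completion calls.

    #include <linux/list.h>
    #include <linux/io_uring/cmd.h>

    /*
     * Hypothetical helper: complete a list of finished ring entries.
     * io_uring_cmd_done() finishes exactly one struct io_uring_cmd, so
     * completions cannot be coalesced into a single batched call.
     */
    static void fuse_uring_complete_ready(struct list_head *done,
                                          unsigned int issue_flags)
    {
            struct fuse_ring_ent *ent, *next;

            list_for_each_entry_safe(ent, next, done, list) {
                    list_del_init(&ent->list);
                    /* one call per SQE - no batched variant exists */
                    io_uring_cmd_done(ent->cmd, 0, 0, issue_flags);
            }
    }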