Hi Bernd,

On Tue, 19 Aug 2025 at 01:31, Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
>
>
>
> On 8/18/25 03:39, Gang He wrote:
> > Hi Bernd,
> >
> > On Sat, 16 Aug 2025 at 04:56, Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> >>
> >> On August 15, 2025 9:45:34 AM GMT+02:00, Gang He <dchg2000@xxxxxxxxx> wrote:
> >>> Hi Bernd,
> >>>
> >>> Sorry for interruption.
> >>> I tested your fuse over io_uring patch set with libfuse null example,
> >>> the fuse over io_uring mode has better performance than the default
> >>> mode. e.g., the fio command is as below,
> >>> fio -direct=1 --filename=/mnt/singfile --rw=read -iodepth=1
> >>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
> >>> -name=test_fuse1
> >>>
> >>> But, if I increased fio iodepth option, the fuse over io_uring mode
> >>> has worse performance than the default mode. e.g., the fio command is
> >>> as below,
> >>> fio -direct=1 --filename=/mnt/singfile --rw=read -iodepth=4
> >>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
> >>> -name=test_fuse2
> >>>
> >>> The test result showed the fuse over io_uring mode cannot handle this
> >>> case properly. could you take a look at this issue? or this is design
> >>> issue?
> >>>
> >>> I went through the related source code, I do not understand each
> >>> fuse_ring_queue thread has only one available ring entry? this design
> >>> will cause the above issue?
> >>> the related code is as follows,
> >>> dev_uring.c
> >>> 1099
> >>> 1100         queue = ring->queues[qid];
> >>> 1101         if (!queue) {
> >>> 1102                 queue = fuse_uring_create_queue(ring, qid);
> >>> 1103                 if (!queue)
> >>> 1104                         return err;
> >>> 1105         }
> >>> 1106
> >>> 1107         /*
> >>> 1108          * The created queue above does not need to be destructed in
> >>> 1109          * case of entry errors below, will be done at ring destruction time.
> >>> 1110          */
> >>> 1111
> >>> 1112         ent = fuse_uring_create_ring_ent(cmd, queue);
> >>> 1113         if (IS_ERR(ent))
> >>> 1114                 return PTR_ERR(ent);
> >>> 1115
> >>> 1116         fuse_uring_do_register(ent, cmd, issue_flags);
> >>> 1117
> >>> 1118         return 0;
> >>> 1119 }
> >>>
> >>>
> >>> Thanks
> >>> Gang
> >>
> >>
> >> Hi Gang,
> >>
> >> we are just slowly traveling back with my family from Germany to France - sorry for delayed responses.
> >>
> >> Each queue can have up to N ring entries - I think I put in max 65535.
> >>
> >> The code you are looking at will just add new entries to per queue lists.
> >>
> >> I don't know why higher fio io-depth results in lower performance. A possible reason is that /dev/fuse request get distributed to multiple threads, while fuse-io-uring might all go the same thread/ring. I had posted patches recently that add request balancing between queues.
> > Io-depth > 1 case means asynchronous IO implementation, but from the
> > code in the fuse_uring_commit_fetch() function, this function
> > completes one IO request, then fetches the next request. This logic
> > will block handling more IO requests before the last request is being
> > processed in this thread. Can each thread accept more IO requests
> > before the last request in the thread is being processed? Maybe this
> > is the root cause for fio (iodepth>1) test case.
>
>
> Well, there is a missing io-uring kernel feature - io_uring_cmd_done()
> can only complete one SQE at a time. There is no way right now
> to to batch multiple "struct io_uring_cmd". Although I personally
> doubt that this is the limit you are running into.
OK, I see the design background now. But I would like to know whether we can start handling the next fuse request immediately after io_uring_cmd_done() is called in kernel space, rather than in fuse_uring_commit_fetch(). I am not sure whether that suggestion fits the io_uring mechanism, but from a coding point of view we should pick up the next fuse request as soon as possible.

Second, if we cannot change this design, we could dispatch an incoming fuse request to another queue while the current queue is still busy handling one, although that change may cost performance in the NUMA case for a single-thread fio test. A rough sketch of what I mean by the second idea is appended below my sign-off.

Thanks
Gang
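P.S. To make the second idea a bit more concrete, below is a rough, untested sketch in plain C of the kind of queue fallback I have in mind. The struct fallback_ring / struct fallback_queue types and the pick_queue() helper are invented purely for illustration; they do not match the real structures in fs/fuse/dev_uring.c.

#include <stddef.h>

/* Purely illustrative types, not the real fuse-over-io-uring structures. */
struct fallback_queue {
        unsigned int qid;
        unsigned int nr_idle_entries;   /* ring entries sitting idle in "fetch" state */
};

struct fallback_ring {
        unsigned int nr_queues;
        struct fallback_queue *queues;
};

/*
 * Prefer the local (per-core) queue; if it has no idle entry, walk the
 * other queues and take the first one that does.  Returns NULL when every
 * queue is busy, in which case the caller would have to wait (or fall
 * back to the /dev/fuse path).
 */
static struct fallback_queue *pick_queue(struct fallback_ring *ring,
                                         unsigned int local_qid)
{
        unsigned int i, qid;

        for (i = 0; i < ring->nr_queues; i++) {
                qid = (local_qid + i) % ring->nr_queues;
                if (ring->queues[qid].nr_idle_entries > 0)
                        return &ring->queues[qid];
        }
        return NULL;
}

For the NUMA concern, the search could be restricted to queues on the same node as local_qid first and only then cross nodes, so the single-thread case keeps its locality.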