Hi Bernd,

On Tue, 19 Aug 2025 at 01:31, Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
>
>
>
> On 8/18/25 03:39, Gang He wrote:
> > Hi Bernd,
> >
> > On Sat, 16 Aug 2025 at 04:56, Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> >>
> >> On August 15, 2025 9:45:34 AM GMT+02:00, Gang He <dchg2000@xxxxxxxxx> wrote:
> >>> Hi Bernd,
> >>>
> >>> Sorry for interruption.
> >>> I tested your fuse over io_uring patch set with libfuse null example,
> >>> the fuse over io_uring mode has better performance than the default
> >>> mode. e.g., the fio command is as below,
> >>> fio -direct=1 --filename=/mnt/singfile --rw=read -iodepth=1
> >>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
> >>> -name=test_fuse1
> >>>
> >>> But, if I increased fio iodepth option, the fuse over io_uring mode
> >>> has worse performance than the default mode. e.g., the fio command is
> >>> as below,
> >>> fio -direct=1 --filename=/mnt/singfile --rw=read -iodepth=4
> >>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
> >>> -name=test_fuse2
> >>>
> >>> The test result showed the fuse over io_uring mode cannot handle this
> >>> case properly. could you take a look at this issue? or this is design
> >>> issue?
> >>>
> >>> I went through the related source code, I do not understand each
> >>> fuse_ring_queue thread has only one available ring entry? this design
> >>> will cause the above issue?
> >>> the related code is as follows,
> >>> dev_uring.c
> >>> 1099
> >>> 1100         queue = ring->queues[qid];
> >>> 1101         if (!queue) {
> >>> 1102                 queue = fuse_uring_create_queue(ring, qid);
> >>> 1103                 if (!queue)
> >>> 1104                         return err;
> >>> 1105         }
> >>> 1106
> >>> 1107         /*
> >>> 1108          * The created queue above does not need to be destructed in
> >>> 1109          * case of entry errors below, will be done at ring destruction time.
> >>> 1110          */
> >>> 1111
> >>> 1112         ent = fuse_uring_create_ring_ent(cmd, queue);
> >>> 1113         if (IS_ERR(ent))
> >>> 1114                 return PTR_ERR(ent);
> >>> 1115
> >>> 1116         fuse_uring_do_register(ent, cmd, issue_flags);
> >>> 1117
> >>> 1118         return 0;
> >>> 1119 }
> >>>
> >>>
> >>> Thanks
> >>> Gang
> >>
> >>
> >> Hi Gang,
> >>
> >> we are just slowly traveling back with my family from Germany to France - sorry for delayed responses.
> >>
> >> Each queue can have up to N ring entries - I think I put in max 65535.
> >>
> >> The code you are looking at will just add new entries to per queue lists.
> >>
> >> I don't know why higher fio io-depth results in lower performance. A possible reason is that /dev/fuse request get distributed to multiple threads, while fuse-io-uring might all go the same thread/ring. I had posted patches recently that add request balancing between queues.
> > Io-depth > 1 case means asynchronous IO implementation, but from the
> > code in the fuse_uring_commit_fetch() function, this function
> > completes one IO request, then fetches the next request. This logic
> > will block handling more IO requests before the last request is being
> > processed in this thread. Can each thread accept more IO requests
> > before the last request in the thread is being processed? Maybe this
> > is the root cause for fio (iodepth>1) test case.
>
>
> Well, there is a missing io-uring kernel feature - io_uring_cmd_done()
> can only complete one SQE at a time. There is no way right now
> to to batch multiple "struct io_uring_cmd". Although I personally
> doubt that this is the limit you are running into.
OK, I see the design background now. But I would like to know whether we can start handling the next fuse request immediately after io_uring_cmd_done() is called in kernel space, rather than in fuse_uring_commit_fetch(). I am not sure whether that suggestion fits the io_uring mechanism, but from a coding point of view we should pick up the next fuse request as soon as possible.

Second, if we cannot change this design, we could dispatch an incoming fuse request to another queue while the current queue is still busy handling one, although that change may cost performance in the NUMA case for a single-thread fio test. A rough sketch of what I mean by the second idea is appended below my sign-off.

Thanks
Gang
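P.S. To make the second idea a bit more concrete, below is a rough, untested sketch in plain C of the kind of queue fallback I have in mind. The struct fallback_ring / struct fallback_queue types and the pick_queue() helper are invented purely for illustration; they do not match the real structures in fs/fuse/dev_uring.c.

#include <stddef.h>

/* Purely illustrative types, not the real fuse-over-io-uring structures. */
struct fallback_queue {
        unsigned int qid;
        unsigned int nr_idle_entries;   /* ring entries sitting idle in "fetch" state */
};

struct fallback_ring {
        unsigned int nr_queues;
        struct fallback_queue *queues;
};

/*
 * Prefer the local (per-core) queue; if it has no idle entry, walk the
 * other queues and take the first one that does.  Returns NULL when every
 * queue is busy, in which case the caller would have to wait (or fall
 * back to the /dev/fuse path).
 */
static struct fallback_queue *pick_queue(struct fallback_ring *ring,
                                         unsigned int local_qid)
{
        unsigned int i, qid;

        for (i = 0; i < ring->nr_queues; i++) {
                qid = (local_qid + i) % ring->nr_queues;
                if (ring->queues[qid].nr_idle_entries > 0)
                        return &ring->queues[qid];
        }
        return NULL;
}

For the NUMA concern, the search could be restricted to queues on the same node as local_qid first and only then cross nodes, so the single-thread case keeps its locality.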