On 8/18/25 03:39, Gang He wrote:
> Hi Bernd,
>
> Bernd Schubert <bernd@xxxxxxxxxxx> wrote on Sat, 16 Aug 2025 at 04:56:
>>
>> On August 15, 2025 9:45:34 AM GMT+02:00, Gang He <dchg2000@xxxxxxxxx> wrote:
>>> Hi Bernd,
>>>
>>> Sorry for the interruption.
>>> I tested your fuse-over-io_uring patch set with the libfuse null
>>> example; the fuse-over-io_uring mode has better performance than the
>>> default mode with the following fio command:
>>>
>>> fio --direct=1 --filename=/mnt/singfile --rw=read --iodepth=1
>>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
>>> --name=test_fuse1
>>>
>>> But if I increase the fio iodepth option, the fuse-over-io_uring mode
>>> performs worse than the default mode, e.g. with this fio command:
>>>
>>> fio --direct=1 --filename=/mnt/singfile --rw=read --iodepth=4
>>> --ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
>>> --name=test_fuse2
>>>
>>> The test results show that the fuse-over-io_uring mode cannot handle
>>> this case properly. Could you take a look at this issue? Or is this a
>>> design issue?
>>>
>>> I went through the related source code, but I do not understand: does
>>> each fuse_ring_queue thread have only one available ring entry? Would
>>> this design cause the above issue?
>>> The related code is as follows (dev_uring.c):
>>>
>>> 1099
>>> 1100         queue = ring->queues[qid];
>>> 1101         if (!queue) {
>>> 1102                 queue = fuse_uring_create_queue(ring, qid);
>>> 1103                 if (!queue)
>>> 1104                         return err;
>>> 1105         }
>>> 1106
>>> 1107         /*
>>> 1108          * The created queue above does not need to be destructed in
>>> 1109          * case of entry errors below, will be done at ring destruction time.
>>> 1110          */
>>> 1111
>>> 1112         ent = fuse_uring_create_ring_ent(cmd, queue);
>>> 1113         if (IS_ERR(ent))
>>> 1114                 return PTR_ERR(ent);
>>> 1115
>>> 1116         fuse_uring_do_register(ent, cmd, issue_flags);
>>> 1117
>>> 1118         return 0;
>>> 1119 }
>>>
>>>
>>> Thanks
>>> Gang
>>
>>
>> Hi Gang,
>>
>> we are just slowly traveling back with my family from Germany to
>> France - sorry for the delayed responses.
>>
>> Each queue can have up to N ring entries - I think I put in a maximum
>> of 65535.
>>
>> The code you are looking at just adds new entries to per-queue lists.
>>
>> I don't know why a higher fio io-depth results in lower performance. A
>> possible reason is that /dev/fuse requests get distributed to multiple
>> threads, while fuse-io-uring requests might all go to the same
>> thread/ring. I recently posted patches that add request balancing
>> between queues.
> Iodepth > 1 means asynchronous IO, but judging from the code in
> the fuse_uring_commit_fetch() function, this function completes one IO
> request and then fetches the next one. This logic blocks the handling
> of further IO requests while the last request is still being processed
> in this thread. Can each thread accept more IO requests before the
> last request in the thread has been processed? Maybe this is the root
> cause for the fio (iodepth > 1) test case.

Well, there is a missing io-uring kernel feature - io_uring_cmd_done()
can only complete one SQE at a time. There is currently no way to batch
multiple "struct io_uring_cmd" completions. Although I personally doubt
that this is the limit you are running into.
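
To make the per-queue design discussed above concrete, here is a minimal
conceptual sketch of what "each queue can have up to N ring entries,
added to per-queue lists" means. This is not the exact kernel
definition from dev_uring_i.h; the type layout and field names are
simplified assumptions for illustration only.

    /*
     * Conceptual sketch (simplified, not the exact kernel structs):
     * every FUSE_IO_URING_CMD_REGISTER SQE creates one more ring entry
     * on its queue's list, so a queue's depth is not limited to one.
     */
    struct fuse_ring_ent {
            struct list_head        list;  /* linked into a queue list below */
            struct io_uring_cmd     *cmd;  /* the registered SQE */
    };

    struct fuse_ring_queue {
            spinlock_t              lock;
            /* entries waiting to be assigned a fuse request */
            struct list_head        ent_avail_queue;
            /* entries currently handed out to the daemon */
            struct list_head        ent_in_userspace;
    };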
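
The one-completion-per-call constraint mentioned in the last paragraph
can be sketched as below. fuse_uring_complete_ready() is a hypothetical
helper, not a function from dev_uring.c, and the entry fields are the
assumed ones from the sketch above; only io_uring_cmd_done() is the
real kernel API. The point is simply that N committed requests require
N separate completion calls.

    #include <linux/list.h>
    #include <linux/io_uring/cmd.h>

    /*
     * Hypothetical helper: complete a list of finished ring entries.
     * io_uring_cmd_done() finishes exactly one struct io_uring_cmd, so
     * completions cannot be coalesced into a single batched call.
     */
    static void fuse_uring_complete_ready(struct list_head *done,
                                          unsigned int issue_flags)
    {
            struct fuse_ring_ent *ent, *next;

            list_for_each_entry_safe(ent, next, done, list) {
                    list_del_init(&ent->list);
                    /* one call per SQE - no batched variant exists */
                    io_uring_cmd_done(ent->cmd, 0, 0, issue_flags);
            }
    }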