On Mon, Mar 31, 2025 at 2:48 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Mon, Mar 31, 2025 at 1:43 PM Jaco Kroon <jaco@xxxxxxxxx> wrote: > > > > Hi, > > > > On 2025/03/31 18:41, Joanne Koong wrote: > > > On Fri, Mar 14, 2025 at 3:39 PM Jaco Kroon<jaco@xxxxxxxxx> wrote: > > >> Clamp to min 1 page (4KB) and max 128 pages (512KB). > > >> > > >> Glusterfs trial using strace ls -l. > > >> > > >> Before: > > >> > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 616 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600 > > >> getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 608 > > >> getdents64(3, 0x7f2d7d7a7040 /* 1 entries */, 131072) = 24 > > >> getdents64(3, 0x7f2d7d7a7040 /* 0 entries */, 131072) = 0 > > >> > > >> After: > > >> > > >> getdents64(3, 0x7ffae8eed040 /* 276 entries */, 131072) = 6696 > > >> getdents64(3, 0x7ffae8eed040 /* 0 entries */, 131072) = 0 > > >> > > >> Signed-off-by: Jaco Kroon<jaco@xxxxxxxxx> > > >> --- > > >> fs/fuse/readdir.c | 22 ++++++++++++++++++---- > > >> 1 file changed, 18 insertions(+), 4 deletions(-) > > >> > > >> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c > > >> index 17ce9636a2b1..a0ccbc84b000 100644 > > >> --- a/fs/fuse/readdir.c > > >> +++ b/fs/fuse/readdir.c > > >> @@ -337,11 +337,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx) > > >> struct fuse_mount *fm = get_fuse_mount(inode); > > >> struct fuse_io_args ia = {}; > > >> struct fuse_args_pages *ap = &ia.ap; > > >> - struct fuse_folio_desc desc = { .length = PAGE_SIZE }; > > >> + struct fuse_folio_desc desc = { .length = ctx->count }; > > >> u64 attr_version = 0, evict_ctr = 0; > > >> bool locked; > > >> + int order; > > >> > > >> - folio = folio_alloc(GFP_KERNEL, 0); > > >> + if (desc.length < PAGE_SIZE) > > >> + desc.length = PAGE_SIZE; > > >> + else if (desc.length > (PAGE_SIZE << 7)) /* 128 pages, typically 512KB */ > > >> + desc.length = PAGE_SIZE << 7; > > >> + > > > Just wondering, how did 128 pages get decided as the upper bound? It > > > seems to me to make more sense if the upper bound is fc->max_pages. > > > > Best answer ... random/guess at something which may be sensible. > > > > > Also btw, I think you can just use the clamp() helper from > > > <linux/minmax.h> to do the clamping > > > > Thanks. Not a regular contributor to the kernel, not often that I've > > got an itch that needs scratching here :). > > > > So something like this then: > > > > 345 > > 346 desc.length = clamp(desc.length, PAGE_SIZE, fm->fc->max_pages << > > CONFIG_PAGE_SHIFT); > > 347 order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT); > > 348 > > > > You can just use PAGE_SHIFT here instead of CONFIG_PAGE_SHIFT > > > Note: Can use ctx->count here in clamp directly due to it being signed, > > where desc.length is unsigned. > > > > I'm *assuming* get_count_order will round-up, so if max_pages is 7 (if > > non-power of two is even possible) we will really get 8 pages here? > > Yes, if you have a max_pages of 7, this will round up and return to > you an order of 3 > > > > > Compile tested only. Will perform basic run-time test before re-submit. > > > > >> + order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT); > > >> + > > >> + do { > > >> + folio = folio_alloc(GFP_KERNEL, order); > > > Folios can now be larger than one page size for readdir requests with > > > your change but I don't believe the current page copying code in fuse > > > supports this yet. For example, I think the kmapping will be > > > insufficient in fuse_copy_page() where in the current code we kmap > > > only the first page in the folio. I sent a patch for supporting large > > > folios page copying [1] and am trying to get this merged in but > > > haven't heard back about this patchset yet. In your local tests that > > > used multiple pages for the readdir request, did you run into any > > > issues or it worked fine? > > > > My tests boiled down to running strace as per above, and then some basic > > time trials using find /path/to/mount/point with and without the patch > > over a fairly large structure containing about 170m inodes. No problems > > observed. That said ... I've done similar before, and then introduced a > > major memory leak that under load destroyed 100GB of RAM in minutes. > > Thus why I'm looking for a few eyeballs on this before going to > > production (what we have works, it's just on an older kernel). > > > > If further improvements are possible that would be great, but based on > > testing this is already at least a 10x improvement on readdir() performance. > > > > I think you need the patch I linked to or this could cause crashes. > The patch adds support for copying folios larger than 1 page size in > fuse. Maybe you're not running into the crash because it's going > through splice which will circumvent copying, but in the non-splice > case, I believe the kmap is insufficient when you go to do the actual > copying. IOW, I think that patch is a dependency for this one. > Update: I looked more into this and I was wrong, you don't need that patch as a dependency. In fuse_copy_do(), the number of bytes copied is still done page by page regardless of how large the folio is (eg see ncpy = min(*size, cs->len)). So please ignore my earlier comment about this potentially accessing memory that hasn't been kmapped properly. Thanks, Joanne > Thanks, > Joanne > > > > [1]https://lore.kernel.org/linux-fsdevel/20250123012448.2479372-2-joannelkoong@xxxxxxxxx/ > > Took a quick look, wish I could provide you some feedback but that's > > beyond my kernel skill set to just eyeball. > > > > Looks like you're primarily getting rid of the code that references the > > pages inside the folio's and just operating on the folio's directly? A > > side effect of which (your goal) is to enable larger copies rather than > > small ones? > > > > Thank you, > > Jaco