Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree

Joanne Koong <joannelkoong@xxxxxxxxx> · Thu, 10 Apr 2025 08:07:22 -0700

On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Joanne,
> >>
> >> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>> dirty page to be written back, the contents of the dirty page are copied over
> >>> to the temp page, and the temp page gets handed to the server to write back.
> >>>
> >>> This is done so that writeback may be immediately cleared on the dirty page,
> >>> and this in turn is done in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>>   that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>>   direct reclaim
> >>>
> >>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>> the situations described above, FUSE writeback does not need to use
> >>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>
> >>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>> and removes the temporary pages + extra copying and the internal rb
> >>> tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>>         bs =  1k          4k            1M
> >>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>> % diff        -3%          23%         45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>> Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
> >>> Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> >>
> >
> > Hi Jingbo,
> >
> > Thanks for sharing your analysis for this.
> >
> >> Overall this patch LGTM.
> >>
> >> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >> also unneeded then, at least the DIRECT IO routine (i.e.
> >
> > I took a look at fi->writectr and fi->queued_writes and my
> > understanding is that we do still need this. For example, for
> > truncates (I'm looking at fuse_do_setattr()), I think we still need to
> > prevent concurrent writeback or else the setattr request and the
> > writeback request could race which would result in a mismatch between
> > the file's reported size and the actual data written to disk.
>
> I haven't looked into the truncate routine yet.  I will see it later.
>
> >
> >> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >> because after removing the temp page, the DIRECT IO routine has already
> >> been waiting for all inflight WRITE requests, see
> >>
> >> # DIRECT read
> >> generic_file_read_iter
> >>   kiocb_write_and_wait
> >>     filemap_write_and_wait_range
> >
> > Where do you see generic_file_read_iter() getting called for direct io reads?
>
> # DIRECT read
> fuse_file_read_iter
>   fuse_cache_read_iter
>     generic_file_read_iter
>       kiocb_write_and_wait
>        filemap_write_and_wait_range
>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>

Oh I see, I thought files opened with O_DIRECT automatically call the
.direct_IO handler for reads/writes but you're right, it first goes
through .read_iter / .write_iter handlers, and the .direct_IO handler
only gets invoked through generic_file_read_iter() /
generic_file_direct_write() in mm/filemap.c

There's two paths for direct io in FUSE:
a) fuse server sets fi->direct_io = true when a file is opened, which
will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
b) fuse server doesn't set fi->direct_io = true, but the client opens
the file with O_DIRECT

We only go through the stack trace you listed above for the b) case.
For the a) case, we'll hit

        if (ff->open_flags & FOPEN_DIRECT_IO)
                return fuse_direct_read_iter(iocb, to);

and

        if (ff->open_flags & FOPEN_DIRECT_IO)
                return fuse_direct_write_iter(iocb, from);

which will invoke fuse_direct_IO() / fuse_direct_io() without going
through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
above.

So for the a) case I think we'd still need the fuse_sync_writes() in
case there's still pending writeback.

Do you agree with this analysis or am I missing something here?

>
> > Similarly, where do you see generic_file_write_iter() getting called
> > for direct io writes?
>
> # DIRECT read
> fuse_file_write_iter
>   fuse_cache_write_iter
>     generic_file_write_iter
>       generic_file_direct_write
>         kiocb_invalidate_pages
>          filemap_invalidate_pages
>            filemap_write_and_wait_range
>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>
>
> > Where do you see fi->writectr / fi->queued-writes preventing this
> > race?
>
> IMO overall fi->writectr / fi->queued-writes are introduced to prevent
> DIRECT IO and writeback from sending duplicate (inflight) WRITE requests
> for the same page.
>
> For the DIRECT write routine:
>
> # non-FOPEN_DIRECT_IO DIRECT write
> fuse_cache_write_iter
>   fuse_direct_IO
>     fuse_direct_io
>       fuse_sync_writes
>
>
> # FOPEN_DIRECT_IO DIRECT write
> fuse_direct_write_iter
>   fuse_direct_IO
>     fuse_direct_io
>       fuse_sync_writes
>
>
> For the writeback routine:
> fuse_writepages()
>   fuse_writepages_fill
>     fuse_writepages_send
>       # buffer the WRITE request in queued_writes list
>       fuse_flush_writepages
>         # flush WRITE only when fi->writectr >= 0
>
>
>
> > It looks to me like in the existing code, this race condition
> > you described of direct write invalidating the page cache, then
> > another buffer write reads the page cache and dirties it, then
> > writeback is called on that, and the 2 write requests racing, could
> > still happen?
> >
> >
> >> However it seems that the writeback
> >> won't wait for previous inflight DIRECT WRITE requests, so I'm not much
> >> sure about that.  Maybe other folks could offer more insights...
> >
> > My understanding is that these lines
> >
> > if (!cuse && filemap_range_has_writeback(...)) {
> >    ...
> >    fuse_sync_writes(inode);
> >    ...
> > }
> >
> > in fuse_direct_io() is what waits on previous inflight direct write
> > requests to complete before the direct io happens.
>
> Right.
>
> >
> >
> >>
> >> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
> >> which I'm pretty sure.
> >
> > Why don't we still need this for fuse_flush()?
> >
> > If a caller calls close(), this will call
> >
> > filp_close()
> >   filp_flush()
> >       filp->f_op->flush()
> >           fuse_flush()
> >
> > it seems like we should still be waiting for all writebacks to finish
> > before sending the fuse server the fuse_flush request, no?
> >
>
> filp_close()
>    filp_flush()
>        filp->f_op->flush()
>            fuse_flush()
>              write_inode_now
>                 writeback_single_inode(WB_SYNC_ALL)
>                   do_writepages
>                     # flush dirty page
>                   filemap_fdatawait
>                     # wait for WRITE completion

Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
This seems pretty straightforward. I'll remove the fuse_sync_writes()
call in fuse_flush() when I send out v8.

The direct io one above is less straight-forward. I won't add that to
v8 but that can be done in a separate future patch when we figure that
out.

Thanks,
Joanne

>
> >>
> >>> ---
> >>>  fs/fuse/file.c   | 360 ++++-------------------------------------------
> >>>  fs/fuse/fuse_i.h |   3 -
> >>>  2 files changed, 28 insertions(+), 335 deletions(-)
> >>>
> >>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >>> index 754378dd9f71..91ada0208863 100644
> >>> --- a/fs/fuse/file.c
> >>> +++ b/fs/fuse/file.c
> >>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
> >>>
> >>>  struct fuse_writepage_args {
> >>>       struct fuse_io_args ia;
> >>> -     struct rb_node writepages_entry;
> >>>       struct list_head queue_entry;
> >>> -     struct fuse_writepage_args *next;
> >>>       struct inode *inode;
> >>>       struct fuse_sync_bucket *bucket;
> >>>  };
> >>>
> >>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> >>> -                                         pgoff_t idx_from, pgoff_t idx_to)
> >>> -{
> >>> -     struct rb_node *n;
> >>> -
> >>> -     n = fi->writepages.rb_node;
> >>> -
> >>> -     while (n) {
> >>> -             struct fuse_writepage_args *wpa;
> >>> -             pgoff_t curr_index;
> >>> -
> >>
> >> --
> >> Thanks,
> >> Jingbo
>
> --
> Thanks,
> Jingbo