Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Apr 15, 2025 at 6:40 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> On 4/15/25 11:59 PM, Joanne Koong wrote:
> > On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Joanne,
> >>
> >> Sorry for the late reply...
> >
> > Hi Jingbo,
> >
> > No worries at all.
> >>
> >>
> >> On 4/11/25 12:11 AM, Joanne Koong wrote:
> >>> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 4/10/25 11:07 PM, Joanne Koong wrote:
> >>>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >>>>>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> Hi Joanne,
> >>>>>>>>
> >>>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
> >>>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
> >>>>>>>>>
> >>>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
> >>>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
> >>>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>>>>>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>>>>>   that needs a memory allocation
> >>>>>>>>> * memory allocation triggers direct reclaim
> >>>>>>>>> * direct reclaim waits on a folio under writeback
> >>>>>>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>>>>>   direct reclaim
> >>>>>>>>>
> >>>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>>>>>>>> the situations described above, FUSE writeback does not need to use
> >>>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>>>>>>>
> >>>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>>>>>>>> and removes the temporary pages + extra copying and the internal rb
> >>>>>>>>> tree.
> >>>>>>>>>
> >>>>>>>>> fio benchmarks --
> >>>>>>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>>>>>
> >>>>>>>>> Setup:
> >>>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>>>>>
> >>>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>>>>>
> >>>>>>>>>         bs =  1k          4k            1M
> >>>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>>>>>> % diff        -3%          23%         45%
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> >>>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
> >>>>>>>>> Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Jingbo,
> >>>>>>>
> >>>>>>> Thanks for sharing your analysis for this.
> >>>>>>>
> >>>>>>>> Overall this patch LGTM.
> >>>>>>>>
> >>>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >>>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
> >>>>>>>
> >>>>>>> I took a look at fi->writectr and fi->queued_writes and my
> >>>>>>> understanding is that we do still need this. For example, for
> >>>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> >>>>>>> prevent concurrent writeback or else the setattr request and the
> >>>>>>> writeback request could race which would result in a mismatch between
> >>>>>>> the file's reported size and the actual data written to disk.
> >>>>>>
> >>>>>> I haven't looked into the truncate routine yet.  I will see it later.
> >>>>>>
> >>>>>>>
> >>>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >>>>>>>> because after removing the temp page, the DIRECT IO routine has already
> >>>>>>>> been waiting for all inflight WRITE requests, see
> >>>>>>>>
> >>>>>>>> # DIRECT read
> >>>>>>>> generic_file_read_iter
> >>>>>>>>   kiocb_write_and_wait
> >>>>>>>>     filemap_write_and_wait_range
> >>>>>>>
> >>>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
> >>>>>>
> >>>>>> # DIRECT read
> >>>>>> fuse_file_read_iter
> >>>>>>   fuse_cache_read_iter
> >>>>>>     generic_file_read_iter
> >>>>>>       kiocb_write_and_wait
> >>>>>>        filemap_write_and_wait_range
> >>>>>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> >>>>>>
> >>>>>
> >>>>> Oh I see, I thought files opened with O_DIRECT automatically call the
> >>>>> .direct_IO handler for reads/writes but you're right, it first goes
> >>>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
> >>>>> only gets invoked through generic_file_read_iter() /
> >>>>> generic_file_direct_write() in mm/filemap.c
> >>>>>
> >>>>> There's two paths for direct io in FUSE:
> >>>>> a) fuse server sets fi->direct_io = true when a file is opened, which
> >>>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> >>>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
> >>>>> the file with O_DIRECT
> >>>>>
> >>>>> We only go through the stack trace you listed above for the b) case.
> >>>>> For the a) case, we'll hit
> >>>>>
> >>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>>>                 return fuse_direct_read_iter(iocb, to);
> >>>>>
> >>>>> and
> >>>>>
> >>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>>>                 return fuse_direct_write_iter(iocb, from);
> >>>>>
> >>>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
> >>>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> >>>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> >>>>> above.
> >>>>>
> >>>>> So for the a) case I think we'd still need the fuse_sync_writes() in
> >>>>> case there's still pending writeback.
> >>>>>
> >>>>> Do you agree with this analysis or am I missing something here?
> >>>>
> >>>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> >>>> call filemap_wait_range() or something similar here.
> >>>>
> >>>
> >>> Agreed. Actually, the more I look at this, the more I think we can
> >>> replace all fuse_sync_writes() and get rid of it entirely.
> >>
> >>
> >> I have seen your latest reply that this cleaning up won't be included in
> >> this series, which is okay.
> >>
> >>
> >>> fuse_sync_writes() is called in:
> >>>
> >>> fuse_fsync():
> >>>         /*
> >>>          * Start writeback against all dirty pages of the inode, then
> >>>          * wait for all outstanding writes, before sending the FSYNC
> >>>          * request.
> >>>          */
> >>>         err = file_write_and_wait_range(file, start, end);
> >>>         if (err)
> >>>                 goto out;
> >>>
> >>>         fuse_sync_writes(inode);
> >>>
> >>>         /*
> >>>          * Due to implementation of fuse writeback
> >>>          * file_write_and_wait_range() does not catch errors.
> >>>          * We have to do this directly after fuse_sync_writes()
> >>>          */
> >>>         err = file_check_and_advance_wb_err(file);
> >>>         if (err)
> >>>                 goto out;
> >>>
> >>>
> >>>       We can get rid of the fuse_sync_writes() and
> >>> file_check_and_advance_wb_err() entirely since now without temp pages,
> >>> the file_write_and_wait_range() call actually ensures that writeback
> >>> is completed
> >>>
> >>>
> >>>
> >>> fuse_writeback_range():
> >>>         static int fuse_writeback_range(struct inode *inode, loff_t
> >>> start, loff_t end)
> >>>         {
> >>>                 int err =
> >>> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
> >>>
> >>>                 if (!err)
> >>>                         fuse_sync_writes(inode);
> >>>
> >>>                 return err;
> >>>         }
> >>>
> >>>
> >>>       We can replace fuse_writeback_range() entirely with
> >>> filemap_write_and_wait_range().
> >>>
> >>>
> >>>
> >>> fuse_direct_io():
> >>>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
> >>>                 res = filemap_write_and_wait_range(mapping, pos, pos +
> >>> count - 1);
> >>>                 if (res) {
> >>>                         fuse_io_free(ia);
> >>>                         return res;
> >>>                 }
> >>>         }
> >>>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> >>> count - 1))) {
> >>>                 if (!write)
> >>>                         inode_lock(inode);
> >>>                 fuse_sync_writes(inode);
> >>>                 if (!write)
> >>>                         inode_unlock(inode);
> >>>         }
> >>>
> >>>
> >>>        I think this can just replaced with
> >>>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
> >>>                         res = filemap_write_and_wait_range(mapping,
> >>> pos, pos + count - 1);
> >>>                         if (res) {
> >>>                                 fuse_io_free(ia);
> >>>                                 return res;
> >>>                         }
> >>>                 }
> >>
> >> Alright. But I would prefer doing this filemap_write_and_wait_range() in
> >> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
> >>
> >>>        since for the !fopen_direct_io case, it will already go through
> >>> filemap_write_and_wait_range(), as you mentioned in your previous
> >>> message. I think this also fixes a bug (?) in the original code - in
> >>> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> >>> still need to write out dirty pages first, which we don't currently
> >>> do.
> >>
> >> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
> >> won't be any page cache at all, right?
> >>
> >
> > Isn't there still a page cache if the file was previously opened
> > without direct io and then the client opens another handle to that
> > file with direct io? In that case, the pages could still be dirty in
> > the page cache and would need to be written back first, no?
>
> Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
> not set by the FUSE server, while it is secondly opened, the flag is set?
>
> Though the behavior of the FUSE daemon is quite confusing in this case,
> it is completely possible in real life.  So yes we'd better add
> filemap_write_and_wait_range() unconditionally in fopen_direct_io case.
>

I think this behavior on the server side is pretty common. From what
I've seen on most servers, the server when handling the open sets
fi->direct_io depending on if the client opens with O_DIRECT, eg

        if (fi->flags & O_DIRECT)
                fi->direct_io = 1;

If a client opens a file without O_DIRECT and then opens the same file
with O_DIRECT, then we run into this case. Though I'm not sure how
common it generally is for clients to do this.

Thanks,
Joanne
>
> --
> Thanks,
> Jingbo





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux