On Thu, Apr 10, 2025 at 9:11 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > > > On 4/10/25 11:07 PM, Joanne Koong wrote: > > > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > >> > > >> > > >> > > >> On 4/10/25 7:47 AM, Joanne Koong wrote: > > >>> On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > >>>> > > >>>> Hi Joanne, > > >>>> > > >>>> On 4/5/25 2:14 AM, Joanne Koong wrote: > > >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa > > >>>>> ("fuse: support writable mmap")), a temp page is allocated for every > > >>>>> dirty page to be written back, the contents of the dirty page are copied over > > >>>>> to the temp page, and the temp page gets handed to the server to write back. > > >>>>> > > >>>>> This is done so that writeback may be immediately cleared on the dirty page, > > >>>>> and this in turn is done in order to mitigate the following deadlock scenario > > >>>>> that may arise if reclaim waits on writeback on the dirty page to complete: > > >>>>> * single-threaded FUSE server is in the middle of handling a request > > >>>>> that needs a memory allocation > > >>>>> * memory allocation triggers direct reclaim > > >>>>> * direct reclaim waits on a folio under writeback > > >>>>> * the FUSE server can't write back the folio since it's stuck in > > >>>>> direct reclaim > > >>>>> > > >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates > > >>>>> the situations described above, FUSE writeback does not need to use > > >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. > > >>>>> > > >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > > >>>>> and removes the temporary pages + extra copying and the internal rb > > >>>>> tree. > > >>>>> > > >>>>> fio benchmarks -- > > >>>>> (using averages observed from 10 runs, throwing away outliers) > > >>>>> > > >>>>> Setup: > > >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount > > >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount > > >>>>> > > >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G > > >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount > > >>>>> > > >>>>> bs = 1k 4k 1M > > >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > > >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > > >>>>> % diff -3% 23% 45% > > >>>>> > > >>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > >>>>> Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> > > >>>>> Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx> > > >>>> > > >>> > > >>> Hi Jingbo, > > >>> > > >>> Thanks for sharing your analysis for this. > > >>> > > >>>> Overall this patch LGTM. > > >>>> > > >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is > > >>>> also unneeded then, at least the DIRECT IO routine (i.e. > > >>> > > >>> I took a look at fi->writectr and fi->queued_writes and my > > >>> understanding is that we do still need this. For example, for > > >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to > > >>> prevent concurrent writeback or else the setattr request and the > > >>> writeback request could race which would result in a mismatch between > > >>> the file's reported size and the actual data written to disk. > > >> > > >> I haven't looked into the truncate routine yet. I will see it later. > > >> > > >>> > > >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is > > >>>> because after removing the temp page, the DIRECT IO routine has already > > >>>> been waiting for all inflight WRITE requests, see > > >>>> > > >>>> # DIRECT read > > >>>> generic_file_read_iter > > >>>> kiocb_write_and_wait > > >>>> filemap_write_and_wait_range > > >>> > > >>> Where do you see generic_file_read_iter() getting called for direct io reads? > > >> > > >> # DIRECT read > > >> fuse_file_read_iter > > >> fuse_cache_read_iter > > >> generic_file_read_iter > > >> kiocb_write_and_wait > > >> filemap_write_and_wait_range > > >> a_ops->direct_IO(),i.e. fuse_direct_IO() > > >> > > > > > > Oh I see, I thought files opened with O_DIRECT automatically call the > > > .direct_IO handler for reads/writes but you're right, it first goes > > > through .read_iter / .write_iter handlers, and the .direct_IO handler > > > only gets invoked through generic_file_read_iter() / > > > generic_file_direct_write() in mm/filemap.c > > > > > > There's two paths for direct io in FUSE: > > > a) fuse server sets fi->direct_io = true when a file is opened, which > > > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side > > > b) fuse server doesn't set fi->direct_io = true, but the client opens > > > the file with O_DIRECT > > > > > > We only go through the stack trace you listed above for the b) case. > > > For the a) case, we'll hit > > > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > > return fuse_direct_read_iter(iocb, to); > > > > > > and > > > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > > return fuse_direct_write_iter(iocb, from); > > > > > > which will invoke fuse_direct_IO() / fuse_direct_io() without going > > > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / > > > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed > > > above. > > > > > > So for the a) case I think we'd still need the fuse_sync_writes() in > > > case there's still pending writeback. > > > > > > Do you agree with this analysis or am I missing something here? > > > > Yeah, that's true. But instead of calling fuse_sync_writes(), we can > > call filemap_wait_range() or something similar here. > > > > Agreed. Actually, the more I look at this, the more I think we can > replace all fuse_sync_writes() and get rid of it entirely. Right now, > fuse_sync_writes() is called in: This is nontrivial so I'll leave this optimization to be done as a separate future patchset so as to not slow this one down. Thanks, Joanne > > fuse_fsync(): > /* > * Start writeback against all dirty pages of the inode, then > * wait for all outstanding writes, before sending the FSYNC > * request. > */ > err = file_write_and_wait_range(file, start, end); > if (err) > goto out; > > fuse_sync_writes(inode); > > /* > * Due to implementation of fuse writeback > * file_write_and_wait_range() does not catch errors. > * We have to do this directly after fuse_sync_writes() > */ > err = file_check_and_advance_wb_err(file); > if (err) > goto out; > > > We can get rid of the fuse_sync_writes() and > file_check_and_advance_wb_err() entirely since now without temp pages, > the file_write_and_wait_range() call actually ensures that writeback > is completed > > > > fuse_writeback_range(): > static int fuse_writeback_range(struct inode *inode, loff_t > start, loff_t end) > { > int err = > filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); > > if (!err) > fuse_sync_writes(inode); > > return err; > } > > > We can replace fuse_writeback_range() entirely with > filemap_write_and_wait_range(). > > > > fuse_direct_io(): > if (fopen_direct_io && fc->direct_io_allow_mmap) { > res = filemap_write_and_wait_range(mapping, pos, pos + > count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } > if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + > count - 1))) { > if (!write) > inode_lock(inode); > fuse_sync_writes(inode); > if (!write) > inode_unlock(inode); > } > > > I think this can just replaced with > if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) { > res = filemap_write_and_wait_range(mapping, > pos, pos + count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } > since for the !fopen_direct_io case, it will already go through > filemap_write_and_wait_range(), as you mentioned in your previous > message. I think this also fixes a bug (?) in the original code - in > the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we > still need to write out dirty pages first, which we don't currently > do. > > > What do you think? > > Thanks for all your careful review on this patchset throughout all of > its iterations. > > > > > > > >> filp_close() > > >> filp_flush() > > >> filp->f_op->flush() > > >> fuse_flush() > > >> write_inode_now > > >> writeback_single_inode(WB_SYNC_ALL) > > >> do_writepages > > >> # flush dirty page > > >> filemap_fdatawait > > >> # wait for WRITE completion > > > > > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait(). > > > This seems pretty straightforward. I'll remove the fuse_sync_writes() > > > call in fuse_flush() when I send out v8. > > > > > > The direct io one above is less straight-forward. I won't add that to > > > v8 but that can be done in a separate future patch when we figure that > > > out. > > > > Thanks for keep working on this. Appreciated. > > > > -- > > Thanks, > > Jingbo