On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > > > On 4/10/25 7:47 AM, Joanne Koong wrote: > > On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > >> > >> Hi Joanne, > >> > >> On 4/5/25 2:14 AM, Joanne Koong wrote: > >>> In the current FUSE writeback design (see commit 3be5a52b30aa > >>> ("fuse: support writable mmap")), a temp page is allocated for every > >>> dirty page to be written back, the contents of the dirty page are copied over > >>> to the temp page, and the temp page gets handed to the server to write back. > >>> > >>> This is done so that writeback may be immediately cleared on the dirty page, > >>> and this in turn is done in order to mitigate the following deadlock scenario > >>> that may arise if reclaim waits on writeback on the dirty page to complete: > >>> * single-threaded FUSE server is in the middle of handling a request > >>> that needs a memory allocation > >>> * memory allocation triggers direct reclaim > >>> * direct reclaim waits on a folio under writeback > >>> * the FUSE server can't write back the folio since it's stuck in > >>> direct reclaim > >>> > >>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates > >>> the situations described above, FUSE writeback does not need to use > >>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. > >>> > >>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > >>> and removes the temporary pages + extra copying and the internal rb > >>> tree. > >>> > >>> fio benchmarks -- > >>> (using averages observed from 10 runs, throwing away outliers) > >>> > >>> Setup: > >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount > >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount > >>> > >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G > >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount > >>> > >>> bs = 1k 4k 1M > >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > >>> % diff -3% 23% 45% > >>> > >>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > >>> Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> > >>> Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx> > >> > > > > Hi Jingbo, > > > > Thanks for sharing your analysis for this. > > > >> Overall this patch LGTM. > >> > >> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is > >> also unneeded then, at least the DIRECT IO routine (i.e. > > > > I took a look at fi->writectr and fi->queued_writes and my > > understanding is that we do still need this. For example, for > > truncates (I'm looking at fuse_do_setattr()), I think we still need to > > prevent concurrent writeback or else the setattr request and the > > writeback request could race which would result in a mismatch between > > the file's reported size and the actual data written to disk. > > I haven't looked into the truncate routine yet. I will see it later. > > > > >> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is > >> because after removing the temp page, the DIRECT IO routine has already > >> been waiting for all inflight WRITE requests, see > >> > >> # DIRECT read > >> generic_file_read_iter > >> kiocb_write_and_wait > >> filemap_write_and_wait_range > > > > Where do you see generic_file_read_iter() getting called for direct io reads? > > # DIRECT read > fuse_file_read_iter > fuse_cache_read_iter > generic_file_read_iter > kiocb_write_and_wait > filemap_write_and_wait_range > a_ops->direct_IO(),i.e. fuse_direct_IO() > Oh I see, I thought files opened with O_DIRECT automatically call the .direct_IO handler for reads/writes but you're right, it first goes through .read_iter / .write_iter handlers, and the .direct_IO handler only gets invoked through generic_file_read_iter() / generic_file_direct_write() in mm/filemap.c There's two paths for direct io in FUSE: a) fuse server sets fi->direct_io = true when a file is opened, which will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side b) fuse server doesn't set fi->direct_io = true, but the client opens the file with O_DIRECT We only go through the stack trace you listed above for the b) case. For the a) case, we'll hit if (ff->open_flags & FOPEN_DIRECT_IO) return fuse_direct_read_iter(iocb, to); and if (ff->open_flags & FOPEN_DIRECT_IO) return fuse_direct_write_iter(iocb, from); which will invoke fuse_direct_IO() / fuse_direct_io() without going through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed above. So for the a) case I think we'd still need the fuse_sync_writes() in case there's still pending writeback. Do you agree with this analysis or am I missing something here? > > > Similarly, where do you see generic_file_write_iter() getting called > > for direct io writes? > > # DIRECT read > fuse_file_write_iter > fuse_cache_write_iter > generic_file_write_iter > generic_file_direct_write > kiocb_invalidate_pages > filemap_invalidate_pages > filemap_write_and_wait_range > a_ops->direct_IO(),i.e. fuse_direct_IO() > > > > Where do you see fi->writectr / fi->queued-writes preventing this > > race? > > IMO overall fi->writectr / fi->queued-writes are introduced to prevent > DIRECT IO and writeback from sending duplicate (inflight) WRITE requests > for the same page. > > For the DIRECT write routine: > > # non-FOPEN_DIRECT_IO DIRECT write > fuse_cache_write_iter > fuse_direct_IO > fuse_direct_io > fuse_sync_writes > > > # FOPEN_DIRECT_IO DIRECT write > fuse_direct_write_iter > fuse_direct_IO > fuse_direct_io > fuse_sync_writes > > > For the writeback routine: > fuse_writepages() > fuse_writepages_fill > fuse_writepages_send > # buffer the WRITE request in queued_writes list > fuse_flush_writepages > # flush WRITE only when fi->writectr >= 0 > > > > > It looks to me like in the existing code, this race condition > > you described of direct write invalidating the page cache, then > > another buffer write reads the page cache and dirties it, then > > writeback is called on that, and the 2 write requests racing, could > > still happen? > > > > > >> However it seems that the writeback > >> won't wait for previous inflight DIRECT WRITE requests, so I'm not much > >> sure about that. Maybe other folks could offer more insights... > > > > My understanding is that these lines > > > > if (!cuse && filemap_range_has_writeback(...)) { > > ... > > fuse_sync_writes(inode); > > ... > > } > > > > in fuse_direct_io() is what waits on previous inflight direct write > > requests to complete before the direct io happens. > > Right. > > > > > > >> > >> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with > >> which I'm pretty sure. > > > > Why don't we still need this for fuse_flush()? > > > > If a caller calls close(), this will call > > > > filp_close() > > filp_flush() > > filp->f_op->flush() > > fuse_flush() > > > > it seems like we should still be waiting for all writebacks to finish > > before sending the fuse server the fuse_flush request, no? > > > > filp_close() > filp_flush() > filp->f_op->flush() > fuse_flush() > write_inode_now > writeback_single_inode(WB_SYNC_ALL) > do_writepages > # flush dirty page > filemap_fdatawait > # wait for WRITE completion Nice. I missed that write_inode_now() will invoke filemap_fdatawait(). This seems pretty straightforward. I'll remove the fuse_sync_writes() call in fuse_flush() when I send out v8. The direct io one above is less straight-forward. I won't add that to v8 but that can be done in a separate future patch when we figure that out. Thanks, Joanne > > >> > >>> --- > >>> fs/fuse/file.c | 360 ++++------------------------------------------- > >>> fs/fuse/fuse_i.h | 3 - > >>> 2 files changed, 28 insertions(+), 335 deletions(-) > >>> > >>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c > >>> index 754378dd9f71..91ada0208863 100644 > >>> --- a/fs/fuse/file.c > >>> +++ b/fs/fuse/file.c > >>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) > >>> > >>> struct fuse_writepage_args { > >>> struct fuse_io_args ia; > >>> - struct rb_node writepages_entry; > >>> struct list_head queue_entry; > >>> - struct fuse_writepage_args *next; > >>> struct inode *inode; > >>> struct fuse_sync_bucket *bucket; > >>> }; > >>> > >>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi, > >>> - pgoff_t idx_from, pgoff_t idx_to) > >>> -{ > >>> - struct rb_node *n; > >>> - > >>> - n = fi->writepages.rb_node; > >>> - > >>> - while (n) { > >>> - struct fuse_writepage_args *wpa; > >>> - pgoff_t curr_index; > >>> - > >> > >> -- > >> Thanks, > >> Jingbo > > -- > Thanks, > Jingbo