On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: > > Hi Joanne, > > On 4/5/25 2:14 AM, Joanne Koong wrote: > > In the current FUSE writeback design (see commit 3be5a52b30aa > > ("fuse: support writable mmap")), a temp page is allocated for every > > dirty page to be written back, the contents of the dirty page are copied over > > to the temp page, and the temp page gets handed to the server to write back. > > > > This is done so that writeback may be immediately cleared on the dirty page, > > and this in turn is done in order to mitigate the following deadlock scenario > > that may arise if reclaim waits on writeback on the dirty page to complete: > > * single-threaded FUSE server is in the middle of handling a request > > that needs a memory allocation > > * memory allocation triggers direct reclaim > > * direct reclaim waits on a folio under writeback > > * the FUSE server can't write back the folio since it's stuck in > > direct reclaim > > > > With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates > > the situations described above, FUSE writeback does not need to use > > temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. > > > > This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > > and removes the temporary pages + extra copying and the internal rb > > tree. > > > > fio benchmarks -- > > (using averages observed from 10 runs, throwing away outliers) > > > > Setup: > > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount > > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount > > > > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G > > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount > > > > bs = 1k 4k 1M > > Before 351 MiB/s 1818 MiB/s 1851 MiB/s > > After 341 MiB/s 2246 MiB/s 2685 MiB/s > > % diff -3% 23% 45% > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> > > Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx> > Hi Jingbo, Thanks for sharing your analysis for this. > Overall this patch LGTM. > > Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is > also unneeded then, at least the DIRECT IO routine (i.e. I took a look at fi->writectr and fi->queued_writes and my understanding is that we do still need this. For example, for truncates (I'm looking at fuse_do_setattr()), I think we still need to prevent concurrent writeback or else the setattr request and the writeback request could race which would result in a mismatch between the file's reported size and the actual data written to disk. > fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is > because after removing the temp page, the DIRECT IO routine has already > been waiting for all inflight WRITE requests, see > > # DIRECT read > generic_file_read_iter > kiocb_write_and_wait > filemap_write_and_wait_range Where do you see generic_file_read_iter() getting called for direct io reads? For direct io reads, I'm only seeing fuse_direct_IO() __fuse_direct_read() fuse_direct_io() and fuse_file_read_iter() fuse_direct_read_iter() fuse_direct_IO() / __fuse_direct_read() > > # DIRECT write > generic_file_write_iter > generic_file_direct_write > kiocb_invalidate_pages > filemap_invalidate_pages > filemap_write_and_wait_range Similarly, where do you see generic_file_write_iter() getting called for direct io writes? My understanding is that it'd either go through fuse_file_write_iter() -> fuse_direct_write_iter() or through the fuse_direct_IO() callback. > > The DIRECT write routine will also invalidate the page cache in the > range that is written to, so that the following buffer write needs to > read the page cache back first. The writeback following the buffer write > is much likely after the DIRECT write, so that the writeback won't > conflict with the DIRECT write (i.e. there won't be duplicate WRITE > requests for the same page that are initiated from DIRECT write and > writeback at the same time), which is exactly why fi->writectr and > fi->queued_writes are introduced. Where do you see fi->writectr / fi->queued-writes preventing this race? It looks to me like in the existing code, this race condition you described of direct write invalidating the page cache, then another buffer write reads the page cache and dirties it, then writeback is called on that, and the 2 write requests racing, could still happen? > However it seems that the writeback > won't wait for previous inflight DIRECT WRITE requests, so I'm not much > sure about that. Maybe other folks could offer more insights... My understanding is that these lines if (!cuse && filemap_range_has_writeback(...)) { ... fuse_sync_writes(inode); ... } in fuse_direct_io() is what waits on previous inflight direct write requests to complete before the direct io happens. > > Also fuse_sync_writes() is not needed in fuse_flush() anymore, with > which I'm pretty sure. Why don't we still need this for fuse_flush()? If a caller calls close(), this will call filp_close() filp_flush() filp->f_op->flush() fuse_flush() it seems like we should still be waiting for all writebacks to finish before sending the fuse server the fuse_flush request, no? > > The potential cleanup for fi->writectr and fi->queued_writes could be > offered as following separate patches (if any). > Thanks, Joanne > > > --- > > fs/fuse/file.c | 360 ++++------------------------------------------- > > fs/fuse/fuse_i.h | 3 - > > 2 files changed, 28 insertions(+), 335 deletions(-) > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > > index 754378dd9f71..91ada0208863 100644 > > --- a/fs/fuse/file.c > > +++ b/fs/fuse/file.c > > @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) > > > > struct fuse_writepage_args { > > struct fuse_io_args ia; > > - struct rb_node writepages_entry; > > struct list_head queue_entry; > > - struct fuse_writepage_args *next; > > struct inode *inode; > > struct fuse_sync_bucket *bucket; > > }; > > > > -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi, > > - pgoff_t idx_from, pgoff_t idx_to) > > -{ > > - struct rb_node *n; > > - > > - n = fi->writepages.rb_node; > > - > > - while (n) { > > - struct fuse_writepage_args *wpa; > > - pgoff_t curr_index; > > - > > -- > Thanks, > Jingbo