Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 4/10/25 11:07 PM, Joanne Koong wrote:
> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi Joanne,
>>>>
>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>
>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>   that needs a memory allocation
>>>>> * memory allocation triggers direct reclaim
>>>>> * direct reclaim waits on a folio under writeback
>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>   direct reclaim
>>>>>
>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>> the situations described above, FUSE writeback does not need to use
>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>
>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>> tree.
>>>>>
>>>>> fio benchmarks --
>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>
>>>>> Setup:
>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>
>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>
>>>>>         bs =  1k          4k            1M
>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>> % diff        -3%          23%         45%
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
>>>>> Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>> Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
>>>>
>>>
>>> Hi Jingbo,
>>>
>>> Thanks for sharing your analysis for this.
>>>
>>>> Overall this patch LGTM.
>>>>
>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>
>>> I took a look at fi->writectr and fi->queued_writes and my
>>> understanding is that we do still need this. For example, for
>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>> prevent concurrent writeback or else the setattr request and the
>>> writeback request could race which would result in a mismatch between
>>> the file's reported size and the actual data written to disk.
>>
>> I haven't looked into the truncate routine yet.  I will see it later.
>>
>>>
>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>> because after removing the temp page, the DIRECT IO routine has already
>>>> been waiting for all inflight WRITE requests, see
>>>>
>>>> # DIRECT read
>>>> generic_file_read_iter
>>>>   kiocb_write_and_wait
>>>>     filemap_write_and_wait_range
>>>
>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>
>> # DIRECT read
>> fuse_file_read_iter
>>   fuse_cache_read_iter
>>     generic_file_read_iter
>>       kiocb_write_and_wait
>>        filemap_write_and_wait_range
>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>>
> 
> Oh I see, I thought files opened with O_DIRECT automatically call the
> .direct_IO handler for reads/writes but you're right, it first goes
> through .read_iter / .write_iter handlers, and the .direct_IO handler
> only gets invoked through generic_file_read_iter() /
> generic_file_direct_write() in mm/filemap.c
> 
> There's two paths for direct io in FUSE:
> a) fuse server sets fi->direct_io = true when a file is opened, which
> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> b) fuse server doesn't set fi->direct_io = true, but the client opens
> the file with O_DIRECT
> 
> We only go through the stack trace you listed above for the b) case.
> For the a) case, we'll hit
> 
>         if (ff->open_flags & FOPEN_DIRECT_IO)
>                 return fuse_direct_read_iter(iocb, to);
> 
> and
> 
>         if (ff->open_flags & FOPEN_DIRECT_IO)
>                 return fuse_direct_write_iter(iocb, from);
> 
> which will invoke fuse_direct_IO() / fuse_direct_io() without going
> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> above.
> 
> So for the a) case I think we'd still need the fuse_sync_writes() in
> case there's still pending writeback.
> 
> Do you agree with this analysis or am I missing something here?

Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
call filemap_wait_range() or something similar here.



>> filp_close()
>>    filp_flush()
>>        filp->f_op->flush()
>>            fuse_flush()
>>              write_inode_now
>>                 writeback_single_inode(WB_SYNC_ALL)
>>                   do_writepages
>>                     # flush dirty page
>>                   filemap_fdatawait
>>                     # wait for WRITE completion
> 
> Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> This seems pretty straightforward. I'll remove the fuse_sync_writes()
> call in fuse_flush() when I send out v8.
> 
> The direct io one above is less straight-forward. I won't add that to
> v8 but that can be done in a separate future patch when we figure that
> out.

Thanks for keep working on this. Appreciated.

-- 
Thanks,
Jingbo




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux