On 4/16/25 18:43, Joanne Koong wrote:
On Tue, Apr 15, 2025 at 6:40 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
On 4/15/25 11:59 PM, Joanne Koong wrote:
On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
Hi Joanne,
Sorry for the late reply...
Hi Jingbo,
No worries at all.
On 4/11/25 12:11 AM, Joanne Koong wrote:
On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
On 4/10/25 11:07 PM, Joanne Koong wrote:
On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
On 4/10/25 7:47 AM, Joanne Koong wrote:
On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
Hi Joanne,
On 4/5/25 2:14 AM, Joanne Koong wrote:
In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied over
to the temp page, and the temp page gets handed to the server to write back.
This is done so that writeback may be immediately cleared on the dirty page,
and this in turn is done in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback on the dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
direct reclaim
With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
the situations described above, FUSE writeback does not need to use
temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
and removes the temporary pages + extra copying and the internal rb
tree.
fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)
Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
bs = 1k 4k 1M
Before 351 MiB/s 1818 MiB/s 1851 MiB/s
After 341 MiB/s 2246 MiB/s 2685 MiB/s
% diff -3% 23% 45%
Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
Reviewed-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
Acked-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
Hi Jingbo,
Thanks for sharing your analysis for this.
Overall this patch LGTM.
Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
also unneeded then, at least the DIRECT IO routine (i.e.
I took a look at fi->writectr and fi->queued_writes and my
understanding is that we do still need this. For example, for
truncates (I'm looking at fuse_do_setattr()), I think we still need to
prevent concurrent writeback or else the setattr request and the
writeback request could race which would result in a mismatch between
the file's reported size and the actual data written to disk.
I haven't looked into the truncate routine yet. I will see it later.
fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is
because after removing the temp page, the DIRECT IO routine has already
been waiting for all inflight WRITE requests, see
# DIRECT read
generic_file_read_iter
kiocb_write_and_wait
filemap_write_and_wait_range
Where do you see generic_file_read_iter() getting called for direct io reads?
# DIRECT read
fuse_file_read_iter
fuse_cache_read_iter
generic_file_read_iter
kiocb_write_and_wait
filemap_write_and_wait_range
a_ops->direct_IO(),i.e. fuse_direct_IO()
Oh I see, I thought files opened with O_DIRECT automatically call the
.direct_IO handler for reads/writes but you're right, it first goes
through .read_iter / .write_iter handlers, and the .direct_IO handler
only gets invoked through generic_file_read_iter() /
generic_file_direct_write() in mm/filemap.c
There's two paths for direct io in FUSE:
a) fuse server sets fi->direct_io = true when a file is opened, which
will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
b) fuse server doesn't set fi->direct_io = true, but the client opens
the file with O_DIRECT
We only go through the stack trace you listed above for the b) case.
For the a) case, we'll hit
if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_read_iter(iocb, to);
and
if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_write_iter(iocb, from);
which will invoke fuse_direct_IO() / fuse_direct_io() without going
through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
above.
So for the a) case I think we'd still need the fuse_sync_writes() in
case there's still pending writeback.
Do you agree with this analysis or am I missing something here?
Yeah, that's true. But instead of calling fuse_sync_writes(), we can
call filemap_wait_range() or something similar here.
Agreed. Actually, the more I look at this, the more I think we can
replace all fuse_sync_writes() and get rid of it entirely.
I have seen your latest reply that this cleaning up won't be included in
this series, which is okay.
fuse_sync_writes() is called in:
fuse_fsync():
/*
* Start writeback against all dirty pages of the inode, then
* wait for all outstanding writes, before sending the FSYNC
* request.
*/
err = file_write_and_wait_range(file, start, end);
if (err)
goto out;
fuse_sync_writes(inode);
/*
* Due to implementation of fuse writeback
* file_write_and_wait_range() does not catch errors.
* We have to do this directly after fuse_sync_writes()
*/
err = file_check_and_advance_wb_err(file);
if (err)
goto out;
We can get rid of the fuse_sync_writes() and
file_check_and_advance_wb_err() entirely since now without temp pages,
the file_write_and_wait_range() call actually ensures that writeback
is completed
fuse_writeback_range():
static int fuse_writeback_range(struct inode *inode, loff_t
start, loff_t end)
{
int err =
filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
if (!err)
fuse_sync_writes(inode);
return err;
}
We can replace fuse_writeback_range() entirely with
filemap_write_and_wait_range().
fuse_direct_io():
if (fopen_direct_io && fc->direct_io_allow_mmap) {
res = filemap_write_and_wait_range(mapping, pos, pos +
count - 1);
if (res) {
fuse_io_free(ia);
return res;
}
}
if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
count - 1))) {
if (!write)
inode_lock(inode);
fuse_sync_writes(inode);
if (!write)
inode_unlock(inode);
}
I think this can just replaced with
if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
res = filemap_write_and_wait_range(mapping,
pos, pos + count - 1);
if (res) {
fuse_io_free(ia);
return res;
}
}
Alright. But I would prefer doing this filemap_write_and_wait_range() in
fuse_direct_write_iter() rather than fuse_direct_io() if possible.
since for the !fopen_direct_io case, it will already go through
filemap_write_and_wait_range(), as you mentioned in your previous
message. I think this also fixes a bug (?) in the original code - in
the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
still need to write out dirty pages first, which we don't currently
do.
Nope. In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
won't be any page cache at all, right?
Isn't there still a page cache if the file was previously opened
without direct io and then the client opens another handle to that
file with direct io? In that case, the pages could still be dirty in
the page cache and would need to be written back first, no?
Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
not set by the FUSE server, while it is secondly opened, the flag is set?
Though the behavior of the FUSE daemon is quite confusing in this case,
it is completely possible in real life. So yes we'd better add
filemap_write_and_wait_range() unconditionally in fopen_direct_io case.
I think this behavior on the server side is pretty common. From what
I've seen on most servers, the server when handling the open sets
fi->direct_io depending on if the client opens with O_DIRECT, eg
if (fi->flags & O_DIRECT)
fi->direct_io = 1;
One should do that, actually, to get a shared lock on the inode.
With the additional
fi.parallel_direct_writes = 1;
If a client opens a file without O_DIRECT and then opens the same file
with O_DIRECT, then we run into this case. Though I'm not sure how
common it generally is for clients to do this.
I guess the common case is
application1 does mmap
application2 does normal read/write
fuse-server might set fi->direct_io = 1 on all opens, with the
additional
fuse_set_feature_flag(connp, FUSE_CAP_DIRECT_IO_ALLOW_MMAP);
Probably will only come to it tomorrow, but will review,
especially to check about cached/uncached io-modes.
Thanks,
Bernd