On Thu, Mar 20, 2025 at 6:00 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Tue, Feb 11, 2025 at 5:22 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> > > On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > > > >
> > > > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > > > its wider implications at the time.
> > > > > >
> > > > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > > > >
> > > > > > This could be used by a user - in the case of the prototype, an HSM service -
> > > > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > > > as the stricter fsfreeze() does.
> > > > > >
> > > > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > > > implement a crash-consistent change tracking method [3] without the
> > > > > > need to use the heavy fsfreeze() hammer.
> > > > >
> > > > > How does this provide any guarantee at all? It doesn't order or
> > > > > wait for physical IOs in any way, so writeback can be active on a
> > > > > file and writing data from both sides of a syscall write "barrier".
> > > > > i.e. there is no coherency between what is on disk, the cmtime of
> > > > > the inode and the write barrier itself.
> > > > >
> > > > > Freeze is an actual physical write barrier. A very heavy handed
> > > > > physical write barrier, yes, but it has very well defined and
> > > > > bounded physical data persistence semantics.
> > > > >
> > > > Yes. Freeze is a "write barrier to persistent storage".
> > > > This is not what "vfs write barrier" is about.
> > > > I will try to explain better.
> > > >
> > > > Some syscalls modify the data/metadata of filesystem objects in memory
> > > > (a.k.a. "in-core") and some syscalls query in-core data/metadata
> > > > of filesystem objects.
> > > >
> > > > It is often the case that in-core data/metadata readers are not fully
> > > > synchronized with in-core data/metadata writers, and it is often the case
> > > > that in-core data and metadata are not modified atomically w.r.t. the
> > > > in-core data/metadata readers.
> > > > Even related metadata attributes are often not modified atomically
> > > > w.r.t. their readers (e.g. statx()).
> > > >
> > > > When it comes to "observing changes", multigrain ctime/mtime has
> > > > improved things a lot for observing a change in ctime/mtime since
> > > > last sampled and for observing an order of ctime/mtime changes
> > > > on different inodes, but it hasn't changed the fact that ctime/mtime
> > > > changes can be observed *before* the respective metadata/data
> > > > changes can be observed.
> > > >
> > > > An example problem is that a naive backup or indexing program can
> > > > read old data/metadata with a new timestamp T and wrongly conclude
> > > > that it read all changes up to time T.
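
To make that race concrete, a naive change scanner along these lines falls
into the trap (an illustrative userspace sketch, not code from the prototype):

/*
 * Illustrative sketch of the naive pattern described above: trust that
 * data read after observing mtime == T reflects every change up to T.
 * Because the new timestamp can become visible before the written data
 * reaches the page cache, the scanner may record stale data under the
 * new timestamp and never look at this file again.
 */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int mtime_newer(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec > b->tv_sec ||
	       (a->tv_sec == b->tv_sec && a->tv_nsec > b->tv_nsec);
}

int scan_file(const char *path, struct timespec *last_seen)
{
	struct stat st;
	char buf[4096];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) == 0 && mtime_newer(&st.st_mtim, last_seen)) {
		/*
		 * WRONG assumption: this read may still return data that
		 * predates st.st_mtim, yet we remember T as "fully handled".
		 */
		if (read(fd, buf, sizeof(buf)) >= 0)
			*last_seen = st.st_mtim;
	}
	close(fd);
	return 0;
}
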
> > > >
> > > > It is true that "real" backup programs know that applications and
> > > > filesystems need to be quiesced before backup, but actual
> > > > day to day cloud storage sync programs and indexers cannot
> > > > practically freeze the filesystem for their work.
> > > >
> > > Right. That is still a known problem. For directory operations, the
> > > i_rwsem keeps things consistent, but for regular files, it's possible
> > > to see new timestamps alongside old file contents. That's a
> > > problem since caching algorithms that watch for timestamp changes can
> > > end up not seeing the new contents until the _next_ change occurs,
> > > which might not ever happen.
> > >
> > > It would be better to change the file write code to update the
> > > timestamps after copying data to the pagecache. It would still be
> > > possible in that case to see old attributes + new contents, but that's
> > > preferable to the reverse for callers that are watching for changes to
> > > attributes.
> > >
> > > Would fixing that help your use-case at all?
> >
> > I think Amir wanted to make a point here in the other direction: i.e., if
> > the application did:
> > * sample inode timestamp
> > * vfs_write_barrier()
> > * read file data
> >
> > then it is *guaranteed* it will never see old data & new timestamp and hence
> > the caching problem is solved. No need to update the timestamp after the write.
> >
> > Now I agree updating timestamps after write is much nicer from a usability
> > POV (given how common the pattern above is) but this is just a simple example
> > demonstrating possible uses for vfs_write_barrier().
> >
>
> I was trying to figure out if updating the timestamp after write would be enough
> to deal with file writes and I think that it is not enough when adding
> signalling (events) into the picture.
> In this case, the consumer is expected to act on changes (e.g. index/backup)
> soon after they happen.
> I think this case is different from the NFS cache case, which only cares about
> cache invalidation on file access(?).
>
> In any case, we need a FAN_PRE_MODIFY blocking event to store a
> persistent change intent record before the write - that is needed to find
> changes after a crash.
>
> Now unless we want to start polling ctime (and we do not want that),
> we need a signal to wake the consumer after the write to the page cache.
>
> One way is to rely on the FAN_MODIFY async event after the write.
> But there is ambiguity in the existing FAN_MODIFY events:
>
> Thread A starts write on file F (no listener for FAN_PRE_MODIFY)
> Event consumer starts
> Thread B starts write on file F
> FAN_PRE_MODIFY(F) reported from thread B
> Thread A completes write on file F
> FAN_MODIFY(F) reported from thread A (or from an aio completion thread)
> Event consumer believes it got the last event and can read the final
> version of F
>
> So if we use this method we will need a unique cookie to
> associate the POST_MODIFY with the PRE_MODIFY event.
>
> Something like this:
>
> writer                              [fsnotifyd]
> -------                             -------------
> file_start_write_usn()         =>   FAN_PRE_MODIFY[ fsid, usn, fhandle ]
> {                              <=   Record change intent before response
>   …do some in-core changes
>   (e.g. data + mode + ctime)...
> } file_end_write_usn()         =>   FAN_POST_MODIFY[ fsid, usn, fhandle ]
>                                     Consume changes after FAN_POST_MODIFY
>
> While this is a viable option, it adds yet more hooks and more
> events, and it does not provide an easy way for consumers to
> wait for the completion of a batch of modifications.
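
To make the cookie pairing concrete, the kernel side could look roughly like
the sketch below; the per-sb usn counter and the fsnotify_{pre,post}_modify()
helpers are made-up names for illustration only:

#include <linux/atomic.h>
#include <linux/fs.h>

/*
 * Sketch only: sb->s_usn and fsnotify_{pre,post}_modify() do not exist;
 * they just show how a cookie could tie FAN_POST_MODIFY back to the
 * FAN_PRE_MODIFY event that recorded the change intent.
 */
static u64 file_start_write_usn(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;
	u64 usn = atomic64_inc_return(&sb->s_usn);	/* hypothetical counter */

	/*
	 * Blocking pre-modify event: the listener persists the change
	 * intent record [ fsid, usn, fhandle ] before we may proceed.
	 */
	fsnotify_pre_modify(file, usn);
	return usn;
}

static void file_end_write_usn(struct file *file, u64 usn)
{
	/*
	 * Async post-modify event with the same cookie, so the consumer
	 * knows that this particular in-core modification has completed.
	 */
	fsnotify_post_modify(file, usn);
}
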
>
> The vfs_write_barrier method provides a better way to wait for completion:
>
> writer                              [fsnotifyd]
> -------                             -------------
> file_start_write_srcu()        =>   FAN_PRE_MODIFY[ fsid, usn, fhandle ]
> {                              <=   Record change intent before response
>   …do some in-core changes under srcu read lock
>   (e.g. data + mode + ctime)...
> } file_end_write_srcu()
> synchronize_srcu()             <=   vfs_write_barrier();
>                                     Consume a batch of recorded changes after the write barrier,
>                                     act on the changes and clear the change intent records
>
> I am hoping to be able to argue the case for vfs_write_barrier()
> at LSFMM, but if this will not be acceptable, I can work with the
> post modify events solution.
>

FYI, I had discussed it with some folks at LSFMM after my talk, and what was
apparent to me from this chat, and also from the questions during my
presentation, is that I did not succeed in explaining the problem.

I believe that the path forward for me, which is something that Jan has told
me from the beginning, is to implement a reference design of a persistent
change journal, because this is too complex an API to discuss without the
user code that uses it.

I am still on the fence about whether I want to do a userspace fsnotifyd or
a kernel persistent change journal library/subsystem as a reference design.
I do already have a kernel subsystem (ovl watch), so I may end up cleaning
that one up to use a proper fanotify API, and maybe that would be the way
to do it.

One more thing that I realised during LSFMM is that some filesystems
(e.g. NTFS, Lustre) already have an internal persistent change journal.
If I implement a kernel persistent change journal subsystem, then we could
use the same fanotify API to read events both from a filesystem that
implements its own persistent change journal and from a filesystem that
uses the fs-agnostic persistent change journal.

Thanks,
Amir.
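
P.S. For reference, a minimal sketch of what the srcu hooks in the diagram
above could look like; the placement and name of the srcu_struct in the
super_block are illustrative, not the actual prototype code:

#include <linux/fs.h>
#include <linux/srcu.h>

/* hypothetical per-sb srcu domain covering in-flight write syscalls */
static inline int file_start_write_srcu(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;

	return srcu_read_lock(&sb->s_write_srcu);	/* hypothetical field */
}

static inline void file_end_write_srcu(struct file *file, int idx)
{
	struct super_block *sb = file_inode(file)->i_sb;

	srcu_read_unlock(&sb->s_write_srcu, idx);
}

/*
 * Wait for every write syscall that entered its srcu read-side section
 * before this call to leave it.  New writers are not blocked, unlike
 * the stricter fsfreeze().
 */
static inline void vfs_write_barrier(struct super_block *sb)
{
	synchronize_srcu(&sb->s_write_srcu);
}

With that, the fsnotifyd loop in the diagram boils down to: record change
intents on FAN_PRE_MODIFY, call vfs_write_barrier(), then consume and clear
the recorded batch.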