Re: [RFC PATCH v2 0/3] fanotify HSM events for directories

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 3 Jul 2025 21:14:11 +0200

On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > >
> > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > on its "work" mount - the path via which directories are populated.
> > > > > >
> > > > > > There may be a middle ground:
> > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > >   path data
> > > > > >
> > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > non-populated directories.
> > > > >
> > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > >
> > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > with the guarantee that:
> > > > 1. Non-populated objects can never be accessed
> > > > 2. If the remote fetch service is up and the objects are accessed
> > > >     from a supported path (i.e. not overlayfs layer) then the objects
> > > >     will be populated on access
> > > >
> > > > This is stronger and more useful than silently serving invalid content IMO.
> > > >
> > > > This is related to the discussion about persistent marks and how to protect
> > > > against access to non-populated objects while service is down, but since
> > > > we have at least one case that can result in an EIO error (service down)
> > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > >
> > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > I just don't see this would be all that useful.
> > >
> >
> > Regarding overlayfs, I think there is an even bigger problem.
> > There is the promise that we are not calling the blocking pre-content hook
> > with freeze protection held.
> > In overlayfs it is very common to take the upper layer freeze protection
> > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > and perform lookups on upper fs or lower fs within this scope.
> > I am afraid that cleaning that up is not going to be realistic.
> >
> > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > will be mutually exclusive features.
> >
> > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > auto mount point.
> >
> > Is it quite common for users to want overlayfs mounted over
> > /var/lib/docker/overlay2
> > on the root fs.
> > HSM is not likely to be running on / and /etc, but likely on a very
> > distinct lazy populated source dir or something.
> > We can easily document and deny mounting overlayfs over subtrees where
> > HSM is enabled (or just pre-path events).
> >
> > This way we can provide HSM lazy dir populate to the users that do not care
> > about overlayfs without having to solve very hard to unsolvable issues.
> >
> > I will need to audit all the other users of vfs lookup helpers other than
> > overlayfs and nfsd, to estimate how many of them are pre-content event
> > safe and how many are a hopeless case.
> >
> > On the top of my head, trying to make a cachefilesd directory an HSM
> > directory is absolutely insane, so not every user of vfs lookup helpers
> > should be able to populate HSM content - should should simply fail
> > (with a meaningful kmsg log).
>
> Right. What you write makes a lot of sense. You've convinced me that
> returning error from overlayfs (or similar users) when they try to access
> HSM managed dir is the least painful solution :).
>

Oh oh, now I need to try to convince you of a solution that is less painful
than the least painful solution ;)

I have been experimenting with some code and also did a first pass audit
of the vfs lookup callers.

First of all, Neil's work to categorize the callers into lookup_noperm*
and lookup_one* really helped this audit. (thanks Neil!)

The lookup_noperm* callers are not "vfs users" they are internal fs
callers that should not call fsnotify pre-content hooks IMO.

The lookup_one* callers are vfs callers like ovl,cachefiles, as well
as nfsd,ksmbd.

Some of the lookup_one() calls are made from locked context, so not
good for pre-content events, but most of them are not relevant anyway
because they are not first access to dir (e.g. readdirplus which already
started to iterate dir).

Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
lookup_one* callers and before fs locks are taken looks doable.

But the more important observation I had is that allowing access to
dirs with unpopulated content is not that big of a deal.

Allowing access to files that are sparse files before their content is filled
could have led applications to fault and users to suffer.

Allowing access to unpopulated dirs, e.g. from overlayfs or even from
nfsd, just results in getting ENOENT or viewing an empty directory.

My conclusion is, that if we place the fsnotify lookup hook in
lookup_slow() then the only thing we need to do is:
When doing lookup_one*() from possibly unsafe context,
in a fs that has pre-dir-content watchers,
we always allow the lookup,
but we never let it leave a negative dcache entry.

If the lookup finds a child entry, then dir is anyway populated.
If dir is not populated, the -ENOENT result will not be cached,
so future lookups of the same name from safe context will call the hook again,
populate the file or entire directory and create positive/negative dentry,
and then following lookups of the same name will not call the hook.

The only flaw in this plan is that the users that do not populate
directories can create entries in those directories, which can violate
O_CREAT | O_EXCL and mkdir(), w.r.t to a remote file.

But this flaw can be reported properly by the HSM daemon when
trying to populate a directory which is marked as unpopulated and
finding files inside it.

HSM could auto-resolve those conflicts or prompt admin for action
and can return ESTALE error (or a like) to users.

Was I clear? Does that sound reasonable to you?

Thanks,
Amir.