Re: [RFC PATCH v2 0/3] fanotify HSM events for directories

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 9 Jul 2025 11:00:40 +0200

On Tue, Jul 8, 2025 at 5:32 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Fri, Jul 4, 2025 at 12:58 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > On Fri, Jul 4, 2025 at 11:24 AM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Thu 03-07-25 21:14:11, Amir Goldstein wrote:
> > > > On Tue, Jun 17, 2025 at 11:43 AM Jan Kara <jack@xxxxxxx> wrote:
> > > > > On Mon 16-06-25 19:00:42, Amir Goldstein wrote:
> > > > > > On Mon, Jun 16, 2025 at 11:07 AM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > On Tue 10-06-25 17:25:48, Amir Goldstein wrote:
> > > > > > > > On Tue, Jun 10, 2025 at 3:49 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > > > On Wed 04-06-25 18:09:15, Amir Goldstein wrote:
> > > > > > > > > > If we decide that we want to support FAN_PATH_ACCESS from all the
> > > > > > > > > > path-less lookup_one*() helpers, then we need to support reporting
> > > > > > > > > > FAN_PATH_ACCESS event with directory fid.
> > > > > > > > > >
> > > > > > > > > > If we allow FAN_PATH_ACCESS event from path-less vfs helpers, we still
> > > > > > > > > > have to allow setting FAN_PATH_ACCESS in a mount mark/ignore mask, because
> > > > > > > > > > we need to provide a way for HSM to opt-out of FAN_PATH_ACCESS events
> > > > > > > > > > on its "work" mount - the path via which directories are populated.
> > > > > > > > > >
> > > > > > > > > > There may be a middle ground:
> > > > > > > > > > - Pass optional path arg to __lookup_slow() (i.e. from walk_component())
> > > > > > > > > > - Move fsnotify hook into __lookup_slow()
> > > > > > > > > > - fsnotify_lookup_perm() passes optional path data to fsnotify()
> > > > > > > > > > - fanotify_handle_event() returns -EPERM for FAN_PATH_ACCESS without
> > > > > > > > > >   path data
> > > > > > > > > >
> > > > > > > > > > This way, if HSM is enabled on an sb and not ignored on specific dir
> > > > > > > > > > after it was populated, path lookup from syscall will trigger
> > > > > > > > > > FAN_PATH_ACCESS events and overalyfs/nfsd will fail to lookup inside
> > > > > > > > > > non-populated directories.
> > > > > > > > >
> > > > > > > > > OK, but how will this manifest from the user POV? If we have say nfs
> > > > > > > > > exported filesystem that is HSM managed then there would have to be some
> > > > > > > > > knowledge in nfsd to know how to access needed files so that HSM can pull
> > > > > > > > > them? I guess I'm missing the advantage of this middle-ground solution...
> > > > > > > >
> > > > > > > > The advantage is that an admin is able to set up a "lazy populated fs"
> > > > > > > > with the guarantee that:
> > > > > > > > 1. Non-populated objects can never be accessed
> > > > > > > > 2. If the remote fetch service is up and the objects are accessed
> > > > > > > >     from a supported path (i.e. not overlayfs layer) then the objects
> > > > > > > >     will be populated on access
> > > > > > > >
> > > > > > > > This is stronger and more useful than silently serving invalid content IMO.
> > > > > > > >
> > > > > > > > This is related to the discussion about persistent marks and how to protect
> > > > > > > > against access to non-populated objects while service is down, but since
> > > > > > > > we have at least one case that can result in an EIO error (service down)
> > > > > > > > then another case (access from overlayfs) maybe is not a game changer(?)
> > > > > > >
> > > > > > > Yes, reporting error for unpopulated content would be acceptable behavior.
> > > > > > > I just don't see this would be all that useful.
> > > > > > >
> > > > > >
> > > > > > Regarding overlayfs, I think there is an even bigger problem.
> > > > > > There is the promise that we are not calling the blocking pre-content hook
> > > > > > with freeze protection held.
> > > > > > In overlayfs it is very common to take the upper layer freeze protection
> > > > > > for a relatively large scope (e.g. ovl_want_write() in ovl_create_object())
> > > > > > and perform lookups on upper fs or lower fs within this scope.
> > > > > > I am afraid that cleaning that up is not going to be realistic.
> > > > > >
> > > > > > IMO, it is perfectly reasonable that overlayfs and HSM (at least pre-dir-access)
> > > > > > will be mutually exclusive features.
> > > > > >
> > > > > > This is quite similar to overlayfs resulting in EIO if lower fs has an
> > > > > > auto mount point.
> > > > > >
> > > > > > Is it quite common for users to want overlayfs mounted over
> > > > > > /var/lib/docker/overlay2
> > > > > > on the root fs.
> > > > > > HSM is not likely to be running on / and /etc, but likely on a very
> > > > > > distinct lazy populated source dir or something.
> > > > > > We can easily document and deny mounting overlayfs over subtrees where
> > > > > > HSM is enabled (or just pre-path events).
> > > > > >
> > > > > > This way we can provide HSM lazy dir populate to the users that do not care
> > > > > > about overlayfs without having to solve very hard to unsolvable issues.
> > > > > >
> > > > > > I will need to audit all the other users of vfs lookup helpers other than
> > > > > > overlayfs and nfsd, to estimate how many of them are pre-content event
> > > > > > safe and how many are a hopeless case.
> > > > > >
> > > > > > On the top of my head, trying to make a cachefilesd directory an HSM
> > > > > > directory is absolutely insane, so not every user of vfs lookup helpers
> > > > > > should be able to populate HSM content - should should simply fail
> > > > > > (with a meaningful kmsg log).
> > > > >
> > > > > Right. What you write makes a lot of sense. You've convinced me that
> > > > > returning error from overlayfs (or similar users) when they try to access
> > > > > HSM managed dir is the least painful solution :).
> > > > >
> > > >
> > > > Oh oh, now I need to try to convince you of a solution that is less painful
> > > > than the least painful solution ;)
> > >
> > > :)
> > >
> > > > I have been experimenting with some code and also did a first pass audit
> > > > of the vfs lookup callers.
> > > >
> > > > First of all, Neil's work to categorize the callers into lookup_noperm*
> > > > and lookup_one* really helped this audit. (thanks Neil!)
> > > >
> > > > The lookup_noperm* callers are not "vfs users" they are internal fs
> > > > callers that should not call fsnotify pre-content hooks IMO.
> > > >
> > > > The lookup_one* callers are vfs callers like ovl,cachefiles, as well
> > > > as nfsd,ksmbd.
> > > >
> > > > Some of the lookup_one() calls are made from locked context, so not
> > > > good for pre-content events, but most of them are not relevant anyway
> > > > because they are not first access to dir (e.g. readdirplus which already
> > > > started to iterate dir).
> > > >
> > > > Adding lookup pre-content hooks to nfsd and ksmbd before the relevant
> > > > lookup_one* callers and before fs locks are taken looks doable.
> > > >
> > > > But the more important observation I had is that allowing access to
> > > > dirs with unpopulated content is not that big of a deal.
> > > >
> > > > Allowing access to files that are sparse files before their content is filled
> > > > could have led applications to fault and users to suffer.
> > > >
> > > > Allowing access to unpopulated dirs, e.g. from overlayfs or even from
> > > > nfsd, just results in getting ENOENT or viewing an empty directory.
> > >
> > > Right. Although if some important files would be missing, you'd still cause
> > > troubles to applications and possible crashes (or app shutdowns). But I
> > > take the ENOENT return in this case as a particular implementation of the
> > > "just return error to userspace if we have no chance to handle the lookup
> > > in this context" policy.
> > >
> > > > My conclusion is, that if we place the fsnotify lookup hook in
> > > > lookup_slow() then the only thing we need to do is:
> > > > When doing lookup_one*() from possibly unsafe context,
> > > > in a fs that has pre-dir-content watchers,
> > > > we always allow the lookup,
> > > > but we never let it leave a negative dcache entry.
> > > >
> > > > If the lookup finds a child entry, then dir is anyway populated.
> > > > If dir is not populated, the -ENOENT result will not be cached,
> > > > so future lookups of the same name from safe context will call the hook again,
> > > > populate the file or entire directory and create positive/negative dentry,
> > > > and then following lookups of the same name will not call the hook.
> > >
> > > Yes, this looks pretty much like what we've agreed on before, just now the
> > > implementation is getting more concrete shape. Or am I missing something?
> > >
> >
> > What (I think) we discussed before was to fail *any* lookup from
> > internal vfs callers to HSM moderated fs, so ovl would also not be able to
> > access a populated directory in that case.
> >
> > What I am suggesting is to always allow the lookup in HSM fs
> > and depending on a negative lookup result do "something".
> >
> > There is a nuance here.
> > Obviously, userspace will get ENOENT in this case, but
> > does lookup_one() succeed and returns a negative unhashed
> > dentry (e.g. to ovl) or does it drop the dentry and fail with -ENOENT?
> >
> > I was thinking of the former, but I think you are implying the latter,
> > which is indeed a bit closer to what we agreed on.
> >
> > For callers that use lookup_one_positive_unlocked()
> > like ovl_lookup(), it makes no difference, but for callers that
> > create new entries like ovl_create_upper() it means failure to create
> > and that is desirable IMO.
> >
> > I guess, if ovl_mkdir() fails with -ENOENT users would be a bit confused
> > but in a way, this parent directory does not fully exist yet, so
> > it may be good enough.
> >
> > We could also annotate those calls as lookup_one_for_create()
> > to return -EROFS in the negative lookup result in HSM moderated dir,
> > but not sure that this is needed or if it is less confusing to users.
> >
> > What's even nicer is that for overlayfs it is the more likely case that
> > only the lower layer fs is HSM moderated, e.g. for a composefs
> > "image repository".
> >
> > Adding safe pre-dir-content hooks for overlayfs lower layer lookup
> > may be possible down the road. A lot easier that supporting
> > lazy dir populates in a rw layer.
> >
> >
>
> FYI, here is a WIP branch for the scheme that we discussed here:
>
> https://github.com/amir73il/linux/commits/fan_pre_dir_access/
>

Some more commentary of design choices in this WIP patch set.

We have discussed in the context of page fault events the concept of
a "handle once" HSM event per file range, where events are not
generated if page cache pages are already populated, even if said
page cache pages were populated before the HSM marks were set up.

The pre-content events on page fault did not happen eventually,
but as far as API documentation is concerned, we are still allowed
to suppress multiple pre-content events on the same file range.

A similar concept was applied to pre-dir-content events w.r.t dcache,
but with a few nuances.

The pre-dir-content event from lookup on NAME in DIR is only generated
in lookup_slow() when the dcache entry of NAME is not populated.
If a positive/negative dentry was created before setup of HSM mark,
the event will not be generated.

***This is the new part that we did not discuss that I implemented:***
When a pre-dir-content event WITHOUT a NAME (i.e. from readdir)
is handled by HSM (i.e. FAN_ALLOW response), the dentry is marked
DCACHE_HSM_ONCE and no further pre-dir-content events are generated
on that directory.
The DCACHE_HSM_ONCE flag is not set when there are no HSM marks
or when a directory has an HSM ignore mark!
***

IMO this behavior is inline with the former "handle once" strategy.

> There is an LTP branch of the same name that passes tests.
>
> I also added two simple patches for nfsd support for pre-dir-content events
> but they are optional, to demonstrate that internal users could be
> supported later.
>

The "handle once" design helps with the implementation of nfsd
pre-dir-content hook.
For simplification, those hooks were implemented in the permission check inside
fh_verify(fhp, NFSD_MAY_EXEC) which is called before every nfsd lookup_one()
operation. fhp is the context for the entire "transaction", and this
hook position
is usually [*] safe w.r.t held fs locks.

The DCACHE_HSM_ONCE design, guarantees that there will be a single
"safe context" pre-dir-content event in fh_verify(fhp, NFSD_MAY_EXEC)
and the latter hooks in lookup_one() with freeze protection held will be
suppressed.

However, the nfsd "safe" pre-dir-content hooks are not a must for merging
the pre-dir-content event feature.

If the nfsd "safe" hooks are not merged, nfsd (as will overlayfs) will still be
able to safely do lookup_one() on directories where HSM marks exist
with some caveats:
- A directory that was already populated (DCACHE_HSM_ONCE)
  will be fully accessible (read/write) for nfsd
- An existing child name (positive dentry) can be looked up by nfsd
- A lookup of non-existing name lookup will return -ENOENT
  as it should (without leaving a negative dentry behind)
- An attempt to create a positive dentry (create/rename) will fail
  with -ENOENT (as if directory is IS_DEADDIR)

[*] in nfsd_create_locked() the fh_verify(fhp, NFSD_MAY_EXEC) is not
in safe context, so there is also a preemptive hook also in fh_want_write(fhp)
in nfsd_create().
CC nfsd guys to explain to me why nfsd_create() has a NFSD_MAY_NOP
access permission and not NFSD_MAY_EXEC.

Thanks,
Amir.