Re: Resetting pending fanotify events


On Tue, Apr 1, 2025 at 11:16 PM Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> wrote:
>
> Hopefully the formatting works well now. Also including some replies to
> questions from earlier in the thread in case they were lost.
>

Ibrahim,

I think this is an important aspect of productizing HSM, so thank
you for bringing it to our attention.

> > But what confuses me is the following: You have fanotify instance to which
> > you've got fd from fanotify_init(). For any process to be hanging, this fd
> > must be still held open by some process. Otherwise the fanotify instance
> > gets destroyed and all processes are free to run (they get FAN_ALLOW reply
> > if they were already waiting). So the fact that you see processes hanging
> > when your fanotify listener crashes means that you have likely leaked the
> > fd to some other process (lsof should be able to tell you which process has
> > still handle to fanotify instance). And the kernel has no way to know this
> > is not the process that will eventually read these events and reply...
>
> I can clarify this further. In our case it's important not to destroy the fanotify
> instance during daemon shutdown, as giving FAN_ALLOW to waiting processes could
> enable accessing a file which has not actually been populated. To this end, we
> persist the fd from fanotify_init across daemon restarts. In particular, since
> the daemon is a systemd unit, we rely on the systemd fd store (https://systemd.io/FILE_DESCRIPTOR_STORE/)
> for this, which essentially maintains a dup of the fanotify fd. This is why

I suspected that this might be the case.

I do not blame you for using the fd store, but I think it was a band-aid
in the absence of a better solution.

With a better solution, there will be no need to keep the fd alive and no
need to recover pending events.

See some suggestions below.

> we can run into the case of hanging events during a planned restart or an unintended
> crash. Here's a sample trace of a D-state process I had linked in an earlier reply:
>
> [<0>] fanotify_handle_event+0x8ac/0x10f0
> [<0>] fsnotify+0x5fb/0x8d0
> [<0>] __fsnotify_parent+0x17f/0x260
> [<0>] security_file_open+0x8f/0x130
> [<0>] vfs_open+0x109/0x4c0
> [<0>] path_openat+0x9a4/0x27d0
> [<0>] do_filp_open+0x91/0x120
> [<0>] bprm_execve+0x15c/0x690
> [<0>] do_execveat_common+0x22c/0x330
> [<0>] __x64_sys_execve+0x36/0x40
> [<0>] do_syscall_64+0x3d/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> Confirmed it was killable per Jan's clarification.
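For readers unfamiliar with the fd store mechanics referenced above: pushing a
group fd into the store boils down to one sendmsg() on $NOTIFY_SOCKET carrying
the fd via SCM_RIGHTS. A rough Python sketch, assuming the documented
sd_notify protocol (the helper names and the "fanotify-group" name are mine,
not part of any API):

```python
import array
import os
import socket

def fdstore_payload(name: str) -> bytes:
    # Datagram text understood by systemd's notify socket:
    # FDSTORE=1 pushes the attached fds into the unit's fd store,
    # FDNAME= tags them so they can be retrieved by name on restart.
    return f"FDSTORE=1\nFDNAME={name}\n".encode()

def push_fd_to_store(fd: int, name: str = "fanotify-group") -> None:
    # Hypothetical helper: hand the fanotify group fd to systemd so a
    # dup of it survives daemon restarts.  Requires running under a
    # unit with FileDescriptorStoreMax= set.
    addr = os.environ["NOTIFY_SOCKET"]
    if addr.startswith("@"):          # abstract-namespace socket
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendmsg(
            [fdstore_payload(name)],
            [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
              array.array("i", [fd]).tobytes())],
            0, addr)
```

On the next start, the daemon receives the stored fd back via
LISTEN_FDS/LISTEN_FDNAMES, which is how the pending events stay pinned to a
live fanotify instance across restarts.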
>
> > > > In this case, any events that have been read but not yet responded to would be lost.
> > > > Initially we considered handling this internally by saving the file descriptors for pending events,
> > > > however this proved to be complex to do in a robust manner.
> > > >
> > > > A more robust solution is to add a kernel fanotify api which resets the fanotify pending event queue,
> > > > thereby allowing us to recover pending events in the case of daemon restart.
> > > > A strawman implementation of this approach is in
> > > > https://github.com/torvalds/linux/compare/master...ibrahim-jirdeh:linux:fanotify-reset-pending,
> > > > a new ioctl that resets `group->fanotify_data.access_list`.
> > > > One other alternative we considered is directly exposing the pending event queue itself
> > > > (https://github.com/torvalds/linux/commit/cd90ff006fa2732d28ff6bb5975ca5351ce009f1)
> > > > to support monitoring pending events, but simply resetting the queue is likely sufficient for our use-case.
> > > >
> > > > What do you think of exposing this functionality in fanotify?
> > > >
> > >
> > > Ignoring the pending events for start, how do you deal with access to
> > > non-populated files while the daemon is down?
> > >
> > > We were throwing some idea about having a mount option (something
> > > like a "moderate" mount) to determine the default response for specific
> > > permission events (e.g. FAN_OPEN_PERM) in the case that there is
> > > no listener watching this event.
> > >
> > > If you have a filesystem which may contain non-populated files, you
> > > mount it as a "moderated" mount, and then access to all files is
> > > denied until the daemon is running, and also denied if the daemon is down.
> > >
> > > For restart, it might make sense to start a new daemon to start listening
> > > to events before stopping the old daemon.
> > > If the new daemon gets the events before the old daemon, things should
> > > be able to transition smoothly.
> >
> > I agree this would be a sensible protocol for updates. For unplanned crashes
> > I agree we need something like the "moderated" mount option.
>
> We can definitely try out the suggested approach of starting up a new daemon
> instance alongside the old one to prevent downtime during a planned restart.

Let me list a few approaches to this problem that were floated in the past.
You may choose bits and parts that you find useful to your use case.

1. Persistent marks
Some discussion here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjY3eDtqXObbso1KtZTMB7+zYHBRiUANg12hO=T=vqJrw@xxxxxxxxxxxxxx/
a persistent mark in an xattr to deny a certain operation -
quite hard to get this API right and probably overkill

2. Fanotify filter
https://lore.kernel.org/linux-fsdevel/CAPhsuW4psFtCVqHe2wK4RO2boCbcyPtfsGzHzzNU_1D0gsVoaA@xxxxxxxxxxxxxx/
While it was proposed as an optimization, you can also use it
as a guard to prevent access to unpopulated content -
if you can implement a simple kmod/bpf program that checks whether a file
has content or not, then you can set up a "guard group" first that
only checks if a file has content and allows or denies access
in the kernel.
You then hand its fd to the fd store to keep the group alive.
After that you start the HSM group that would be called before the guard group
to populate file content when needed.
The HSM group can use the same kmod/bpf filter to decide if the HSM userspace
needs to be called, for better performance.
A bit clumsy, but should work.
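For illustration, the predicate such a guard would evaluate in the kernel
could look like this userspace model (the stub xattr name is a made-up
convention for marking un-populated files, not an existing API):

```python
import os

# Hypothetical marker an HSM might set on files whose content
# has not been populated yet.
STUB_XATTR = "user.hsm.stub"

def has_content(path: str) -> bool:
    # Userspace model of the check a kmod/BPF fanotify filter would do:
    # treat a file as populated unless it carries the stub marker xattr.
    try:
        os.getxattr(path, STUB_XATTR)
        return False        # stub marker present -> not yet populated
    except OSError:
        return True         # no marker -> populated, allow in-kernel
```

The guard group would allow access when `has_content()` is true and deny (or
defer to the HSM group) otherwise, without any userspace round trip.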

3. Change the default response to pending events on group fd close
Support writing a response with
.fd = FAN_NOFD
.response = FAN_DENY | FAN_DEFAULT
to set a group parameter fanotify_data.default_response.

Instead of setting the response of pending events to FAN_ALLOW,
it could be set to FAN_DENY, or to a descriptive error like
FAN_DENY(ECONNRESET).

You could also set it to a newly defined response FAN_RETRY that
would result in ERESTARTSYS being returned to the caller, so the
syscall will hopefully be restarted and handled by the new HSM server instance.

The default response could be used in conjunction with fd store
instead of requeueing/resetting the pending events.
All the new service instance needs to do is get the fd from
the fd store and close it after having opened a new group fd.

That should be quite simple to implement compared to
persistent marks/moderated mounts/fanotify filters, and should suffice
if it is acceptable that, upon a crash of the HSM daemon, processes will be
blocked (killable) until the new HSM service instance comes up, after which
the processes will be able to continue.
This is kind of similar to the way that processes accessing an NFS
mount will behave during network outage.
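To make the proposed write concrete, here is a sketch of packing such a
response record (FAN_DEFAULT is the hypothetical new flag from this proposal
and does not exist in any released kernel; FAN_NOFD and FAN_DENY are real
fanotify constants):

```python
import struct

# Real values from <linux/fanotify.h>:
FAN_NOFD = -1       # "no fd" sentinel, normally used for queue overflow
FAN_DENY = 0x02     # permission event response: deny access

# Hypothetical flag proposed above (NOT in any released kernel):
FAN_DEFAULT = 0x100

def default_response_record(response: int = FAN_DENY | FAN_DEFAULT) -> bytes:
    # struct fanotify_response { __s32 fd; __u32 response; }
    # Writing this record with fd == FAN_NOFD to the group fd would, under
    # the proposal, set fanotify_data.default_response instead of answering
    # one specific event.
    return struct.pack("=iI", FAN_NOFD, response)
```

The daemon would write this record once at startup; afterwards, closing the
group fd (e.g. dropping the fd-store copy) resolves all pending events with
the configured default instead of FAN_ALLOW.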

WDYT?

Amir.




