Re: Resetting pending fanotify events

Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> · Tue, 1 Apr 2025 14:16:09 -0700

Hopefully the formatting works well now. Also including some replies to
questions from earlier in the thread in case they were lost.

> But what confuses me is the following: You have fanotify instance to which
> you've got fd from fanotify_init(). For any process to be hanging, this fd
> must be still held open by some process. Otherwise the fanotify instance
> gets destroyed and all processes are free to run (they get FAN_ALLOW reply
> if they were already waiting). So the fact that you see processes hanging
> when your fanotify listener crashes means that you have likely leaked the
> fd to some other process (lsof should be able to tell you which process has
> still handle to fanotify instance). And the kernel has no way to know this
> is not the process that will eventually read these events and reply...

I can clarify this further. In our case it's important not to destroy the
fanotify instance during daemon shutdown, as giving FAN_ALLOW to waiting
processes could allow access to a file which has not actually been populated.
To this end, we persist the fd from fanotify_init() across daemon restarts. In
particular, since the daemon is a systemd unit, we rely on the systemd fd store
(https://systemd.io/FILE_DESCRIPTOR_STORE/) for this, which essentially
maintains a dup of the fanotify fd. This is why we can run into hanging events
during a planned restart or an unintended crash. Here's a sample trace of a
D-state process that I had linked in an earlier reply:

[<0>] fanotify_handle_event+0x8ac/0x10f0
[<0>] fsnotify+0x5fb/0x8d0
[<0>] __fsnotify_parent+0x17f/0x260
[<0>] security_file_open+0x8f/0x130
[<0>] vfs_open+0x109/0x4c0
[<0>] path_openat+0x9a4/0x27d0
[<0>] do_filp_open+0x91/0x120
[<0>] bprm_execve+0x15c/0x690
[<0>] do_execveat_common+0x22c/0x330
[<0>] __x64_sys_execve+0x36/0x40
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

I confirmed the process was killable, per Jan's clarification.
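
To make the fd store handoff concrete, here is a rough sketch of the retrieval
side after a restart. It only parses the LISTEN_PID / LISTEN_FDS /
LISTEN_FDNAMES environment that systemd sets when it passes stored fds back.
The helper name find_stored_fd() and the FDNAME "fanotify" are made up for
illustration and are not what our daemon actually uses. On the store side, the
fd would be pushed with something like
sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=fanotify", &fd, 1), and the
unit needs FileDescriptorStoreMax= set to a non-zero value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* First fd number systemd uses when passing fds (SD_LISTEN_FDS_START). */
#define LISTEN_FDS_START 3

/*
 * Hypothetical helper: return the fd that systemd handed back under
 * `name`, or -1 if no such fd was passed to this process.
 */
int find_stored_fd(const char *name)
{
    const char *pid_s = getenv("LISTEN_PID");
    const char *fds_s = getenv("LISTEN_FDS");
    const char *names = getenv("LISTEN_FDNAMES");

    assert(name != NULL);
    if (!pid_s || !fds_s || !names)
        return -1;

    /* The variables only apply to the process named in LISTEN_PID. */
    if (strtol(pid_s, NULL, 10) != (long)getpid())
        return -1;

    long n = strtol(fds_s, NULL, 10);
    size_t want = strlen(name);
    const char *p = names;

    /* LISTEN_FDNAMES is a colon-separated list, one name per passed fd. */
    for (long i = 0; i < n; i++) {
        size_t len = strcspn(p, ":");

        if (len == want && strncmp(p, name, len) == 0)
            return LISTEN_FDS_START + (int)i;
        if (p[len] == '\0')
            break;
        p += len + 1;
    }
    return -1;
}
```

On startup the daemon would call find_stored_fd("fanotify") first and fall
back to fanotify_init() only when it returns -1.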

> > > In this case, any events that have been read but not yet responded to would be lost.
> > > Initially we considered handling this internally by saving the file descriptors for pending events,
> > > however this proved to be complex to do in a robust manner.
> > >
> > > A more robust solution is to add a kernel fanotify api which resets the fanotify pending event queue,
> > > thereby allowing us to recover pending events in the case of daemon restart.
> > > A strawman implementation of this approach is in
> > > https://github.com/torvalds/linux/compare/master...ibrahim-jirdeh:linux:fanotify-reset-pending,
> > > a new ioctl that resets `group->fanotify_data.access_list`.
> > > One other alternative we considered is directly exposing the pending event queue itself
> > > (https://github.com/torvalds/linux/commit/cd90ff006fa2732d28ff6bb5975ca5351ce009f1)
> > > to support monitoring pending events, but simply resetting the queue is likely sufficient for our use-case.
> > >
> > > What do you think of exposing this functionality in fanotify?
> > >
> >
> > Ignoring the pending events for start, how do you deal with access to
> > non-populated files while the daemon is down?
> >
> > We were throwing some idea about having a mount option (something
> > like a "moderate" mount) to determine the default response for specific
> > permission events (e.g. FAN_OPEN_PERM) in the case that there is
> > no listener watching this event.
> >
> > If you have a filesystem which may contain non-populated files, you
> > mount it as a "moderated" mount and then access to all files is
> > denied until the daemon is running and also denied if daemon is down.
> >
> > For restart, it might make sense to start a new daemon to start listening
> > to events before stopping the old daemon.
> > If the new daemon gets the events before the old daemon, things should
> > be able to transition smoothly.
>
> I agree this would be a sensible protocol for updates. For unplanned crashes
> I agree we need something like the "moderated" mount option.

We can definitely try out the suggested approach of starting up a new daemon
instance alongside the old one to prevent downtime during a planned restart.
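
For reference, whichever instance ends up answering an event, the reply step
itself would look roughly like the sketch below. The function names and the
`populated` flag are hypothetical and stand in for whatever check the daemon
does after trying to populate the file. The only kernel interface used is
writing a struct fanotify_response to the fanotify fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: build the reply for one permission event.
 * Access is allowed only if the file content is known to be populated.
 */
struct fanotify_response make_response(const struct fanotify_event_metadata *ev,
                                       bool populated)
{
    assert(ev != NULL);

    struct fanotify_response resp = {
        .fd = ev->fd,
        .response = populated ? FAN_ALLOW : FAN_DENY,
    };
    return resp;
}

/*
 * Send the reply on the fanotify fd and close the per-event fd.
 * Returns 0 on success or a negative errno value.
 */
int reply_to_event(int fanotify_fd, const struct fanotify_event_metadata *ev,
                   bool populated)
{
    struct fanotify_response resp = make_response(ev, populated);
    ssize_t n = write(fanotify_fd, &resp, sizeof(resp));
    int ret = 0;

    if (n < 0)
        ret = -errno;
    else if ((size_t)n != sizeof(resp))
        ret = -EIO;

    /* The event fd belongs to the listener and must be closed by it. */
    close(ev->fd);
    return ret;
}
```

An event that was read by the old instance but never answered this way is
exactly the case that leaves the opener waiting, since the new instance never
sees that event's fd.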