On Mon 31-03-25 21:08:51, Amir Goldstein wrote: > [CC Jan and Josef] CCed fsdevel. Actually replying here because the quoting in Ibrahim's email got somehow broken which made it very hard to understand. > I am keeping this discussion private because you did not post it to > the public list, > but if you can CC fsdevel in your reply that would be great, because it seems > like a question with interest to a wider audience. > > On Mon, Mar 31, 2025 at 8:19 PM Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> wrote: > > > > Hi Amir, > > > > We have been using fanotify to support lazily loading file contents. > > We are struggling with the problem that pending permission events cannot be recovered on daemon restart. > > > > We have a long-lived daemon that marks files with FAN_OPEN_PERM and populates their contents on access. > > It needs a reliable path for updates & crash recovery. > > The happy path for fanotify event processing is as follows: > > > > A notification is read from fanotify file descriptor > > File contents are populated > > We write back FAN_ALLOW to fanotify file descriptor, or DENY if content population failed. > > > > We would like to guarantee that all file accesses receive an ALLOW or DENY response, and no events are lost. > > Makes sense. > > > Unfortunately, today a filesystem client can hang (in D state) > > if the event-handler daemon crashes or restarts at the wrong time. > > Can you provide exact stack traces for those cases? > > I wonder how process gets to D state with commit fabf7f29b3e2 > ("fanotify: Use interruptible wait when waiting for permission events") So D state is expected when waiting for response. We are using TASK_UNINTERRUPTIBLE sleep (the above commit had to be effectively reverted). But we are also setting TASK_KILLABLE and TASK_FREEZABLE so that we don't block hibernation and tasks can be killed when fanotify listener misbehaves. But what confuses me is the following: You have fanotify instance to which you've got fd from fanotify_init(). For any process to be hanging, this fd must be still held open by some process. Otherwise the fanotify instance gets destroyed and all processes are free to run (they get FAN_ALLOW reply if they were already waiting). So the fact that you see processes hanging when your fanotify listener crashes means that you have likely leaked the fd to some other process (lsof should be able to tell you which process has still handle to fanotify instance). And the kernel has no way to know this is not the process that will eventually read these events and reply... > > In this case, any events that have been read but not yet responded to would be lost. > > Initially we considered handling this internally by saving the file descriptors for pending events, > > however this proved to be complex to do in a robust manner. > > > > A more robust solution is to add a kernel fanotify api which resets the fanotify pending event queue, > > thereby allowing us to recover pending events in the case of daemon restart. > > A strawman implementation of this approach is in > > https://github.com/torvalds/linux/compare/master...ibrahim-jirdeh:linux:fanotify-reset-pending, > > a new ioctl that resets `group->fanotify_data.access_list`. > > One other alternative we considered is directly exposing the pending event queue itself > > (https://github.com/torvalds/linux/commit/cd90ff006fa2732d28ff6bb5975ca5351ce009f1) > > to support monitoring pending events, but simply resetting the queue is likely sufficient for our use-case. > > > > What do you think of exposing this functionality in fanotify? > > > > Ignoring the pending events for start, how do you deal with access to > non-populated files while the daemon is down? > > We were throwing some idea about having a mount option (something > like a "moderate" mount) to determine the default response for specific > permission events (e.g. FAN_OPEN_PERM) in the case that there is > no listener watching this event. > > If you have a filesystem which may contain non-populated files, you > mount it with as "moderated" mount and then access to all files is > denied until the daemon is running and also denied if daemon is down. > > For restart, it might make sense to start a new daemon to start listening > to events before stopping the old daemon. > If the new daemon gets the events before the old daemon, things should > be able to transition smoothly. I agree this would be a sensible protocol for updates. For unplanned crashes I agree we need something like the "moderated" mount option. > Of course, if an old daemon can shutdown and leave processes in > uninterruptible sleep, that is a critical bug that needs to be fixed > regardless of the handover problem. If the fanotify fd got closed and the instance shutdown, this would be indeed a serious bug (likely UAF issue). But so far I rather suspect the fd is just in some fd table somewhere... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR