On Tue, Apr 01, 2025 at 06:28:11PM +0200, Jan Kara wrote: > On Mon 31-03-25 21:08:51, Amir Goldstein wrote: > > [CC Jan and Josef] > > CCed fsdevel. Actually replying here because the quoting in Ibrahim's email > got somehow broken which made it very hard to understand. > > > I am keeping this discussion private because you did not post it to > > the public list, > > but if you can CC fsdevel in your reply that would be great, because it seems > > like a question with interest to a wider audience. > > > > On Mon, Mar 31, 2025 at 8:19 PM Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> wrote: > > > > > > Hi Amir, > > > > > > We have been using fanotify to support lazily loading file contents. > > > We are struggling with the problem that pending permission events cannot be recovered on daemon restart. > > > > > > We have a long-lived daemon that marks files with FAN_OPEN_PERM and populates their contents on access. > > > It needs a reliable path for updates & crash recovery. > > > The happy path for fanotify event processing is as follows: > > > > > > A notification is read from fanotify file descriptor > > > File contents are populated > > > We write back FAN_ALLOW to fanotify file descriptor, or DENY if content population failed. > > > > > > We would like to guarantee that all file accesses receive an ALLOW or DENY response, and no events are lost. > > > > Makes sense. > > > > > Unfortunately, today a filesystem client can hang (in D state) > > > if the event-handler daemon crashes or restarts at the wrong time. > > > > Can you provide exact stack traces for those cases? > > > > I wonder how process gets to D state with commit fabf7f29b3e2 > > ("fanotify: Use interruptible wait when waiting for permission events") > > So D state is expected when waiting for response. We are using > TASK_UNINTERRUPTIBLE sleep (the above commit had to be effectively > reverted). But we are also setting TASK_KILLABLE and TASK_FREEZABLE so that > we don't block hibernation and tasks can be killed when fanotify listener > misbehaves. > > But what confuses me is the following: You have fanotify instance to which > you've got fd from fanotify_init(). For any process to be hanging, this fd > must be still held open by some process. Otherwise the fanotify instance > gets destroyed and all processes are free to run (they get FAN_ALLOW reply > if they were already waiting). So the fact that you see processes hanging > when your fanotify listener crashes means that you have likely leaked the > fd to some other process (lsof should be able to tell you which process has > still handle to fanotify instance). And the kernel has no way to know this > is not the process that will eventually read these events and reply... > > > > In this case, any events that have been read but not yet responded to would be lost. > > > Initially we considered handling this internally by saving the file descriptors for pending events, > > > however this proved to be complex to do in a robust manner. > > > > > > A more robust solution is to add a kernel fanotify api which resets the fanotify pending event queue, > > > thereby allowing us to recover pending events in the case of daemon restart. > > > A strawman implementation of this approach is in > > > https://github.com/torvalds/linux/compare/master...ibrahim-jirdeh:linux:fanotify-reset-pending, > > > a new ioctl that resets `group->fanotify_data.access_list`. > > > One other alternative we considered is directly exposing the pending event queue itself > > > (https://github.com/torvalds/linux/commit/cd90ff006fa2732d28ff6bb5975ca5351ce009f1) > > > to support monitoring pending events, but simply resetting the queue is likely sufficient for our use-case. > > > > > > What do you think of exposing this functionality in fanotify? > > > > > > > Ignoring the pending events for start, how do you deal with access to > > non-populated files while the daemon is down? > > > > We were throwing some idea about having a mount option (something > > like a "moderate" mount) to determine the default response for specific > > permission events (e.g. FAN_OPEN_PERM) in the case that there is > > no listener watching this event. > > > > If you have a filesystem which may contain non-populated files, you > > mount it with as "moderated" mount and then access to all files is > > denied until the daemon is running and also denied if daemon is down. > > > > For restart, it might make sense to start a new daemon to start listening > > to events before stopping the old daemon. > > If the new daemon gets the events before the old daemon, things should > > be able to transition smoothly. > > I agree this would be a sensible protocol for updates. For unplanned crashes > I agree we need something like the "moderated" mount option. I hope you mean as a superblock option, not an actual mount option. Because a per-mount option is not something I want us to do. We're not giving a specific subsystem a way to alter per-mount behavior.