On Thu, Apr 3, 2025 at 10:25 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Tue, Apr 01, 2025 at 06:28:11PM +0200, Jan Kara wrote: > > On Mon 31-03-25 21:08:51, Amir Goldstein wrote: > > > [CC Jan and Josef] > > > > CCed fsdevel. Actually replying here because the quoting in Ibrahim's email > > got somehow broken which made it very hard to understand. > > > > > I am keeping this discussion private because you did not post it to > > > the public list, > > > but if you can CC fsdevel in your reply that would be great, because it seems > > > like a question with interest to a wider audience. > > > > > > On Mon, Mar 31, 2025 at 8:19 PM Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> wrote: > > > > > > > > Hi Amir, > > > > > > > > We have been using fanotify to support lazily loading file contents. > > > > We are struggling with the problem that pending permission events cannot be recovered on daemon restart. > > > > > > > > We have a long-lived daemon that marks files with FAN_OPEN_PERM and populates their contents on access. > > > > It needs a reliable path for updates & crash recovery. > > > > The happy path for fanotify event processing is as follows: > > > > > > > > A notification is read from fanotify file descriptor > > > > File contents are populated > > > > We write back FAN_ALLOW to fanotify file descriptor, or DENY if content population failed. > > > > > > > > We would like to guarantee that all file accesses receive an ALLOW or DENY response, and no events are lost. > > > > > > Makes sense. > > > > > > > Unfortunately, today a filesystem client can hang (in D state) > > > > if the event-handler daemon crashes or restarts at the wrong time. > > > > > > Can you provide exact stack traces for those cases? > > > > > > I wonder how process gets to D state with commit fabf7f29b3e2 > > > ("fanotify: Use interruptible wait when waiting for permission events") > > > > So D state is expected when waiting for response. We are using > > TASK_UNINTERRUPTIBLE sleep (the above commit had to be effectively > > reverted). But we are also setting TASK_KILLABLE and TASK_FREEZABLE so that > > we don't block hibernation and tasks can be killed when fanotify listener > > misbehaves. > > > > But what confuses me is the following: You have fanotify instance to which > > you've got fd from fanotify_init(). For any process to be hanging, this fd > > must be still held open by some process. Otherwise the fanotify instance > > gets destroyed and all processes are free to run (they get FAN_ALLOW reply > > if they were already waiting). So the fact that you see processes hanging > > when your fanotify listener crashes means that you have likely leaked the > > fd to some other process (lsof should be able to tell you which process has > > still handle to fanotify instance). And the kernel has no way to know this > > is not the process that will eventually read these events and reply... > > > > > > In this case, any events that have been read but not yet responded to would be lost. > > > > Initially we considered handling this internally by saving the file descriptors for pending events, > > > > however this proved to be complex to do in a robust manner. > > > > > > > > A more robust solution is to add a kernel fanotify api which resets the fanotify pending event queue, > > > > thereby allowing us to recover pending events in the case of daemon restart. > > > > A strawman implementation of this approach is in > > > > https://github.com/torvalds/linux/compare/master...ibrahim-jirdeh:linux:fanotify-reset-pending, > > > > a new ioctl that resets `group->fanotify_data.access_list`. > > > > One other alternative we considered is directly exposing the pending event queue itself > > > > (https://github.com/torvalds/linux/commit/cd90ff006fa2732d28ff6bb5975ca5351ce009f1) > > > > to support monitoring pending events, but simply resetting the queue is likely sufficient for our use-case. > > > > > > > > What do you think of exposing this functionality in fanotify? > > > > > > > > > > Ignoring the pending events for start, how do you deal with access to > > > non-populated files while the daemon is down? > > > > > > We were throwing some idea about having a mount option (something > > > like a "moderate" mount) to determine the default response for specific > > > permission events (e.g. FAN_OPEN_PERM) in the case that there is > > > no listener watching this event. > > > > > > If you have a filesystem which may contain non-populated files, you > > > mount it with as "moderated" mount and then access to all files is > > > denied until the daemon is running and also denied if daemon is down. > > > > > > For restart, it might make sense to start a new daemon to start listening > > > to events before stopping the old daemon. > > > If the new daemon gets the events before the old daemon, things should > > > be able to transition smoothly. > > > > I agree this would be a sensible protocol for updates. For unplanned crashes > > I agree we need something like the "moderated" mount option. > > I hope you mean as a superblock option, not an actual mount option. > Because a per-mount option is not something I want us to do. We're not > giving a specific subsystem a way to alter per-mount behavior. I don't think that mount option is the right API for this. I think if we ever need to implement the semantics of "never allow unmoderated access to this fs or via this mount" to close the window of time until an fanotify permission mark is set then we will need to implement what you proposed once to allow setting an fanotify mark on a sb/mount before they are attached to the mount namespace. If I am ever saying "moderated mount" in an email, then this is actually what I mean. Thanks, Amir.