On Thu, 3 Jul 2025 at 17:47, Amir Goldstein <amir73il@xxxxxxxxx> wrote: > Do you mean deadlock in userspace or deadlock on some kernel > lock because the server is operating on the filesystem and the > permission event was in a locked context? Server is doing in a loop: 1) read perm event from from fanotify fd 2) do something to decide between allow/deny 3) write reply to fanotify It doesn't matter where 2) gets stuck, in userspace or in the kernel, the result is the same: a stuck open syscall. > There is a nuance here. > Your patch does not count the time from when the operation was queued > and blocked. > > It counts the time from when the AV software *reads* the event. > If AV software went off to take a nap and does not read events, > you will not get a watchdog. Right. But server is unlikely to be doing anything between 3) and 1), so in practice the head of the queue is unlikely to be stuck. But maybe the watchdog could be taught to handle that case as well (no outstanding requests ad no forward progress in the queue). > If it is, then I wonder if you could share some details about the > analysis of the deadlocks. It's very often plain recursion: handling the event triggers another permission event, in some roundabout way, obviously, otherwise it would have been found in testing. This is apparently a continuing problem for support teams, because when this happens the OS is the first to blame: "everything froze, this is broke" and sometimes it's hard to convince customers and even harder to convince AV vendors, that it's not in fact the OS. > So are the deadlocks that you found happen on fs with atomic_open()? > Technically, we can release the dir inode lock in finish_open() > I think it's just a matter of code architecture to make this happen. I don't think it's this. > I think that's one of those features where sysctl knob is more useful to > distros and admin than Kconfig. > Might as well be a sysctl knob to control perm_group_timeout > where 0 means off. Okay. > It is a bit odd to have a pid_t pid field here as well as > struct pid *pid field in fae.pid (the event generating pid). > So I think, either reuse fae.pid to keep reference to reader task_pid > or leave this pid_t field here with a more specific name. Yeah, I noticed the name conflict, then forgot about it. Will fix. > It would have been more natural and balanced to add group > to watchdog list on fanotify_init(). > > You can do that based on (group->priority > FSNOTIFY_PRIO_NORMAL) > because while a program could create an fanotify group with > priority FAN_CLASS_CONTENT and not add permission event > watches, there is absolutely no reason to optimize for this case and > not add this group to the "permission events capable" perm_group list. Yeah, it's aesthetically more pleasing, but I wonder if it's worth it having to explain why it's done this way. A more self explanatory solution is to just move the list_empty() inside the spinlock, and performance-wise it's not going to matter. Thanks, Miklos