On Thu, Sep 11, 2025 at 12:12 PM Jan Kara <jack@xxxxxxx> wrote: > > On Tue 09-09-25 16:30:47, Miklos Szeredi wrote: > > This is to make it easier to debug issues with AV software, which time and > > again deadlocks with no indication of where the issue comes from, and the > > kernel being blamed for the deadlock. Then we need to analyze dumps to > > prove that the kernel is not in fact at fault. > > > > The deadlock comes from recursion: handling the event triggers another > > permission event, in some roundabout way, obviously, otherwise it would > > have been found in testing. > > > > With this patch a warning is printed when permission event is received by > > userspace but not answered for more than the timeout specified in > > /proc/sys/fs/fanotify/watchdog_timeout. The watchdog can be turned off by > > setting the timeout to zero (which is the default). > > > > The timeout is very coarse (T <= t < 2T) but I guess it's good enough for > > the purpose. > > > > Overhead should be minimal. > > > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx> > > Overall looks good. Just some nits below, I'll fix them up on commit if you > won't object. No comments beyond what you wrote. Feel free to add: Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx> > > > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h > > index b78308975082..1a007e211bae 100644 > > --- a/fs/notify/fanotify/fanotify.h > > +++ b/fs/notify/fanotify/fanotify.h > > @@ -437,11 +437,13 @@ FANOTIFY_ME(struct fanotify_event *event) > > struct fanotify_perm_event { > > struct fanotify_event fae; > > struct path path; > > - const loff_t *ppos; /* optional file range info */ > > + const loff_t *ppos; /* optional file range info */ > > Stray modification. > > > size_t count; > > u32 response; /* userspace answer to the event */ > > unsigned short state; /* state of the event */ > > + unsigned short watchdog_cnt; /* already scanned by watchdog? */ > > int fd; /* fd we passed to userspace for this event */ > > + pid_t recv_pid; /* pid of task receiving the event */ > > union { > > struct fanotify_response_info_header hdr; > > struct fanotify_response_info_audit_rule audit_rule; > ... > > @@ -95,6 +104,84 @@ static void __init fanotify_sysctls_init(void) > > #define fanotify_sysctls_init() do { } while (0) > > #endif /* CONFIG_SYSCTL */ > > > > +static LIST_HEAD(perm_group_list); > > +static DEFINE_SPINLOCK(perm_group_lock); > > +static void perm_group_watchdog(struct work_struct *work); > > +static DECLARE_DELAYED_WORK(perm_group_work, perm_group_watchdog); > > + > > +static void perm_group_watchdog_schedule(void) > > +{ > > + schedule_delayed_work(&perm_group_work, secs_to_jiffies(perm_group_timeout)); > > +} > > + > > +static void perm_group_watchdog(struct work_struct *work) > > +{ > > + struct fsnotify_group *group; > > + struct fanotify_perm_event *event; > > + struct task_struct *task; > > + pid_t failed_pid = 0; > > + > > + guard(spinlock)(&perm_group_lock); > > + if (list_empty(&perm_group_list)) > > + return; > > + > > + list_for_each_entry(group, &perm_group_list, fanotify_data.perm_group) { > > + /* > > + * Ok to test without lock, racing with an addition is > > + * fine, will deal with it next round > > + */ > > + if (list_empty(&group->fanotify_data.access_list)) > > + continue; > > + > > + scoped_guard(spinlock, &group->notification_lock) { > > Frankly, I don't see the scoped guard bringing benefit here. It just shifts > indentation level by 1 which makes some of the lines below longer than I > like :) > > > + list_for_each_entry(event, &group->fanotify_data.access_list, fae.fse.list) { > > + if (likely(event->watchdog_cnt == 0)) { > > + event->watchdog_cnt = 1; > > + } else if (event->watchdog_cnt == 1) { > > + /* Report on event only once */ > > + event->watchdog_cnt = 2; > > + > > + /* Do not report same pid repeatedly */ > > + if (event->recv_pid == failed_pid) > > + continue; > > + > > + failed_pid = event->recv_pid; > > + rcu_read_lock(); > > + task = find_task_by_pid_ns(event->recv_pid, &init_pid_ns); > > + pr_warn_ratelimited("PID %u (%s) failed to respond to fanotify queue for more than %i seconds\n", > > Use %d instead of %i here? IMHO we use %d everywhere in the kernel. I had > to look up whether %i is really signed int. > > > + event->recv_pid, task ? task->comm : NULL, perm_group_timeout); > > + rcu_read_unlock(); > > + } > > + } > > I'm wondering if we should cond_resched() somewhere in these loops. There > could be *many* events pending... OTOH continuing the iteration afterwards > would be non-trivial so probably let's keep our fingers crossed that > softlockups won't trigger... > > > + } > > + } > > + perm_group_watchdog_schedule(); > > +} > > + > > +static void fanotify_perm_watchdog_group_remove(struct fsnotify_group *group) > > +{ > > + if (!list_empty(&group->fanotify_data.perm_group)) { > > + /* Perm event watchdog can no longer scan this group. */ > > + spin_lock(&perm_group_lock); > > + list_del(&group->fanotify_data.perm_group); > > list_del_init() here would give me a better peace of mind... It's not like > the performance matters here. > > > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > > index d4034ddaf392..7f7fe4f3aa34 100644 > > --- a/include/linux/fsnotify_backend.h > > +++ b/include/linux/fsnotify_backend.h > > @@ -273,6 +273,8 @@ struct fsnotify_group { > > int f_flags; /* event_f_flags from fanotify_init() */ > > struct ucounts *ucounts; > > mempool_t error_events_pool; > > + /* chained on perm_group_list */ > > + struct list_head perm_group; > > Can we call this perm_group_list, perm_list or simply something with 'list' > in the name, please? We follow this naming convention throughout the > fsnotify subsystem. > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR