On Tue 09-09-25 16:30:47, Miklos Szeredi wrote: > This is to make it easier to debug issues with AV software, which time and > again deadlocks with no indication of where the issue comes from, and the > kernel being blamed for the deadlock. Then we need to analyze dumps to > prove that the kernel is not in fact at fault. > > The deadlock comes from recursion: handling the event triggers another > permission event, in some roundabout way, obviously, otherwise it would > have been found in testing. > > With this patch a warning is printed when permission event is received by > userspace but not answered for more than the timeout specified in > /proc/sys/fs/fanotify/watchdog_timeout. The watchdog can be turned off by > setting the timeout to zero (which is the default). > > The timeout is very coarse (T <= t < 2T) but I guess it's good enough for > the purpose. > > Overhead should be minimal. > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx> Overall looks good. Just some nits below, I'll fix them up on commit if you won't object. > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h > index b78308975082..1a007e211bae 100644 > --- a/fs/notify/fanotify/fanotify.h > +++ b/fs/notify/fanotify/fanotify.h > @@ -437,11 +437,13 @@ FANOTIFY_ME(struct fanotify_event *event) > struct fanotify_perm_event { > struct fanotify_event fae; > struct path path; > - const loff_t *ppos; /* optional file range info */ > + const loff_t *ppos; /* optional file range info */ Stray modification. > size_t count; > u32 response; /* userspace answer to the event */ > unsigned short state; /* state of the event */ > + unsigned short watchdog_cnt; /* already scanned by watchdog? */ > int fd; /* fd we passed to userspace for this event */ > + pid_t recv_pid; /* pid of task receiving the event */ > union { > struct fanotify_response_info_header hdr; > struct fanotify_response_info_audit_rule audit_rule; ... > @@ -95,6 +104,84 @@ static void __init fanotify_sysctls_init(void) > #define fanotify_sysctls_init() do { } while (0) > #endif /* CONFIG_SYSCTL */ > > +static LIST_HEAD(perm_group_list); > +static DEFINE_SPINLOCK(perm_group_lock); > +static void perm_group_watchdog(struct work_struct *work); > +static DECLARE_DELAYED_WORK(perm_group_work, perm_group_watchdog); > + > +static void perm_group_watchdog_schedule(void) > +{ > + schedule_delayed_work(&perm_group_work, secs_to_jiffies(perm_group_timeout)); > +} > + > +static void perm_group_watchdog(struct work_struct *work) > +{ > + struct fsnotify_group *group; > + struct fanotify_perm_event *event; > + struct task_struct *task; > + pid_t failed_pid = 0; > + > + guard(spinlock)(&perm_group_lock); > + if (list_empty(&perm_group_list)) > + return; > + > + list_for_each_entry(group, &perm_group_list, fanotify_data.perm_group) { > + /* > + * Ok to test without lock, racing with an addition is > + * fine, will deal with it next round > + */ > + if (list_empty(&group->fanotify_data.access_list)) > + continue; > + > + scoped_guard(spinlock, &group->notification_lock) { Frankly, I don't see the scoped guard bringing benefit here. It just shifts indentation level by 1 which makes some of the lines below longer than I like :) > + list_for_each_entry(event, &group->fanotify_data.access_list, fae.fse.list) { > + if (likely(event->watchdog_cnt == 0)) { > + event->watchdog_cnt = 1; > + } else if (event->watchdog_cnt == 1) { > + /* Report on event only once */ > + event->watchdog_cnt = 2; > + > + /* Do not report same pid repeatedly */ > + if (event->recv_pid == failed_pid) > + continue; > + > + failed_pid = event->recv_pid; > + rcu_read_lock(); > + task = find_task_by_pid_ns(event->recv_pid, &init_pid_ns); > + pr_warn_ratelimited("PID %u (%s) failed to respond to fanotify queue for more than %i seconds\n", Use %d instead of %i here? IMHO we use %d everywhere in the kernel. I had to look up whether %i is really signed int. > + event->recv_pid, task ? task->comm : NULL, perm_group_timeout); > + rcu_read_unlock(); > + } > + } I'm wondering if we should cond_resched() somewhere in these loops. There could be *many* events pending... OTOH continuing the iteration afterwards would be non-trivial so probably let's keep our fingers crossed that softlockups won't trigger... > + } > + } > + perm_group_watchdog_schedule(); > +} > + > +static void fanotify_perm_watchdog_group_remove(struct fsnotify_group *group) > +{ > + if (!list_empty(&group->fanotify_data.perm_group)) { > + /* Perm event watchdog can no longer scan this group. */ > + spin_lock(&perm_group_lock); > + list_del(&group->fanotify_data.perm_group); list_del_init() here would give me a better peace of mind... It's not like the performance matters here. > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > index d4034ddaf392..7f7fe4f3aa34 100644 > --- a/include/linux/fsnotify_backend.h > +++ b/include/linux/fsnotify_backend.h > @@ -273,6 +273,8 @@ struct fsnotify_group { > int f_flags; /* event_f_flags from fanotify_init() */ > struct ucounts *ucounts; > mempool_t error_events_pool; > + /* chained on perm_group_list */ > + struct list_head perm_group; Can we call this perm_group_list, perm_list or simply something with 'list' in the name, please? We follow this naming convention throughout the fsnotify subsystem. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR