Re: [PATCH v2] fanotify: add watchdog for permission events

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 11 Sep 2025 13:25:44 +0200

On Thu, Sep 11, 2025 at 12:12 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 09-09-25 16:30:47, Miklos Szeredi wrote:
> > This is to make it easier to debug issues with AV software, which time and
> > again deadlocks with no indication of where the issue comes from, and the
> > kernel being blamed for the deadlock.  Then we need to analyze dumps to
> > prove that the kernel is not in fact at fault.
> >
> > The deadlock comes from recursion: handling the event triggers another
> > permission event, in some roundabout way, obviously, otherwise it would
> > have been found in testing.
> >
> > With this patch a warning is printed when permission event is received by
> > userspace but not answered for more than the timeout specified in
> > /proc/sys/fs/fanotify/watchdog_timeout.  The watchdog can be turned off by
> > setting the timeout to zero (which is the default).
> >
> > The timeout is very coarse (T <= t < 2T) but I guess it's good enough for
> > the purpose.
> >
> > Overhead should be minimal.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
>
> Overall looks good. Just some nits below, I'll fix them up on commit if you
> won't object.

No comments beyond what you wrote.

Feel free to add:
Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>

>
> > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> > index b78308975082..1a007e211bae 100644
> > --- a/fs/notify/fanotify/fanotify.h
> > +++ b/fs/notify/fanotify/fanotify.h
> > @@ -437,11 +437,13 @@ FANOTIFY_ME(struct fanotify_event *event)
> >  struct fanotify_perm_event {
> >       struct fanotify_event fae;
> >       struct path path;
> > -     const loff_t *ppos;             /* optional file range info */
> > +     const loff_t *ppos;     /* optional file range info */
>
> Stray modification.
>
> >       size_t count;
> >       u32 response;                   /* userspace answer to the event */
> >       unsigned short state;           /* state of the event */
> > +     unsigned short watchdog_cnt;    /* already scanned by watchdog? */
> >       int fd;         /* fd we passed to userspace for this event */
> > +     pid_t recv_pid; /* pid of task receiving the event */
> >       union {
> >               struct fanotify_response_info_header hdr;
> >               struct fanotify_response_info_audit_rule audit_rule;
> ...
> > @@ -95,6 +104,84 @@ static void __init fanotify_sysctls_init(void)
> >  #define fanotify_sysctls_init() do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >
> > +static LIST_HEAD(perm_group_list);
> > +static DEFINE_SPINLOCK(perm_group_lock);
> > +static void perm_group_watchdog(struct work_struct *work);
> > +static DECLARE_DELAYED_WORK(perm_group_work, perm_group_watchdog);
> > +
> > +static void perm_group_watchdog_schedule(void)
> > +{
> > +     schedule_delayed_work(&perm_group_work, secs_to_jiffies(perm_group_timeout));
> > +}
> > +
> > +static void perm_group_watchdog(struct work_struct *work)
> > +{
> > +     struct fsnotify_group *group;
> > +     struct fanotify_perm_event *event;
> > +     struct task_struct *task;
> > +     pid_t failed_pid = 0;
> > +
> > +     guard(spinlock)(&perm_group_lock);
> > +     if (list_empty(&perm_group_list))
> > +             return;
> > +
> > +     list_for_each_entry(group, &perm_group_list, fanotify_data.perm_group) {
> > +             /*
> > +              * Ok to test without lock, racing with an addition is
> > +              * fine, will deal with it next round
> > +              */
> > +             if (list_empty(&group->fanotify_data.access_list))
> > +                     continue;
> > +
> > +             scoped_guard(spinlock, &group->notification_lock) {
>
> Frankly, I don't see the scoped guard bringing benefit here. It just shifts
> indentation level by 1 which makes some of the lines below longer than I
> like :)
>
> > +                     list_for_each_entry(event, &group->fanotify_data.access_list, fae.fse.list) {
> > +                             if (likely(event->watchdog_cnt == 0)) {
> > +                                     event->watchdog_cnt = 1;
> > +                             } else if (event->watchdog_cnt == 1) {
> > +                                     /* Report on event only once */
> > +                                     event->watchdog_cnt = 2;
> > +
> > +                                     /* Do not report same pid repeatedly */
> > +                                     if (event->recv_pid == failed_pid)
> > +                                             continue;
> > +
> > +                                     failed_pid = event->recv_pid;
> > +                                     rcu_read_lock();
> > +                                     task = find_task_by_pid_ns(event->recv_pid, &init_pid_ns);
> > +                                     pr_warn_ratelimited("PID %u (%s) failed to respond to fanotify queue for more than %i seconds\n",
>
> Use %d instead of %i here? IMHO we use %d everywhere in the kernel. I had
> to look up whether %i is really signed int.
>
> > +                                                         event->recv_pid, task ? task->comm : NULL, perm_group_timeout);
> > +                                     rcu_read_unlock();
> > +                             }
> > +                     }
>
> I'm wondering if we should cond_resched() somewhere in these loops. There
> could be *many* events pending... OTOH continuing the iteration afterwards
> would be non-trivial so probably let's keep our fingers crossed that
> softlockups won't trigger...
>
> > +             }
> > +     }
> > +     perm_group_watchdog_schedule();
> > +}
> > +
> > +static void fanotify_perm_watchdog_group_remove(struct fsnotify_group *group)
> > +{
> > +     if (!list_empty(&group->fanotify_data.perm_group)) {
> > +             /* Perm event watchdog can no longer scan this group. */
> > +             spin_lock(&perm_group_lock);
> > +             list_del(&group->fanotify_data.perm_group);
>
> list_del_init() here would give me a better peace of mind... It's not like
> the performance matters here.
>
> > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> > index d4034ddaf392..7f7fe4f3aa34 100644
> > --- a/include/linux/fsnotify_backend.h
> > +++ b/include/linux/fsnotify_backend.h
> > @@ -273,6 +273,8 @@ struct fsnotify_group {
> >                       int f_flags; /* event_f_flags from fanotify_init() */
> >                       struct ucounts *ucounts;
> >                       mempool_t error_events_pool;
> > +                     /* chained on perm_group_list */
> > +                     struct list_head perm_group;
>
> Can we call this perm_group_list, perm_list or simply something with 'list'
> in the name, please? We follow this naming convention throughout the
> fsnotify subsystem.
>
>                                                                 Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR