Re: [PATCH v2] fanotify: add watchdog for permission events

Jan Kara <jack@xxxxxxx> · Thu, 11 Sep 2025 12:12:00 +0200

On Tue 09-09-25 16:30:47, Miklos Szeredi wrote:
> This is to make it easier to debug issues with AV software, which time and
> again deadlocks with no indication of where the issue comes from, and the
> kernel being blamed for the deadlock.  Then we need to analyze dumps to
> prove that the kernel is not in fact at fault.
> 
> The deadlock comes from recursion: handling the event triggers another
> permission event, in some roundabout way, obviously, otherwise it would
> have been found in testing.
> 
> With this patch a warning is printed when permission event is received by
> userspace but not answered for more than the timeout specified in
> /proc/sys/fs/fanotify/watchdog_timeout.  The watchdog can be turned off by
> setting the timeout to zero (which is the default).
> 
> The timeout is very coarse (T <= t < 2T) but I guess it's good enough for
> the purpose.
> 
> Overhead should be minimal.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>

Overall looks good. Just some nits below, I'll fix them up on commit if you
won't object.

> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index b78308975082..1a007e211bae 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -437,11 +437,13 @@ FANOTIFY_ME(struct fanotify_event *event)
>  struct fanotify_perm_event {
>  	struct fanotify_event fae;
>  	struct path path;
> -	const loff_t *ppos;		/* optional file range info */
> +	const loff_t *ppos;	/* optional file range info */

Stray modification.

>  	size_t count;
>  	u32 response;			/* userspace answer to the event */
>  	unsigned short state;		/* state of the event */
> +	unsigned short watchdog_cnt;	/* already scanned by watchdog? */
>  	int fd;		/* fd we passed to userspace for this event */
> +	pid_t recv_pid;	/* pid of task receiving the event */
>  	union {
>  		struct fanotify_response_info_header hdr;
>  		struct fanotify_response_info_audit_rule audit_rule;
...
> @@ -95,6 +104,84 @@ static void __init fanotify_sysctls_init(void)
>  #define fanotify_sysctls_init() do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>  
> +static LIST_HEAD(perm_group_list);
> +static DEFINE_SPINLOCK(perm_group_lock);
> +static void perm_group_watchdog(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(perm_group_work, perm_group_watchdog);
> +
> +static void perm_group_watchdog_schedule(void)
> +{
> +	schedule_delayed_work(&perm_group_work, secs_to_jiffies(perm_group_timeout));
> +}
> +
> +static void perm_group_watchdog(struct work_struct *work)
> +{
> +	struct fsnotify_group *group;
> +	struct fanotify_perm_event *event;
> +	struct task_struct *task;
> +	pid_t failed_pid = 0;
> +
> +	guard(spinlock)(&perm_group_lock);
> +	if (list_empty(&perm_group_list))
> +		return;
> +
> +	list_for_each_entry(group, &perm_group_list, fanotify_data.perm_group) {
> +		/*
> +		 * Ok to test without lock, racing with an addition is
> +		 * fine, will deal with it next round
> +		 */
> +		if (list_empty(&group->fanotify_data.access_list))
> +			continue;
> +
> +		scoped_guard(spinlock, &group->notification_lock) {

Frankly, I don't see the scoped guard bringing benefit here. It just shifts
indentation level by 1 which makes some of the lines below longer than I
like :)

> +			list_for_each_entry(event, &group->fanotify_data.access_list, fae.fse.list) {
> +				if (likely(event->watchdog_cnt == 0)) {
> +					event->watchdog_cnt = 1;
> +				} else if (event->watchdog_cnt == 1) {
> +					/* Report on event only once */
> +					event->watchdog_cnt = 2;
> +
> +					/* Do not report same pid repeatedly */
> +					if (event->recv_pid == failed_pid)
> +						continue;
> +
> +					failed_pid = event->recv_pid;
> +					rcu_read_lock();
> +					task = find_task_by_pid_ns(event->recv_pid, &init_pid_ns);
> +					pr_warn_ratelimited("PID %u (%s) failed to respond to fanotify queue for more than %i seconds\n",

Use %d instead of %i here? IMHO we use %d everywhere in the kernel. I had
to look up whether %i is really signed int.

> +							    event->recv_pid, task ? task->comm : NULL, perm_group_timeout);
> +					rcu_read_unlock();
> +				}
> +			}

I'm wondering if we should cond_resched() somewhere in these loops. There
could be *many* events pending... OTOH continuing the iteration afterwards
would be non-trivial so probably let's keep our fingers crossed that
softlockups won't trigger...

> +		}
> +	}
> +	perm_group_watchdog_schedule();
> +}
> +
> +static void fanotify_perm_watchdog_group_remove(struct fsnotify_group *group)
> +{
> +	if (!list_empty(&group->fanotify_data.perm_group)) {
> +		/* Perm event watchdog can no longer scan this group. */
> +		spin_lock(&perm_group_lock);
> +		list_del(&group->fanotify_data.perm_group);

list_del_init() here would give me a better peace of mind... It's not like
the performance matters here.

> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index d4034ddaf392..7f7fe4f3aa34 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -273,6 +273,8 @@ struct fsnotify_group {
>  			int f_flags; /* event_f_flags from fanotify_init() */
>  			struct ucounts *ucounts;
>  			mempool_t error_events_pool;
> +			/* chained on perm_group_list */
> +			struct list_head perm_group;

Can we call this perm_group_list, perm_list or simply something with 'list'
in the name, please? We follow this naming convention throughout the
fsnotify subsystem.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR