Re: Reseting pending fanotify events

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 4 Apr 2025 12:10:30 +0200

On Thu, Apr 3, 2025 at 8:04 PM Ibrahim Jirdeh <ibrahimjirdeh@xxxxxxxx> wrote:
>
> > Let me list a few approaches to this problem that were floated in the past.
> > You may choose bits and parts that you find useful to your use case.
>
> Thanks for sharing these approaches for smoothly recovering pending events.
>
> > 3. Change the default response to pending events on group fd close
> > Support writing a response with
> > .fd = FAN_NOFD
> > .response = FAN_DENY | FAN_DEFAULT
> > to set a group parameter fanotify_data.default_response.
>
> > Instead of setting pending events response to FAN_ALLOW,
> > could set it to FAN_DENY, or to descriptive error like
> > FAN_DENY(ECONNRESET).
>
>
> I think that the approach of customizing group close behavior would likely
> address the problem of pending events in case of daemon restart / crash
> encountered by our use case. It gives us the same guarantee of clearing
> out pending event queue that we wanted while preventing any access of
> unpopulated content. The one ask related to this approach would be around
> the handover from old to new group fd. Would it be possible to provide an easy
> way to initialize one group from another (ie an fanotify_mark option).
> In our case we have an interested mount as well as a set of ignore marks
> for populated files to avoid regenerating events for.
>

I think this case would be better handled by handing over the old fd to the
new server instance.

1. Start a new server instance
2. Set default response in case of new instance crash
3. Hand over a ref of the existing group fd to the new instance if the
old instance is running
4. Start handling events in new instance (*)
5. Stop handling new events in old instance, but complete pending events
6. Shutdown old instance

Is there some problem with that approach that I do not see?

(*) You have a multitude of choices on how to collaborate the
handover of event handling responsibilities from old to new server.
The handover can either be over a strong ordering barrier -
new start handling only after old completes pending,
or weaker ordering - old can start handling new events while
old is completing pending events.
In this case, you'd need some synchronization at file level (e.g. flock),
so the two instances will not try to populate the same file.

> The moderated mount functionality discussed in this thread would also be helpful
> for better handling when the daemon is down.

The terminology "moderated mount" freaks out some vfs maintainers ;)
So let me clarify what I think *might* be reasonable.
This untested sketch of a patch below demonstrates how we can use existing
fsnotify data structures to set up and implement a "moderated" sb.

There is currently no existing fanotify API to set this up, but it should
not be hard to implement such an API.

For example:

fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM | FAN_MARK_DEFAULT, ...

I might have had some patches similar to this floating around.
If you are interested in this feature, I could write and test a proper patch.

Doing this for the mount level would be possible, but TBH, it does not look
like the right object to be setting the moderation on, because even if we did
set a default mask on a mount, it would have been easy to escape it by
creating a bind mount, cloning a new mount namespace, etc.

What is the reason that you are marking the mount?
Is it so that you could have another "unmoderated" mount to
populate the file conten?
In that case you can opt-in for permission events on sb
and opt-out from permission events on the "unmoderated" mount
and you can also populate the file content with the FMODE_NONOTIFY
fd provided in the permission event.

WDYT?

Thanks,
Amir.

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7b364f965650..9fc1235bbc47 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -585,7 +585,8 @@ int fsnotify(__u32 mask, const void *data, int
data_type, struct inode *dir,
         * SRCU because we have no references to any objects and do not
         * need SRCU to keep them "alive".
         */
-       if ((!sbinfo || !sbinfo->sb_marks) &&
+       if ((!sbinfo || (!sbinfo->sb_marks &&
+                        !READ_ONCE(sbinfo->default_mask))) &&
            (!mnt || !mnt->mnt_fsnotify_marks) &&
            (!inode || !inode->i_fsnotify_marks) &&
            (!inode2 || !inode2->i_fsnotify_marks) &&
@@ -641,6 +642,7 @@ int fsnotify(__u32 mask, const void *data, int
data_type, struct inode *dir,
         * ignore masks are properly reflected for mount/sb mark notifications.
         * That's why this traversal is so complicated...
         */
+       ret = 1;
        while (fsnotify_iter_select_report_types(&iter_info)) {
                ret = send_to_group(mask, data, data_type, dir, file_name,
                                    cookie, &iter_info);
@@ -650,6 +652,21 @@ int fsnotify(__u32 mask, const void *data, int
data_type, struct inode *dir,

                fsnotify_iter_next(&iter_info);
        }
+
+       /*
+        * The sb default mask has permission events and there currently no
+        * groups with marks handling permission events for this object.
+        * That could mean that an "access modertating" service was stopped
+        * or died without the chance or desire to allow sb access.
+        * Err on the side of caution and deny access until another access
+        * moderating service has started.
+        */
+       if (ret > 0 && (mask & ALL_FSNOTIFY_PERM_EVENTS) &&
+           sbinfo && (mask & READ_ONCE(sbinfo->default_mask))) {
+               ret = -EPERM;
+               goto out;
+       }
+
        ret = 0;
 out:
        srcu_read_unlock(&fsnotify_mark_srcu, iter_info.srcu_idx);
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 798340db69d7..317e21581e0a 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -255,6 +255,14 @@ static void *__fsnotify_recalc_mask(struct
fsnotify_mark_connector *conn)
                    !(mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
                        want_iref = true;
        }
+
+       if (conn->type == FSNOTIFY_OBJ_TYPE_SB) {
+               struct fsnotify_sb_info *sbinfo = fsnotify_sb_info(sb);
+
+               if (sbinfo)
+                       new_mask |= sbinfo->default_mask;
+       }
+
        /*
         * We use WRITE_ONCE() to prevent silly compiler optimizations from
         * confusing readers not holding conn->lock with partial updates.
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 396943093373..476b506d6b4a 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -560,6 +560,7 @@ struct fsnotify_mark_connector {
  */
 struct fsnotify_sb_info {
        struct fsnotify_mark_connector __rcu *sb_marks;
+       __u32 default_mask;
        /*
         * Number of inode/mount/sb objects that are being watched in this sb.
         * Note that inodes objects are currently double-accounted.