Re: [RFC PATCH 4/9] User-space API for creating a supervisor-fd

Tingmao Wang <m@xxxxxxxxxx> · Wed, 26 Mar 2025 00:06:11 +0000

On 3/11/25 19:28, Mickaël Salaün wrote:
On Mon, Mar 10, 2025 at 12:41:28AM +0000, Tingmao Wang wrote:
On 3/5/25 16:09, Mickaël Salaün wrote:
On Tue, Mar 04, 2025 at 01:13:00AM +0000, Tingmao Wang wrote:
We allow the user to pass in an additional flag to landlock_create_ruleset
which will make the ruleset operate in "supervise" mode, with a supervisor
attached. We create additional space in the landlock_ruleset_attr
structure to pass the newly created supervisor fd back to user-space.

The intention, while not implemented yet, is that user space will read
events from this fd and write responses back to it.

Note: need to investigate if fd clone on fork() is handled correctly, but
should be fine if it shares the struct file. We might also want to let the
user customize the flags on this fd, so that they can request no
O_CLOEXEC.

NOTE: despite this patch having a new uapi, I'm still very open to e.g.
re-using fanotify stuff instead (if that makes sense in the end). This is
just a PoC.

The main security risk of this feature is for this FD to leak and be
used by a sandboxed process to bypass all its restrictions.  This should
be highlighted in the UAPI documentation.

In particular, if for some reason the supervisor does a fork without 
exec, it must close this fd in the "about-to-be-untrusted" child.

(I wonder if it would be worth enforcing that the child calling
landlock_restrict_self must not have any open supervisor fd that can
supervise its own domain (returning an error if it does), but that could
be difficult to implement, so never mind.)
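
To make the fork-without-exec case concrete, here is a minimal sketch of
a child dropping its copy of the fd before it would sandbox itself.  The
function name is made up for illustration, and any open descriptor (a
pipe end, say) can stand in for the supervisor fd, since the real fd type
does not exist yet.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that closes its copy of supervisor_fd
 * before it would go on to call landlock_restrict_self().  Returns 0 if
 * the child observed the fd as closed, -1 otherwise.
 */
int fork_and_drop_supervisor_fd(int supervisor_fd)
{
	pid_t pid;
	int status;

	assert(supervisor_fd >= 0);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: close first, before anything untrusted runs. */
		if (close(supervisor_fd) != 0)
			_exit(2);
		/* F_GETFD fails with EBADF once the descriptor is gone. */
		_exit(fcntl(supervisor_fd, F_GETFD) == -1 ? 0 : 1);
	}

	/* Parent: its own copy of the fd stays open. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

O_CLOEXEC does not help in this case because there is no exec, so the
close has to be explicit.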



Signed-off-by: Tingmao Wang <m@xxxxxxxxxx>
---
   include/uapi/linux/landlock.h |  10 ++++
   security/landlock/syscalls.c  | 102 +++++++++++++++++++++++++++++-----
   2 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index e1d2c27533b4..7bc1eb4859fb 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -50,6 +50,15 @@ struct landlock_ruleset_attr {
   	 * resources (e.g. IPCs).
   	 */
   	__u64 scoped;
+	/**
+	 * @supervisor_fd: Placeholder to store the supervisor file
+	 * descriptor when %LANDLOCK_CREATE_RULESET_SUPERVISE is set.
+	 */
+	__s32 supervisor_fd;

This interface would require the ruleset_attr to become updatable by the
kernel, which might be OK in theory but requires updating the current
syscall wrapper signature, see the sandboxer.c change.  It also creates
an FD which might not be useful (e.g. if an error occurs before the
actual enforcement).

I see a few alternatives.  We could just use/extend the ruleset FD
instead of creating a new one, but because leaking current rulesets is
not currently a security risk, we should be careful to not change that.

Another approach, similar to seccomp unotify, is to get a
"[landlock-domain]" FD returned by the landlock_restrict_self(2) when a
new LANDLOCK_RESTRICT_SELF_DOMAIN_FD flag is set.  This FD would be a
reference to the newly created domain, which is more specific than the
ruleset used to create this domain (and that can be used to create
other domains).  This domain FD could be used for introspection (i.e.
to get read-only properties such as domain ID), but being able to
directly supervise the referenced domain only with this FD would be a
risk that we should limit.

What we can do is to implement an IOCTL command for such domain FD that
would return a supervisor FD (if the LANDLOCK_RESTRICT_SELF_SUPERVISED
flag was also set).  The key point is to check (one time) that the
process calling this IOCTL is not restricted by the related domain (see
the scope helpers).

Is LANDLOCK_RESTRICT_SELF_DOMAIN_FD part of your (upcoming?) introspection
patch? (I'm wondering when someone would pass only that flag and not
LANDLOCK_RESTRICT_SELF_SUPERVISED, or vice versa.)

I don't plan to work on such LANDLOCK_RESTRICT_SELF_DOMAIN_FD flag for
now, but the introspection feature(s) would help for this supervisor
feature.


By the way, is it alright to conceptually relate the supervisor to a domain?
It really would be a layer inside a domain - the domain could have earlier
or later layers which can deny access without supervision, or the supervisor
for earlier layers can deny access first. Therefore having supervisor fd
coming out of the ruleset felt sensible to me at first.

Good question.  I've been using the name "domain" to refer to the set of
restrictions enforced on a set of processes, but these restrictions are
composed of inherited ones plus the latest layer.  In this case, a
domain FD should refer to all the restrictions, but the supervisor FD
should indeed only refer to the latest layer of a domain (created by
landlock_restrict_self).


Also, isn't "check that process calling this IOCTL is not restricted by the
related domain" and the fact that the IOCTL is on the domain fd, which is a
return value of landlock_restrict_self, kind of contradictory?  I mean it is
a sensible check, but that kind of highlights that this interface is
slightly awkward - basically all callers are forced to have a setup where
the child sends the domain fd back to the parent.

I agree that it's confusing.  I'd like to avoid letting the ruleset gain
any control over domains after they are created.

Another approach would be to create a supervisor FD with the
landlock_create_ruleset() syscall, and pass this FD to the ruleset,
potentially with landlock_add_rule() calls to only request this
supervisor when matching specific rules (that could potentially be
catch-all rules)?

Maybe passing in a fd per landlock_add_rule calls, and thus potentially 
allowing different supervisor fd tied to different rules in the same 
ruleset, is a bit overkill (as now each rule needs to store a supervisor 
pointer?) and I don't really see the use of it.  I think it would be 
better to just pass it once in the landlock_ruleset_attr, which gets
around the problem of the syscall signature taking a const ruleset_attr.
(I'm also open to the ioctl on domain fd idea, but I'm slightly wary of
making this more complicated than necessary for user space, as it
now has to set up a socket (?) and pass a fd with SCM_RIGHTS (?))
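
For reference, this is roughly the user-space plumbing the domain-fd
approach would need: a minimal sketch of passing one fd over a UNIX
socket with SCM_RIGHTS.  The helper names are made up, and a pipe end
stands in for the (not yet existing) domain fd; only the sendmsg/recvmsg
and CMSG_* calls are real API.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over the connected UNIX socket sock.  Returns 0 on success. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&ctrl, 0, sizeof(ctrl));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock.  Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Demo: pass a pipe's write end through a socketpair and use the copy. */
int fd_passing_roundtrip(void)
{
	int sv[2], p[2], got, ok;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return -1;
	if (pipe(p) != 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	got = send_fd(sv[0], p[1]) == 0 ? recv_fd(sv[1]) : -1;
	ok = got >= 0 && write(got, "x", 1) == 1 &&
	     read(p[0], &c, 1) == 1 && c == 'x';

	if (got >= 0)
		close(got);
	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

In the real setup the socketpair would be created before fork(), with the
child sending and the supervising parent receiving.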

The other aspect of this is whether we want to have the supervisor mark 
specific rules as supervised, rather than having all denied access (from 
this layer) result in a supervisor invocation.  I also don't think this 
is necessary, as denials are supposed to be "abnormal" in some sense, 
and I would imagine most supervisors would want to find out about these 
(at least to print/show a warning of some sort, if it knows that the 
requested access is bad).  If a supervisor really wants to have the 
kernel just "silently" (from its perspective, but maybe there would be 
audit logs) deny any access outside of some known rules, it can also 
create a nested, unsupervised landlock domain that has the right effect. 
Avoiding having some sort of tri-state rules would simplify 
implementation, I imagine.


Overall, my main concern about this patch series is that the supervisor
could get a lot of requests, which would make the sandbox unusable
because it would always be blocked on some thread/process.  This latest
approach and the ability to update the domain somehow could make it
workable.



Relying on IOCTL commands (for all these FD types) instead of read/write
operations should also limit the risk of these FDs being misused through
a confused deputy attack (because such IOCTL command would convey an
explicit intent):
https://docs.kernel.org/security/credentials.html#open-file-credentials
https://lore.kernel.org/all/CAG48ez0HW-nScxn4G5p8UHtYy=T435ZkF3Tb1ARTyyijt_cNEg@xxxxxxxxxxxxxx/
We should get inspiration from seccomp unotify for this too:
https://lore.kernel.org/all/20181209182414.30862-1-tycho@xxxxxxxx/

I think in the seccomp unotify case the problem arises from what the setuid
binary thinks is just normal data getting interpreted by the kernel as a fd,
and thus having different effect if the attacker writes it vs. if the suid
app writes it.  In our case I *think* we should be alright, but maybe we
should go with ioctl anyway...

I don't see why Jann's attack scenario could work for this Landlock
supervisor too.  The main point is that the read/write interfaces are
used by a lot of different FDs, and we may not need them.

However, how does using netlink messages (a
suggestion from a different thread) affect this (if we do end up using it)?
Would we have to do netlink msgs via IOCTL?

Because all requests should be synchronous, one IOCTL could be used to
both acknowledge a previous event (or just start) and read the next one.

I was thinking about an IOCTL with these arguments:
1. supervisor FD
2. (extensible) IOCTL command (see PIDFD_GET_INFO for instance)
3. pointer to a fixed-size control structure

The fixed-size control structure could contain:
- handled access rights, used to only get events related to specific
   access rights.
- flags, to specify which kind of FD we would like to get (e.g. only
   directory FD, pidfd...)
- fd[6]: an array of received file descriptors.
- pointer to a variable-size data buffer that would contain all the
   records (e.g. source dir FD, source file name, destination dir FD,
   destination file name) for one event, potentially formatted with NLA.
- the size of this buffer

I'm not sure about the content of this buffer and the NLA format, and
the related API might not be usable without netlink sockets though.
Taking inspiration from the fanotify message format is another option.
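
To make the field list above easier to discuss, here is one possible
layout as a sketch only.  Every name in it (struct ll_supervise_ctl, its
members, the init helper) is hypothetical and nothing here is a proposed
UAPI; the field widths and ordering are my assumptions, chosen so that
the struct has no implicit padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LL_SUPERVISE_MAX_FDS 6 /* hypothetical, mirrors "fd[6]" above */

/* Hypothetical fixed-size control structure passed by pointer to the IOCTL. */
struct ll_supervise_ctl {
	uint64_t handled_access;           /* only report these access rights */
	uint32_t flags;                    /* which kinds of FD to receive */
	int32_t fd[LL_SUPERVISE_MAX_FDS];  /* received file descriptors */
	uint32_t data_size;                /* size of the record buffer */
	uint64_t data_ptr;                 /* user pointer to the record buffer */
};

/* 8 + 4 + 6*4 + 4 + 8 bytes, with no holes between members. */
static_assert(sizeof(struct ll_supervise_ctl) == 48,
	      "control structure must stay fixed-size");

/* Zero the structure, mark all fd slots unused, and attach the buffer. */
void ll_supervise_ctl_init(struct ll_supervise_ctl *ctl,
			   uint64_t handled_access, void *buf,
			   uint32_t buf_size)
{
	int i;

	memset(ctl, 0, sizeof(*ctl));
	ctl->handled_access = handled_access;
	for (i = 0; i < LL_SUPERVISE_MAX_FDS; i++)
		ctl->fd[i] = -1;
	ctl->data_ptr = (uint64_t)(uintptr_t)buf;
	ctl->data_size = buf_size;
}
```

Carrying the buffer pointer as a u64 keeps the layout identical for
32-bit and 64-bit callers, which is a common choice but again only an
assumption here.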



+	/**
+	 * @pad: Unused, must be zero.
+	 */
+	__u32 pad;

In this case we should pack the struct instead.
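
A small sketch of the size difference, using illustrative copies of the
struct rather than the real UAPI definition (the struct names are made
up, and __attribute__((packed)) assumes GCC or Clang):

```c
#include <assert.h>
#include <stdint.h>

/* Variant from the patch: explicit trailing pad member. */
struct attr_with_pad {
	uint64_t handled_access_fs;
	uint64_t handled_access_net;
	uint64_t scoped;
	int32_t supervisor_fd;
	uint32_t pad;
};

/* Packed variant: no pad member and no implicit tail padding. */
struct attr_packed {
	uint64_t handled_access_fs;
	uint64_t handled_access_net;
	uint64_t scoped;
	int32_t supervisor_fd;
} __attribute__((packed));

static_assert(sizeof(struct attr_with_pad) == 32, "3*8 + 4 + 4");
static_assert(sizeof(struct attr_packed) == 28, "3*8 + 4");
```

With the packed variant the structure size ends right after
supervisor_fd, so there is no extra member that has to be checked for
zero.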

   };
   /*
@@ -60,6 +69,7 @@ struct landlock_ruleset_attr {
    */
   /* clang-format off */
   #define LANDLOCK_CREATE_RULESET_VERSION			(1U << 0)
+#define LANDLOCK_CREATE_RULESET_SUPERVISE		(1U << 1)
   /* clang-format on */
   /**

[...]