Re: [RFC][BUG] ns_mkdir_op() locking is FUBAR

John Johansen <john@xxxxxxxx> · Tue, 24 Jun 2025 10:25:39 -0700

On 6/23/25 15:23, Al Viro wrote:
On Mon, Jun 23, 2025 at 10:37:47PM +0100, Al Viro wrote:

Could you explain what exclusion are you trying to get there?
The mechanism is currently broken, but what is it trying to achieve?
While we are at it:

root@kvm1:~# cd /sys/kernel/security/apparmor/policy
root@kvm1:/sys/kernel/security/apparmor/policy# (for i in `seq 270`; do mkdir namespaces/$i; cd namespaces/$i; done)
root@kvm1:/sys/kernel/security/apparmor/policy# rmdir namespaces/1
[   40.980453] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[   40.980457] CPU: 3 UID: 0 PID: 2223 Comm: rmdir Not tainted 6.12.27-amd64 #11
[   40.980459] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
[   40.980460] RIP: 0010:inode_set_ctime_current+0x2c/0x100
[   40.980490] Code: 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 31 db 48 8f
[   40.980491] RSP: 0018:ffffc1cbc2cfbff8 EFLAGS: 00010292
[   40.980493] RAX: 0000000000400000 RBX: 0000000000000000 RCX: ffff9dbcc358ac70
[   40.980494] RDX: 0000000000000001 RSI: ffff9dbcc48c0300 RDI: ffffc1cbc2cfbff8
[   40.980495] RBP: ffffc1cbc2cfc028 R08: 0000000000000000 R09: ffffffffa484c6c0
[   40.980495] R10: ffff9dbcc0729cc0 R11: 0000000000000002 R12: ffff9dbcc4a75b28
[   40.980496] R13: ffff9dbcc4a75b28 R14: ffff9dbcc01fe600 R15: ffff9dbcc51a9e00
[   40.980498] FS:  00007ffb70ea4740(0000) GS:ffff9dbfefd80000(0000) knlGS:00000
[   40.980499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.980499] CR2: ffffc1cbc2cfbfe8 CR3: 000000010619a000 CR4: 00000000000006f0
[   40.980501] Call Trace:
[   40.980510]  <TASK>
[   40.980513]  simple_unlink+0x24/0x50
[   40.980526]  aafs_remove+0x9a/0xb0
[   40.980543]  __aafs_ns_rmdir+0x2ec/0x3b0
[   40.980548]  destroy_ns.part.0+0x9f/0xc0
[   40.980558]  __aa_remove_ns+0x44/0x90
[   40.980560]  destroy_ns.part.0+0x40/0xc0
[   40.980562]  __aa_remove_ns+0x44/0x90
[   40.980563]  destroy_ns.part.0+0x40/0xc0
.....
[   40.981324]  ns_rmdir_op+0x189/0x300
[   40.981327]  vfs_rmdir+0x9b/0x200
[   40.981335]  do_rmdir+0x1ac/0x1c0
[   40.981340]  __x64_sys_rmdir+0x3f/0x70
[   40.981342]  do_syscall_64+0x82/0x190
[   40.981360]  ? do_fault+0x31a/0x550
[   40.981372]  ? __handle_mm_fault+0x7c2/0xf70
[   40.981373]  ? syscall_exit_to_user_mode_prepare+0x149/0x170
[   40.981388]  ? __count_memcg_events+0x53/0xf0
[   40.981392]  ? count_memcg_events.constprop.0+0x1a/0x30
[   40.981394]  ? handle_mm_fault+0x1bb/0x2c0
[   40.981396]  ? do_user_addr_fault+0x36c/0x620
[   40.981408]  ? exc_page_fault+0x7e/0x180
[   40.981412]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
.....
[   40.981486] Kernel panic - not syncing: Fatal exception in interrupt

I realize that anyone who can play with apparmor config can screw the
box into the ground in a lot of ways, but... when you have a recursion
kernel-side, it would be nice to have its depth bounded.  Not even root
should be able to panic the box with a single call of rmdir(2)...

Indeed, we can cap this something about ~3x what realistically we would see with

nesting of user namespaces. Some where between between 8-16 deep should be enough.

Something like

diff --git a/security/apparmor/include/policy_ns.h b/security/apparmor/include/policy_ns.h
index d646070fd966..081a0d5988d4 100644
--- a/security/apparmor/include/policy_ns.h
+++ b/security/apparmor/include/policy_ns.h
@@ -19,6 +19,9 @@
 #include "policy.h"


+/* maximum nesting of policy namespaces */
+#define MAX_NS_DEPTH 8
+
 /* struct aa_ns_acct - accounting of profiles in namespace
  * @max_size: maximum space allowed for all profiles in namespace
  * @max_count: maximum number of profiles that can be in this namespace
diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index 64783ca3b0f2..89a6fecfd39a 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -223,6 +223,9 @@ static struct aa_ns *__aa_create_ns(struct aa_ns *parent, const char *name,
     AA_BUG(!name);
     AA_BUG(!mutex_is_locked(&parent->lock));

+    if (ns->level >= MAX_NS_LEVEL)
+        return ERR_PTR(-EPERM);
+
     ns = alloc_ns(parent->base.hname, name);
     if (!ns)
         return ERR_PTR(-ENOMEM);

would do for the creation side, which should be enough. But We could also

throw in a bug check and bail against the the ns->level on the rmdir as well.