Hi, we're experiencing the same issue with a number of NixOS tests that
are heavy in operations copying from the v9fs-mounted nix store.

> On 13. Jun 2025, at 00:24, Ryan Lahfa <ryan@xxxxxxxxx> wrote:
>
> Hi everyone,
>
> On Wed, Oct 23, 2024 at 09:38:39PM +0200, Antony Antony wrote:
>> On Wed, Oct 23, 2024 at 11:07:05 +0100, David Howells wrote:
>>> Hi Antony,
>>>
>>> I think the attached should fix it properly rather than working around it as
>>> the previous patch did. If you could give it a whirl?
>>
>> Yes this also fix the crash.
>>
>> Tested-by: Antony Antony <antony.antony@xxxxxxxxxxx>
>
> I cannot confirm this fixes the crash for me. My reproducer is slightly
> more complicated than Max's original one, albeit, still on NixOS and
> probably uses 9p more intensively than the automated NixOS testings
> workload.
>
> Here is how to reproduce it:
>
> $ git clone https://gerrit.lix.systems/lix
> $ cd lix
> $ git fetch https://gerrit.lix.systems/lix refs/changes/29/3329/8 && git checkout FETCH_HEAD
> $ nix-build -A hydraJobs.tests.local-releng
>
> I suspect the reason for why Antony considers the crash to be fixed is
> that the workload used to test it requires a significant amount of
> chance and retries to trigger the bug.
>
> On my end, you can see our CI showing the symptoms:
> https://buildkite.com/organizations/lix-project/pipelines/lix/builds/2357/jobs/019761e7-784e-4790-8c1b-f609270d9d19/log.
>
> We retried probably hundreds of times and saw different corruption
> patterns, Python getting confused, ld.so getting confused, systemd
> sometimes too. Python had a much higher chance of crashing in many of
> our tests. We reproduced it over aarch64-linux (Ampere Altra Q80-30) but
> also Intel and AMD CPUs (~5 different systems).

Yeah. We're on AMD CPUs, and for us it wasn't bound to specific hardware
either. The errors we saw were:

- malloc(): unaligned tcache chunk detected
- segfaulting Java processes
- misbehaving filesystems (errors about internal structures in ext4,
  incorrect file content in xfs)
- crashing kernels when dealing with the fallout of those errors

> As soon as we reverted to Linux 6.6 series, the bug went away.

Same here, the other way around: we came from 6.6.94, updated to
6.12.34, and immediately saw a number of tests failing, all of which
were heavy in copying data from v9fs to the root filesystem in the VM.

> We bisected but we started to have weirder problems, this is because we
> encountered the original regression mentioned in October 2024 and for a
> certain range of commits, we were unable to bisect anything further.

I had already found last October's issue when I started bisecting, and
later got in touch with Ryan, who recognized that we were chasing the
same issue. I stopped bisecting at that point - the bisect was already
homing in on the changes from last October anyway.

> So I switched my bisection strategy to understand when the bug was
> fixed, this lead me on the commit
> e65a0dc1cabe71b91ef5603e5814359451b74ca7 which is the proper fix
> mentioned here and on this discussion.
>
> Reverting this on the top of 6.12 cause indeed a massive amount of
> traces, see this gist [1] for examples.

Yeah. During the bisect I noticed it flapping around, with the original
October issues crashing immediately during boot.

> Applying the "workaround patch" aka "[PATCH] 9p: Don't revert the I/O
> iterator after reading" after reverting e65a0dc1cabe makes the problem
> go away after 5 tries (5 tries were sufficient to trigger with the
> proper fix).

Yup, I applied the revert and workaround patch on top of 6.12.34 and the
reliably broken test became reliably green again.
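For reference, this is roughly what the combination looked like on our
side on top of a 6.12.34 tree - just a sketch, with the patch file name
standing in for however you saved the workaround patch from this thread:

$ # revert the "proper fix", then apply the workaround on top
$ git revert --no-edit e65a0dc1cabe71b91ef5603e5814359451b74ca7
$ git am 9p-dont-revert-io-iterator.patch  # placeholder file name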
Our test can be reproduced, too:

$ git clone https://github.com/flyingcircusio/fc-nixos.git
$ cd fc-nixos
$ eval $(./dev-setup)
$ nix-build tests/matomo.nix

The test will fail with ext4 complaining something like this:

machine # [ 42.596728] vn2haz1283lxz6iy0rai850a7jlgxbja-matomo-setup-update-pre[1233]: Copied files, updating package link in /var/lib/matomo/current-package.
machine # [ 42.788956] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm setfacl: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 42.958590] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm chown: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 43.068003] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm chmod: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 43.004098] vn2haz1283lxz6iy0rai850a7jlgxbja-matomo-setup-update-pre[1233]: Giving matomo read+write access to /var/lib/matomo/share/matomo.js, /var/lib/matomo/share/piwik.js, /var/lib/matomo/share/config, /var/lib/matomo/share/misc/user, /var/lib/matomo/share/js, /var/lib/matomo/share/tmp, /var/lib/matomo/share/misc
machine # [ 43.201319] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm setfacl: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0

I'm also available for testing and further diagnosis.

Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick