Hi, we're experiencing the same issue with a number of NixOS tests that
are heavy in operations copying from the v9fs-mounted nix store.

> On 13. Jun 2025, at 00:24, Ryan Lahfa <ryan@xxxxxxxxx> wrote:
>
> Hi everyone,
>
> On Wed, Oct 23, 2024 at 09:38:39PM +0200, Antony Antony wrote:
>> On Wed, Oct 23, 2024 at 11:07:05 +0100, David Howells wrote:
>>> Hi Antony,
>>>
>>> I think the attached should fix it properly rather than working around it as
>>> the previous patch did. If you could give it a whirl?
>>
>> Yes this also fix the crash.
>>
>> Tested-by: Antony Antony <antony.antony@xxxxxxxxxxx>
>
> I cannot confirm this fixes the crash for me. My reproducer is slightly
> more complicated than Max's original one, albeit, still on NixOS and
> probably uses 9p more intensively than the automated NixOS testings
> workload.
>
> Here is how to reproduce it:
>
> $ git clone https://gerrit.lix.systems/lix
> $ cd lix
> $ git fetch https://gerrit.lix.systems/lix refs/changes/29/3329/8 && git checkout FETCH_HEAD
> $ nix-build -A hydraJobs.tests.local-releng
>
> I suspect the reason for why Antony considers the crash to be fixed is
> that the workload used to test it requires a significant amount of
> chance and retries to trigger the bug.
>
> On my end, you can see our CI showing the symptoms:
> https://buildkite.com/organizations/lix-project/pipelines/lix/builds/2357/jobs/019761e7-784e-4790-8c1b-f609270d9d19/log.
>
> We retried probably hundreds of times and saw different corruption
> patterns, Python getting confused, ld.so getting confused, systemd
> sometimes too. Python had a much higher chance of crashing in many of
> our tests. We reproduced it over aarch64-linux (Ampere Altra Q80-30) but
> also Intel and AMD CPUs (~5 different systems).

Yeah. We're on AMD CPUs, and for us it wasn't bound to specific hardware
either. The errors we saw were:

- malloc(): unaligned tcache chunk detected
- segfaulting Java processes
- misbehaving filesystems (errors about internal structures in ext4,
  incorrect file content in xfs)
- crashing kernels when dealing with the fallout of those errors

> As soon as we reverted to Linux 6.6 series, the bug went away.

Same here, the other way around: we came from 6.6.94, updated to
6.12.34, and immediately saw a number of tests failing, all of which
were heavy in copying data from v9fs to the root filesystem in the VM.

> We bisected but we started to have weirder problems, this is because we
> encountered the original regression mentioned in October 2024 and for a
> certain range of commits, we were unable to bisect anything further.

I had already found last October's issue when I started bisecting, and
later got in touch with Ryan, who recognized that we were chasing the
same issue. I stopped bisecting at that point - the bisect was already
homing in on the changes from last October anyway.

> So I switched my bisection strategy to understand when the bug was
> fixed, this lead me on the commit
> e65a0dc1cabe71b91ef5603e5814359451b74ca7 which is the proper fix
> mentioned here and on this discussion.
>
> Reverting this on the top of 6.12 cause indeed a massive amount of
> traces, see this gist [1] for examples.

Yeah. During the bisect I noticed it flapping around, with the original
October issues crashing immediately during boot.

> Applying the "workaround patch" aka "[PATCH] 9p: Don't revert the I/O
> iterator after reading" after reverting e65a0dc1cabe makes the problem
> go away after 5 tries (5 tries were sufficient to trigger with the
> proper fix).

Yup, I applied the revert and workaround patch on top of 6.12.34 and the
reliably broken test became reliably green again.
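For reference, this is roughly what the combination looked like on our
side on top of a 6.12.34 tree - just a sketch, with the patch file name
standing in for however you saved the workaround patch from this thread:

$ # revert the "proper fix", then apply the workaround on top
$ git revert --no-edit e65a0dc1cabe71b91ef5603e5814359451b74ca7
$ git am 9p-dont-revert-io-iterator.patch  # placeholder file name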
Our test can be reproduced, too:

$ git clone https://github.com/flyingcircusio/fc-nixos.git
$ cd fc-nixos
$ eval $(./dev-setup)
$ nix-build tests/matomo.nix

The test will fail with ext4 complaining something like this:

machine # [ 42.596728] vn2haz1283lxz6iy0rai850a7jlgxbja-matomo-setup-update-pre[1233]: Copied files, updating package link in /var/lib/matomo/current-package.
machine # [ 42.788956] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm setfacl: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 42.958590] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm chown: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 43.068003] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm chmod: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0
machine # [ 43.004098] vn2haz1283lxz6iy0rai850a7jlgxbja-matomo-setup-update-pre[1233]: Giving matomo read+write access to /var/lib/matomo/share/matomo.js, /var/lib/matomo/share/piwik.js, /var/lib/matomo/share/config, /var/lib/matomo/share/misc/user, /var/lib/matomo/share/js, /var/lib/matomo/share/tmp, /var/lib/matomo/share/misc
machine # [ 43.201319] EXT4-fs error (device vda): htree_dirblock_to_tree:1109: inode #13138: block 5883: comm setfacl: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=606087968, rec_len=31074, size=4096 fake=0

I'm also available for testing and further diagnosis.

Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick