Hi everyone,

On Wed, Oct 23, 2024 at 09:38:39PM +0200, Antony Antony wrote:
> On Wed, Oct 23, 2024 at 11:07:05 +0100, David Howells wrote:
> > Hi Antony,
> >
> > I think the attached should fix it properly rather than working around it as
> > the previous patch did. If you could give it a whirl?
>
> Yes this also fix the crash.
>
> Tested-by: Antony Antony <antony.antony@xxxxxxxxxxx>

I cannot confirm this fixes the crash for me. My reproducer is slightly
more complicated than Max's original one, albeit still on NixOS, and it
probably exercises 9p more intensively than the automated NixOS testing
workload.

Here is how to reproduce it:

$ git clone https://gerrit.lix.systems/lix
$ cd lix
$ git fetch https://gerrit.lix.systems/lix refs/changes/29/3329/8 && git checkout FETCH_HEAD
$ nix-build -A hydraJobs.tests.local-releng

I suspect Antony considers the crash fixed because the workload used to
test it needs a significant amount of luck and many retries to trigger
the bug.

On my end, you can see our CI showing the symptoms:
https://buildkite.com/organizations/lix-project/pipelines/lix/builds/2357/jobs/019761e7-784e-4790-8c1b-f609270d9d19/log
We retried probably hundreds of times and saw different corruption
patterns: Python getting confused, ld.so getting confused, sometimes
systemd too. Python had a much higher chance of crashing in many of our
tests.

We reproduced it on aarch64-linux (Ampere Altra Q80-30) as well as on
Intel and AMD CPUs (~5 different systems). As soon as we went back to
the Linux 6.6 series, the bug went away.

We tried to bisect, but ran into weirder problems: we hit the original
regression mentioned in October 2024, so for a certain range of commits
we were unable to bisect any further. I therefore switched my bisection
strategy to look for the point where the bug was *fixed* instead; this
led me to commit e65a0dc1cabe71b91ef5603e5814359451b74ca7, which is the
proper fix mentioned here and in this discussion.

Reverting it on top of 6.12 does indeed cause a massive amount of
traces, see this gist [1] for examples. Applying the "workaround patch",
aka "[PATCH] 9p: Don't revert the I/O iterator after reading", on top of
the revert of e65a0dc1cabe makes the problem go away for 5 tries (5
tries were sufficient to trigger it with only the proper fix applied).

In case it is helpful: the test above copies a significant amount of
assets to an S3 implementation (Garage) running inside the VM. Many of
these assets come from the Nix store, which sits on 9p.

Anyhow, I see three patterns:

- Kernel panic when starting /init: this is the crash Max reported back
  in October 2024 and the one we started to encounter while bisecting
  this problem in the range between v6.11 and v6.12.

- systemd crashing very quickly: this is what we see when reverting
  e65a0dc1cabe71b91ef5603e5814359451b74ca7 on top of v6.12 *OR* when we
  are around v6.12-rc5.

- What the CI above shows, i.e. userspace programs crashing after some
  serious I/O exercising has been done; this happens on top of v6.12,
  v6.14 and v6.15 (incl. stable kernels).

If you need me to test things, please let me know.

[1]: https://gist.dgnum.eu/raito/3d1fa61ebaf642218342ffe644fb6efd

-- 
Ryan Lahfa