Hi everyone,

On Wed, Oct 23, 2024 at 09:38:39PM +0200, Antony Antony wrote:
> On Wed, Oct 23, 2024 at 11:07:05 +0100, David Howells wrote:
> > Hi Antony,
> >
> > I think the attached should fix it properly rather than working around it as
> > the previous patch did. If you could give it a whirl?
>
> Yes this also fix the crash.
>
> Tested-by: Antony Antony <antony.antony@xxxxxxxxxxx>

I cannot confirm this fixes the crash for me. My reproducer is slightly
more complicated than Max's original one, albeit still on NixOS, and it
probably exercises 9p more intensively than the automated NixOS testing
workload.

Here is how to reproduce it:

$ git clone https://gerrit.lix.systems/lix
$ cd lix
$ git fetch https://gerrit.lix.systems/lix refs/changes/29/3329/8 && git checkout FETCH_HEAD
$ nix-build -A hydraJobs.tests.local-releng

I suspect Antony considers the crash fixed because the workload used to
test it needs a significant amount of luck and many retries to trigger
the bug.

On my end, you can see our CI showing the symptoms:
https://buildkite.com/organizations/lix-project/pipelines/lix/builds/2357/jobs/019761e7-784e-4790-8c1b-f609270d9d19/log
We retried probably hundreds of times and saw different corruption
patterns: Python getting confused, ld.so getting confused, sometimes
systemd too. Python had a much higher chance of crashing in many of our
tests.

We reproduced it on aarch64-linux (Ampere Altra Q80-30) as well as on
Intel and AMD CPUs (~5 different systems). As soon as we went back to
the Linux 6.6 series, the bug went away.

We tried to bisect, but ran into weirder problems: we hit the original
regression mentioned in October 2024, so for a certain range of commits
we were unable to bisect any further. I therefore switched my bisection
strategy to look for the point where the bug was *fixed* instead; this
led me to commit e65a0dc1cabe71b91ef5603e5814359451b74ca7, which is the
proper fix mentioned here and in this discussion.

Reverting it on top of 6.12 does indeed cause a massive amount of
traces, see this gist [1] for examples. Applying the "workaround patch",
aka "[PATCH] 9p: Don't revert the I/O iterator after reading", on top of
the revert of e65a0dc1cabe makes the problem go away for 5 tries (5
tries were sufficient to trigger it with only the proper fix applied).

In case it is helpful: the test above copies a significant amount of
assets to an S3 implementation (Garage) running inside the VM. Many of
these assets come from the Nix store, which sits on 9p.

Anyhow, I see three patterns:

- Kernel panic when starting /init: this is the crash Max reported back
  in October 2024 and the one we started to encounter while bisecting
  this problem in the range between v6.11 and v6.12.

- systemd crashing very quickly: this is what we see when reverting
  e65a0dc1cabe71b91ef5603e5814359451b74ca7 on top of v6.12 *OR* when we
  are around v6.12-rc5.

- What the CI above shows, i.e. userspace programs crashing after some
  serious I/O exercising has been done; this happens on top of v6.12,
  v6.14 and v6.15 (incl. stable kernels).

If you need me to test things, please let me know.

[1]: https://gist.dgnum.eu/raito/3d1fa61ebaf642218342ffe644fb6efd

-- 
Ryan Lahfa