[REGRESSION][BISECTED][PATCH 0/1] v6.16 panic/hang in zswap

Hello crypto folks,

For some time now I've been battling a system hang that has been
sporadically affecting some of my machines. It seemed to be introduced
sometime during the v6.16 rc releases, but I couldn't quite pin it
down because it was reproducible (though not very easily) in some of
the versions I built, but not in others. The problem seemed to come
and go. At one point I thought I had bisected it to a netfilter fix,
but that ended up being wrong and led nowhere[1], and I was about
ready to give up on it. Then just yesterday one of the machines that
had been sporadically encountering the issue finally produced a panic
instead of just hanging. The panic indicated the problem was in kswapd,
in the LZ4 compression code (all of my machines that have been
affected use zswap).

Armed with the knowledge that it's a swap problem, I tried to find an
easier and more reliable method to reproduce the problem to try to
better bisect it. I found that I can use the stress-ng[2] tool's page
swapping stressor to immediately and reliably reproduce the
panic. Then, on another affected machine, I ran the page swapping
stressor and also immediately reproduced the hang (so I was fairly
confident that the hang and panic were both caused by the same
regression).

Then I tried bisecting the problem and immediately hit a new snag: on
a fresh build of v6.16.0 I could no longer reproduce the problem, even
though I had reproduced it on that version before. Eventually I began
to suspect that structure layout
randomization was causing the non-reproducibility (and was the reason
my previous attempt to bisect it failed). Using the randstruct.seed
file from one of the known bad kernel builds I had, I was able to
confirm that randstruct does indeed affect the issue. Now with a
"known bad" randstruct.seed I was able to successfully bisect it to
commit 42d9f6c77479 ("crypto: acomp - Move scomp stream allocation
code into acomp").

Instead of just reverting that commit, I tried to understand why this
change would be affected by randstruct, and I believe I found the
problem. Two related structs require the same ordering of a couple of
fields, but one of the structs is new and is automatically randomized
by randstruct (because it contains only function pointers). I put
together the following patch, which resolves the issue. It works by
making the two related structs embed a shared struct, which can be
randomized while still ensuring that both structs end up with the same
layout.

-- Dan

[1] https://lore.kernel.org/regressions/20250731194901.7156-1-dan@xxxxxxxx/
[2] https://github.com/ColinIanKing/stress-ng

#regzbot introduced: 42d9f6c77479

Dan Moulding (1):
  crypto: acomp: Use shared struct for context alloc and free ops

 crypto/acompress.c                  |  6 +++---
 crypto/lz4.c                        |  6 ++++--
 include/crypto/internal/acompress.h | 10 +++++++---
 include/crypto/internal/scompress.h |  5 +----
 4 files changed, 15 insertions(+), 12 deletions(-)

-- 
2.49.1