Re: temporary hung tasks on XFS since updating to 6.6.92

On Mon, Jun 16, 2025 at 12:09:21PM +0200, Christian Theune wrote:
> 
> > On 16. Jun 2025, at 11:47, Carlos Maiolino <cem@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 16, 2025 at 10:59:34AM +0200, Christian Theune wrote:
> >>
> >>
> >>> On 16. Jun 2025, at 10:50, Carlos Maiolino <cem@xxxxxxxxxx> wrote:
> >>>
> >>> On Thu, Jun 12, 2025 at 03:37:10PM +0200, Christian Theune wrote:
> >>>> Hi,
> >>>>
> >>>> in the last week, after updating to 6.6.92, we’ve encountered a number of VMs reporting temporarily hung tasks blocking the whole system for a few minutes. They unblock by themselves and have similar tracebacks.
> >>>>
> >>>> The IO PSIs show 100% pressure for that time, but the underlying devices are still processing read and write IO (well within their capacity). I’ve eliminated the underlying storage (Ceph) as the source of problems as I couldn’t find any latency outliers or significant queuing during that time.
> >>>>
> >>>> I’ve seen somewhat similar reports on 6.6.88 and 6.6.77, but those might have been different outliers.
> >>>>
> >>>> I’m attaching 3 logs - my intuition and the data so far leads me to consider this might be a kernel bug. I haven’t found a way to reproduce this, yet.
> >>>
> >>> At first glance, these machines are struggling because of IO contention, as you
> >>> mentioned. More often than not they seem to be stalling while waiting for log
> >>> space to be freed, so any operation in the FS gets throttled until the journal
> >>> is written back. If the journal is small enough, it will need to issue IO often
> >>> enough to cause IO contention. So I'd point to slow storage or a small
> >>> log area (or both).
> >>
> >> Yeah, my current analysis didn’t show any storage performance issues. I’ll revisit this once more to make sure I’m not missing anything. We’ve previously had issues in this area that turned out to be kernel bugs. We didn’t change anything regarding journal sizes and only a recent kernel upgrade seemed to be relevant.
> >
> > You mentioned you saw PSI showing a huge pressure ratio during that time, which
> > might be generated by the journal writeback, and given it's a SYNC write, IOs
> > will stall if your storage can't write it fast enough. IIRC, most of the threads
> > in the logs you shared were waiting for log space before they could continue,
> > which causes the log to flush things to disk and of course increases IO usage.
> > If your storage can't handle these IO bursts, then you'll get the stalls you're
> > seeing.
> > I'm not ruling out the possibility that you're hitting a bug here, but so far it
> > seems your storage can't service IOs fast enough to avoid such stalls, or
> > something else is throttling IOs; XFS just seems to be the victim.
> 
> Yeah, it’s annoying, I know. To paraphrase "any sufficiently advanced bug is indistinguishable from slow storage”. ;)
> 
> As promised, I’ll dive deeper into the storage performance analysis. All telemetry so far has been completely innocuous, but it’s a complex layering of SSDs → Ceph → Qemu … Usually when we have performance issues, the metrics reflect this quite obviously and many machines are affected at the same time. Since this has only ever affected a single VM at a time, spread out over time, my gut feeling leans a bit more towards “it might be a bug”. As those things tend to be hard/nasty to diagnose, I wanted to raise the flag early to see whether others have an “aha” moment because they’re experiencing something similar.
> 
> I’ll get back to you in 2-3 days with results from the storage analysis.
> 
> > Can you share the xfs_info of one of these filesystems? I'm curious about the FS
> > geometry.
> 
> Sure:
> 
> # xfs_info /
> meta-data=/dev/disk/by-label/root isize=512    agcount=21, agsize=655040 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
>          =                       exchange=0
> data     =                       bsize=4096   blocks=13106171, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> log      =internal log           bsize=4096   blocks=16384, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

This seems to have been extended a few times, but nothing out of the ordinary. Looks
fine, but is this one of the filesystems where you are facing the problem? I'm
surprised this is the root FS; do you really have that much IO on the root FS?
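
For reference, the log geometry above works out to (my arithmetic):

    16384 blocks * 4096 bytes/block = 64 MiB

so the size of the root FS log itself looks reasonable, unlike the /tmp one below.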

> 
> # xfs_info /tmp/
> meta-data=/dev/vdb1              isize=512    agcount=8, agsize=229376 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
>          =                       exchange=0
> data     =                       bsize=4096   blocks=1833979, imaxpct=25
>          =                       sunit=1024   swidth=1024 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

This is worrisome. Your journal size is 10MiB, which can easily keep IO stalled
waiting for log space to be freed; depending on the workload on the machine this
can be triggered quite easily. I'm curious though how you made this FS, because
2560 blocks is below the minimum log size that xfsprogs has allowed since
(/me goes to look at the git log) 2022, xfsprogs 5.15.
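
That 10MiB figure is just the log geometry from the xfs_info output above:

    2560 blocks * 4096 bytes/block = 10485760 bytes = 10 MiB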

FWIW, one of the reasons the minimum journal log size was increased is the
latency/stalls that happen when waiting for free log space, which is exactly
the symptom you've been seeing.
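
If you want to confirm that the stall windows line up with log-space waits rather
than with the storage itself, one rough way to watch for it (just a sketch; it
assumes tracefs is mounted at /sys/kernel/tracing and that your kernel exposes the
XFS log grant tracepoints, e.g. xfs_log_grant_sleep) would be:

    # cd /sys/kernel/tracing
    # echo 'xfs:xfs_log_grant_*' > set_event
    # cat trace_pipe
    # echo > set_event

Bursts of xfs_log_grant_sleep events for the affected device during one of the
hung-task windows would mean threads really are sleeping on free log space; the
last command clears set_event again when you're done.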

I'd suggest you check the xfsprogs commit below if you want more details,
but if this is one of the filesystems where you see the stalls, it might very
well be the cause:

commit cdfa467edd2d1863de067680b0a3ec4458e5ff4a
Author: Eric Sandeen <sandeen@xxxxxxxxxx>
Date:   Wed Apr 6 16:50:31 2022 -0400

    mkfs: increase the minimum log size to 64MB when possible
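
As far as I know the log of an existing XFS filesystem can't be resized in place,
so getting a bigger log on that /tmp filesystem means recreating it. Roughly (a
sketch only, using the device name from your xfs_info output; adapt it to however
you normally provision /tmp):

    # umount /tmp
    # mkfs.xfs -f -l size=64m /dev/vdb1
    # mount /dev/vdb1 /tmp

A recent mkfs.xfs (one that includes the commit above) should pick a 64MB log on
its own for a filesystem of that size, so recreating it with current xfsprogs and
no -l option at all should end up with the same result.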

> 
> 
> >
> >>
> >>> There have been a few improvements to log performance during Linux 6.9, though,
> >>> but I can't tell whether you have any of those improvements in your kernel.
> >>> I'd suggest trying to run a newer upstream kernel, otherwise you'll get very
> >>> limited support from the upstream community. If you can't, I'd suggest
> >>> reporting this issue to your vendor, so they can track what you are/are not
> >>> running in your current kernel.
> >>
> >> Yeah, we’ve started upgrading selected/affected projects to 6.12, to see whether this improves things.
> >
> > Good enough.
> >
> >>
> >>> FWIW, I'm not sure whether NixOS uses linux-stable kernels or not. If it does,
> >>> the suggestion to run a newer kernel still stands.
> >>
> >> We’re running the NixOS mainline kernels, which are very vanilla. There are only 4 small patches, and they just fix up build details and binary paths of helpers to adapt them to the Nix environment.
> >
> > I see. There were some improvements in the newer versions, so ruling out any
> > bug that may already have been fixed is worth it.
> >
> >
> >>
> >> Christian
> >>
> >>
> >> --
> >> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> >> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> 
> 
> --
> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> 
> 



