Re: temporary hung tasks on XFS since updating to 6.6.92

> On 16. Jun 2025, at 10:50, Carlos Maiolino <cem@xxxxxxxxxx> wrote:
> 
> On Thu, Jun 12, 2025 at 03:37:10PM +0200, Christian Theune wrote:
>> Hi,
>> 
>> in the last week, after updating to 6.6.92, we’ve encountered a number of VMs reporting temporarily hung tasks blocking the whole system for a few minutes. They unblock by themselves and have similar tracebacks.
>> 
>> The IO PSIs show 100% pressure for that time, but the underlying devices are still processing read and write IO (well within their capacity). I’ve eliminated the underlying storage (Ceph) as the source of problems as I couldn’t find any latency outliers or significant queuing during that time.
>> 
>> I’ve seen somewhat similar reports on 6.6.88 and 6.6.77, but those might have been different outliers.
>> 
>> I’m attaching 3 logs. My intuition and the data so far lead me to think this might be a kernel bug. I haven’t found a way to reproduce it yet.
> 
> At first glance, these machines are struggling because of IO contention, as you
> mentioned. More often than not they seem to be stalling while waiting for log
> space to be freed, so any operation in the FS gets throttled until the journal
> is written back. If the journal is small enough, it will need to issue IO
> often enough to cause IO contention. So I'd point to slow storage or a small
> log area (or both).

Yeah, my analysis so far didn’t show any storage performance issues. I’ll revisit this once more to make sure I’m not missing anything. We’ve previously had issues in this area that turned out to be kernel bugs. We didn’t change anything regarding journal sizes, and the only change that seems relevant is the recent kernel upgrade.
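
For anyone who wants to line up IO pressure with the hung-task reports, here is a minimal sketch of parsing the PSI data, assuming the standard /proc/pressure/io format (one "some" line and one "full" line, each with avg10/avg60/avg300 percentages and a cumulative total in microseconds). The function names parse_psi and read_io_pressure are hypothetical helpers for illustration, not part of any existing tool.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # "total" is cumulative stall time in microseconds; the
            # avg* fields are percentages over the given window.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Read and parse the current IO pressure (requires PSI support)."""
    return parse_psi(Path(path).read_text())
```

Calling read_io_pressure() periodically and logging psi["full"]["avg10"] with a timestamp would show whether the "full" stall share (all non-idle tasks blocked on IO at once) coincides with the hung-task windows.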

> There have been a few log performance improvements during Linux 6.9, though,
> but I can't tell whether you have any of them in your kernel.
> I'd suggest you try running a newer upstream kernel; otherwise you'll get very
> limited support from the upstream community. If you can't, I'd suggest
> reporting this issue to your vendor, so they can track what you are/are not
> using in your current kernel.

Yeah, we’ve started upgrading selected/affected projects to 6.12, to see whether this improves things.

> FWIW, I'm not sure if NixOS uses linux-stable kernels or not. If that's the
> case, the suggestion to run a newer kernel is still valid.

We’re running the NixOS mainline versions, which are very vanilla. There are 4 very small patches that only fix up build details and the binary paths of helper programs, to adapt them to the Nix environment.

Christian


-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick