Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume

Eric Sandeen <sandeen@xxxxxxxxxxx> · Thu, 27 Mar 2025 09:55:21 -0500

On 3/24/25 2:28 PM, Jan Kara wrote:
> On Mon 24-03-25 10:34:56, James Bottomley wrote:
>> On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
>>> On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
>>>> On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
>>>> [...]
>>>>> Let me digest all that and see if we have more hope this time
>>>>> around.
>>>>
>>>> OK, I think I've gone over it all.  The biggest problem with
>>>> resurrecting the patch was bugs in ext3, which isn't a problem now.
>>>> Most of the suspend system has been rearchitected to separate
>>>> suspending user space processes from kernel ones.  The sync it
>>>> currently does occurs before even user processes are frozen.  I think
>>>> (as most of the original proposals did) that we just do freeze all
>>>> supers (using the reverse list) after user processes are frozen but
>>>> just before kernel threads are (this shouldn't perturb the image
>>>> allocation in hibernate, which was another source of bugs in xfs).
>>>
>>> So as far as my memory serves the fundamental problem with this
>>> approach was FUSE - once userspace is frozen, you cannot write to
>>> FUSE filesystems so filesystem freezing of FUSE would block if
>>> userspace is already suspended. You may even have a setup like:
>>>
>>> bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
>>> another fs
>>>
>>> So you really have to be careful to freeze this stack without causing
>>> deadlocks.
>>
>> Ah, so that explains why the sys_sync() sits in suspend resume *before*
>> freezing userspace ... that always appeared odd to me.
>>
>>>  So you need to be freezing userspace after filesystems are
>>> frozen but then you have to deal with the fact that parts of your
>>> userspace will be blocked in the kernel (trying to do some write)
>>> waiting for the filesystem to thaw. But it might be tractable these
>>> days since I have a vague recollection that system suspend is now
>>> able to gracefully handle even tasks in uninterruptible sleep.
>>
>> There is another thing I thought about: we don't actually have to
>> freeze across the sleep.  It might be possible simply to invoke
>> freeze/thaw where sys_sync() is now done to get a better image on
>> stable storage?  That should have fewer deadlock issues.
> 
> Well, there's not going to be a huge difference between doing sync(2) and
> doing freeze+thaw for each filesystem. After you thaw the filesystem
> drivers usually mark that the fs is in an inconsistent state and that triggers
> journal replay / fsck on next mount.

For XFS, IIRC we only do that (mark the log dirty) so that we will process
orphan inodes if we crash while frozen, which today happens only during log
replay. I tried to remove that behavior long ago but didn't get very far.
(Since then maybe we have grown other reasons to mark dirty, not sure.)

https://lore.kernel.org/linux-xfs/83696ce6-4054-0e77-b4b8-e82a1a9fbbc3@xxxxxxxxxx/

Does ext4 mark it dirty too? I actually thought it left a clean journal when
freezing.
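
For anyone who wants to compare the two paths from userspace, here is a
minimal sketch of "sync" versus "freeze + thaw" on a single mounted
filesystem. It uses the FIFREEZE/FITHAW ioctls from <linux/fs.h>. The helper
names (ioc, freeze_then_thaw, sync_all) are made up for illustration, and the
ioctl numbers are computed assuming the generic asm-generic ioctl encoding
(x86, arm64); a few architectures encode the direction bits differently.
This is not what the suspend code does in the kernel, only a way to poke at
the on-disk result by hand. It needs CAP_SYS_ADMIN.

```python
import fcntl
import os


def ioc(direction, type_char, nr, size):
    """Build an ioctl number using the asm-generic layout (an assumption:
    some architectures, e.g. powerpc and mips, use different direction bits)."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


# <linux/fs.h>: FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int)
IOC_READ_WRITE = 3
FIFREEZE = ioc(IOC_READ_WRITE, "X", 119, 4)
FITHAW = ioc(IOC_READ_WRITE, "X", 120, 4)


def freeze_then_thaw(mountpoint):
    """Freeze the filesystem mounted at `mountpoint`, then thaw it at once.

    Requires CAP_SYS_ADMIN. The thaw sits in a `finally` so the filesystem
    is not left frozen if something goes wrong in between.
    """
    fd = os.open(mountpoint, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        try:
            pass  # the on-disk image is stable here; inspect it if desired
        finally:
            fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)


def sync_all():
    """Plain sync(2): flush dirty data on every mounted filesystem."""
    os.sync()


# Example (as root, on a scratch mount, not on /):
#   sync_all()
#   freeze_then_thaw("/mnt/scratch")
```

Whether the log or journal ends up marked dirty after the thaw is exactly
the filesystem-specific question above, so checking the superblock state
after each path on XFS and ext4 would be one way to answer it.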

Thanks,
-Eric

> 								Honza