Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues

Zihuan Zhang <zhangzihuan@xxxxxxxxxx> · Fri, 8 Aug 2025 09:13:30 +0800

Hi,

在 2025/8/7 21:25, Michal Hocko 写道:
On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
asynchronous I/O, and deep event loops—the original freezer model has shown its age.
A modern userspace might be more complex or convoluted but I do not
think the above statement is accurate or even correct.
You’re right — that statement may not be accurate. I’ll be more careful 
with the wording.
## Background

Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:

- Signal-based logic cannot freeze uninterruptible (D-state) tasks
- Dependencies between processes can cause freeze retries
- Retry-based recovery introduces unpredictable suspend latency

## Real-world problem illustration

Consider the following scenario during suspend:

Freeze Window Begins

     [process A] - epoll_wait()
         │
         ▼
     [process B] - event source (already frozen)

→ A enters D-state because of waiting for B
I thought opoll_wait was waiting in interruptible sleep.

Apologies — my description may not be entirely accurate.

But there are some dmesg logs:

[   62.880497] PM: suspend entry (deep)
[   63.130639] Filesystems sync: 0.249 seconds
[   63.130643] PM: Preparing system for sleep (deep)
[   63.226398] Freezing user space processes
[   63.227193] freeze round: 0, task to freeze: 681
[   63.228110] freeze round: 1, task to freeze: 1
[   63.230064] task:Xorg            state:D stack:0     pid:1404  tgid:1404  ppid:1348   task_flags:0x400100 flags:0x00004004
[   63.230068] Call Trace:
[   63.230069]  <TASK>
[   63.230071]  __schedule+0x52e/0xea0
[   63.230077]  schedule+0x27/0x80
[   63.230079]  schedule_timeout+0xf2/0x100
[   63.230082]  wait_for_completion+0x85/0x130
[   63.230085]  __flush_work+0x21f/0x310
[   63.230087]  ? __pfx_wq_barrier_func+0x10/0x10
[   63.230091]  drm_mode_rmfb+0x138/0x1b0
[   63.230093]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   63.230095]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230097]  drm_ioctl_kernel+0xa5/0x100
[   63.230099]  drm_ioctl+0x270/0x4b0
[   63.230101]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230104]  ? syscall_exit_work+0x108/0x140
[   63.230107]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   63.230141]  __x64_sys_ioctl+0x93/0xe0
[   63.230144]  ? syscall_trace_enter+0xfa/0x1c0
[   63.230146]  do_syscall_64+0x7d/0x2c0
[   63.230148]  ? do_syscall_64+0x1f3/0x2c0
[   63.230150]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   63.230153] RIP: 0033:0x7f1aa132550b
[   63.230154] RSP: 002b:00007ffebab69678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   63.230156] RAX: ffffffffffffffda RBX: 00007ffebab696bc RCX: 00007f1aa132550b
[   63.230158] RDX: 00007ffebab696bc RSI: 00000000c00464af RDI: 000000000000000e
[   63.230159] RBP: 00000000c00464af R08: 00007f1aa0c41220 R09: 000055a71ce32310
[   63.230160] R10: 0000000000000087 R11: 0000000000000246 R12: 000055a71b813660
[   63.230161] R13: 000000000000000e R14: 0000000003a8f5cd R15: 000055a71b6bbfb0
[   63.230164]  </TASK>
[   63.230248] freeze round: 2, task to freeze: 1

You can find it in this patch

link: 
https://lore.kernel.org/all/20250619035355.33402-1-zhangzihuan@xxxxxxxxxx/

→ Cannot respond to freezing signal
→ Freezer retries in a loop
→ Suspend latency spikes

In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**.
Worse, the kernel has no insight into the root cause and simply retries blindly.

## Proposed solution: Freeze priority model

To address this, we propose a **layered freeze model** based on per-task freeze priorities.

### Design

We introduce 4 levels of freeze priority:

| Priority | Level             | Description                       |
|----------|-------------------|-----------------------------------|
| 0        | HIGH              | D-state TASKs                     |
| 1        | NORMAL            | regular  use space TASKS          |
| 2        | LOW               | not yet used                      |
| 4        | NEVER_FREEZE      | zombie TASKs , PF_SUSPNED_TASK    |

The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.
I really fail to see how that is supposed to work to be honest. If a
process is running in the userspace then the priority shouldn't really
matter much. Tasks will get a signal, freeze themselves and you are
done. If they are running in the userspace and e.g. sleeping while not
TASK_FREEZABLE then priority simply makes no difference. And if they are
TASK_FREEZABLE then the priority doens't matter either.

What am I missing?
under ideal conditions, if a userspace task is TASK_FREEZABLE, receives 
the freezing() signal, and enters the refrigerator in a timely manner, 
then freeze priority wouldn’t make a difference.

However, in practice, we’ve observed cases where tasks appear stuck in 
uninterruptible sleep (D state) during the freeze phase  — and thus 
cannot respond to signals or enter the refrigerator. These tasks are 
technically TASK_FREEZABLE, but due to the nature of their sleep state, 
they don’t freeze promptly, and may require multiple retry rounds, or 
cause the entire suspend to fail.