Hi,
On 2025/8/7 21:25, Michal Hocko wrote:
On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
asynchronous I/O, and deep event loops—the original freezer model has shown its age.
A modern userspace might be more complex or convoluted but I do not
think the above statement is accurate or even correct.
You’re right — that statement may not be accurate. I’ll be more careful
with the wording.
## Background
Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
- Signal-based logic cannot freeze uninterruptible (D-state) tasks
- Dependencies between processes can cause freeze retries
- Retry-based recovery introduces unpredictable suspend latency
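The retry behaviour listed above can be modelled with a small userspace sketch. This is not the kernel's freezer code. `struct retry_task`, `busy_rounds` and `freeze_with_retries()` are hypothetical names made up for illustration, and the assumption is that a task in uninterruptible sleep simply stays unfreezable for a fixed number of passes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task as seen by a retrying freezer. */
struct retry_task {
    int frozen;      /* 1 once the task has been frozen */
    int busy_rounds; /* passes during which the task cannot be frozen
                      * (stand-in for time spent in D-state) */
};

/*
 * Walk the task array linearly and try to freeze every task. Repeat
 * until nothing is left to freeze or max_rounds is exhausted.
 * Returns the number of passes used, or -1 if tasks remain unfrozen.
 */
static int freeze_with_retries(struct retry_task *tasks, size_t n,
                               int max_rounds)
{
    assert(tasks != NULL);

    for (int round = 0; round < max_rounds; round++) {
        size_t todo = 0;

        for (size_t i = 0; i < n; i++) {
            if (tasks[i].frozen)
                continue;
            if (tasks[i].busy_rounds > 0) {
                /* Still blocked: count it and try again next pass. */
                tasks[i].busy_rounds--;
                todo++;
            } else {
                tasks[i].frozen = 1;
            }
        }
        if (todo == 0)
            return round + 1;
    }
    return -1;
}
```

In this model a single blocked task forces extra full passes over the whole array, which is the source of the unpredictable latency described in the last bullet.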
## Real-world problem illustration
Consider the following scenario during suspend:
Freeze Window Begins
[process A] - epoll_wait()
│
▼
[process B] - event source (already frozen)
→ A enters D-state because it is waiting for B
I thought epoll_wait was waiting in interruptible sleep.
Apologies, my description may not be entirely accurate.
However, here are some dmesg logs that show the problem:
[ 62.880497] PM: suspend entry (deep)
[ 63.130639] Filesystems sync: 0.249 seconds
[ 63.130643] PM: Preparing system for sleep (deep)
[ 63.226398] Freezing user space processes
[ 63.227193] freeze round: 0, task to freeze: 681
[ 63.228110] freeze round: 1, task to freeze: 1
[ 63.230064] task:Xorg state:D stack:0 pid:1404 tgid:1404 ppid:1348 task_flags:0x400100 flags:0x00004004
[ 63.230068] Call Trace:
[ 63.230069] <TASK>
[ 63.230071] __schedule+0x52e/0xea0
[ 63.230077] schedule+0x27/0x80
[ 63.230079] schedule_timeout+0xf2/0x100
[ 63.230082] wait_for_completion+0x85/0x130
[ 63.230085] __flush_work+0x21f/0x310
[ 63.230087] ? __pfx_wq_barrier_func+0x10/0x10
[ 63.230091] drm_mode_rmfb+0x138/0x1b0
[ 63.230093] ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[ 63.230095] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[ 63.230097] drm_ioctl_kernel+0xa5/0x100
[ 63.230099] drm_ioctl+0x270/0x4b0
[ 63.230101] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[ 63.230104] ? syscall_exit_work+0x108/0x140
[ 63.230107] radeon_drm_ioctl+0x4a/0x80 [radeon]
[ 63.230141] __x64_sys_ioctl+0x93/0xe0
[ 63.230144] ? syscall_trace_enter+0xfa/0x1c0
[ 63.230146] do_syscall_64+0x7d/0x2c0
[ 63.230148] ? do_syscall_64+0x1f3/0x2c0
[ 63.230150] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 63.230153] RIP: 0033:0x7f1aa132550b
[ 63.230154] RSP: 002b:00007ffebab69678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 63.230156] RAX: ffffffffffffffda RBX: 00007ffebab696bc RCX: 00007f1aa132550b
[ 63.230158] RDX: 00007ffebab696bc RSI: 00000000c00464af RDI: 000000000000000e
[ 63.230159] RBP: 00000000c00464af R08: 00007f1aa0c41220 R09: 000055a71ce32310
[ 63.230160] R10: 0000000000000087 R11: 0000000000000246 R12: 000055a71b813660
[ 63.230161] R13: 000000000000000e R14: 0000000003a8f5cd R15: 000055a71b6bbfb0
[ 63.230164] </TASK>
[ 63.230248] freeze round: 2, task to freeze: 1
You can find it in this patch:
https://lore.kernel.org/all/20250619035355.33402-1-zhangzihuan@xxxxxxxxxx/
→ Cannot respond to freezing signal
→ Freezer retries in a loop
→ Suspend latency spikes
In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**.
Worse, the kernel has no insight into the root cause and simply retries blindly.
## Proposed solution: Freeze priority model
To address this, we propose a **layered freeze model** based on per-task freeze priorities.
### Design
We introduce 4 levels of freeze priority:
| Priority | Level        | Description                   |
|----------|--------------|-------------------------------|
| 0        | HIGH         | D-state tasks                 |
| 1        | NORMAL       | regular user space tasks      |
| 2        | LOW          | not yet used                  |
| 4        | NEVER_FREEZE | zombie tasks, PF_SUSPEND_TASK |
The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.
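As a rough illustration of the ordering only, the layered walk could look like the userspace sketch below. It is not the patch's implementation. The enum names, `struct prio_task` and `freeze_by_priority()` are hypothetical, the numeric values are copied from the table above, and "freezing" is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical levels; values mirror the table in the proposal. */
enum freeze_prio {
    FREEZE_PRIO_HIGH   = 0,
    FREEZE_PRIO_NORMAL = 1,
    FREEZE_PRIO_LOW    = 2,
    FREEZE_PRIO_NEVER  = 4,
};

struct prio_task {
    int pid;
    enum freeze_prio prio;
    int frozen;
};

/*
 * Freeze tasks one priority level at a time, highest priority first.
 * Tasks marked FREEZE_PRIO_NEVER are skipped entirely. The pid of each
 * frozen task is appended to order_out (which must hold n entries).
 * Returns the number of tasks frozen.
 */
static size_t freeze_by_priority(struct prio_task *tasks, size_t n,
                                 int *order_out)
{
    static const enum freeze_prio levels[] = {
        FREEZE_PRIO_HIGH, FREEZE_PRIO_NORMAL, FREEZE_PRIO_LOW,
    };
    size_t frozen = 0;

    assert(tasks != NULL && order_out != NULL);

    for (size_t l = 0; l < sizeof(levels) / sizeof(levels[0]); l++) {
        for (size_t i = 0; i < n; i++) {
            if (tasks[i].prio != levels[l] || tasks[i].frozen)
                continue;
            tasks[i].frozen = 1;
            order_out[frozen++] = tasks[i].pid;
        }
    }
    return frozen;
}
```

The sketch makes one pass over the task array per level, so the cost grows with the number of levels. It does not model whether a task can actually be frozen when its turn comes.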
I really fail to see how that is supposed to work to be honest. If a
process is running in the userspace then the priority shouldn't really
matter much. Tasks will get a signal, freeze themselves and you are
done. If they are running in the userspace and e.g. sleeping while not
TASK_FREEZABLE then priority simply makes no difference. And if they are
TASK_FREEZABLE then the priority doesn't matter either.
What am I missing?
Under ideal conditions, if a userspace task is TASK_FREEZABLE, receives
the freezing() signal, and enters the refrigerator in a timely manner,
then freeze priority wouldn’t make a difference.
However, in practice, we’ve observed cases where tasks appear stuck in
uninterruptible sleep (D state) during the freeze phase — and thus
cannot respond to signals or enter the refrigerator. These tasks are
technically TASK_FREEZABLE, but due to the nature of their sleep state,
they don’t freeze promptly, and may require multiple retry rounds, or
cause the entire suspend to fail.