From: Shashank Balaji <shashank.mahadasyam@xxxxxxxx> The cpu controller interface files account for or affect processes differently based on their scheduling policy, and the underlying scheduler used (fair-class vs. BPF scheduler). Document these differences Signed-off-by: Shashank Balaji <shashank.mahadasyam@xxxxxxxx> --- Documentation/admin-guide/cgroup-v2.rst | 98 +++++++++++++++++++++++++-------- 1 file changed, 75 insertions(+), 23 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3b3685736fe9b12e96a273248dfb4a8c62a4b698..0f79bf42a3e3b2fcbe6409f9e182ba9de1fbb79c 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1095,19 +1095,50 @@ realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy. We have the following scheduling policies: ``SCHED_IDLE``, ``SCHED_BATCH``, +``SCHED_OTHER``, ``SCHED_EXT`` (if ``CONFIG_SCHED_CLASS_EXT`` is enabled), ``SCHED_FIFO``, +``SCHED_RR``, and ``SCHED_DEADLINE``. ``SCHED_{IDLE,BATCH,OTHER,EXT}`` can be scheduled +either by the fair-class scheduler or by a BPF scheduler:: + + CONFIG_SCHED_CLASS_EXT + ├─ Disabled + | └─ SCHED_{IDLE,BATCH,OTHER} -> fair-class scheduler + └─ Enabled + ├─ BPF scheduler disabled + | └─ SCHED_{IDLE,BATCH,OTHER,EXT} -> fair-class scheduler + ├─ BPF scheduler without SCX_OPS_SWITCH_PARTIAL enabled + | └─ SCHED_{IDLE,BATCH,OTHER,EXT} -> BPF scheduler + └─ BPF scheduler with SCX_OPS_SWITCH_PARTIAL enabled + ├─ SCHED_{IDLE,BATCH,OTHER} -> fair-class scheduler + └─ SCHED_EXT -> BPF scheduler + +For more details on ``SCHED_EXT``, check out :ref:`Documentation/scheduler/sched-ext.rst. <sched-ext>` +From the point of view of the cpu controller, processes can be categorized as +follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +Note that the ``cgroup_*`` family of callbacks require ``CONFIG_EXT_GROUP_SCHED`` +to be enabled. For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. - It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1125,6 +1156,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1137,6 +1172,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1149,43 +1188,56 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. - Shows pressure stall information for CPU. See - :ref:`Documentation/accounting/psi.rst <psi>` for details. + Shows pressure stall information for CPU, including the contribution of + realtime processes. See :ref:`Documentation/accounting/psi.rst <psi>` + for details. + + This file accounts for all the processes in the cgroup. cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. + + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. + + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1197,7 +1249,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ -- 2.43.0