On Wed, May 21, 2025 at 04:03:37PM -0700, Reinette Chatre wrote:
> Hi Peter and Babu,
>
> On 5/21/25 2:18 AM, Peter Newman wrote:
> > Hi Babu/Reinette,
> >
> > On Wed, May 21, 2025 at 1:44 AM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >>
> >> Hi Babu,
> >>
> >> On 5/20/25 4:25 PM, Moger, Babu wrote:
> >>> Hi Reinette,
> >>>
> >>> On 5/20/2025 1:23 PM, Reinette Chatre wrote:
> >>>> Hi Babu,
> >>>>
> >>>> On 5/20/25 10:51 AM, Moger, Babu wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On 5/20/25 11:06, Reinette Chatre wrote:
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On 5/20/25 8:28 AM, Moger, Babu wrote:
> >>>>>>> On 5/19/25 10:59, Peter Newman wrote:
> >>>>>>>> On Fri, May 16, 2025 at 12:52 AM Babu Moger <babu.moger@xxxxxxx> wrote:
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>>>>> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
> >>>>>>>>> counters available for assignment.
> >>>>>>>>
> >>>>>>>> Earlier I discussed with Reinette[1] what num_mbm_cntrs should
> >>>>>>>> represent in a "soft-ABMC" implementation where assignment is
> >>>>>>>> implemented by assigning an RMID, which would result in all events
> >>>>>>>> being assigned at once.
> >>>>>>>>
> >>>>>>>> My main concern is how many "counters" you can assign by assigning
> >>>>>>>> RMIDs. I recall Reinette proposed reporting the number of groups which
> >>>>>>>> can be assigned separately from counters which can be assigned.
> >>>>>>>
> >>>>>>> More context may be needed here. Currently, num_mbm_cntrs indicates the
> >>>>>>> number of counters available per domain, which is 32.
> >>>>>>>
> >>>>>>> At the moment, we can assign 2 counters to each group, meaning each RMID
> >>>>>>> can be associated with 2 hardware counters. In theory, it's possible to
> >>>>>>> assign all 32 hardware counters to a group—allowing one RMID to be linked
> >>>>>>> with up to 32 counters. However, we currently lack the interface to
> >>>>>>> support that level of assignment.
> >>>>>>>
> >>>>>>> For now, the plan is to support basic assignment and expand functionality
> >>>>>>> later once we have the necessary data structure and requirements.
> >>>>>>
> >>>>>> Looks like some requirements did not make it into this implementation.
> >>>>>> Do you recall the discussion that resulted in you writing [2]? Looks like
> >>>>>> there is a question to Peter in there on how to determine how many "counters"
> >>>>>> are available in soft-ABMC. I interpreted [3] at that time to mean that this
> >>>>>> information would be available in a future AMD publication.
> >>>>>
> >>>>> We already have a method to determine the number of counters in soft-ABMC
> >>>>> mode, which Peter has addressed [4].
> >>>>>
> >>>>> [4]
> >>>>> https://lore.kernel.org/lkml/20250203132642.2746754-1-peternewman@xxxxxxxxxx/
> >>>>>
> >>>>> This appears to be more of a workaround, and I doubt it will be included
> >>>>> in any official AMD documentation. Additionally, the long-term direction
> >>>>> is moving towards ABMC.
> >>>>>
> >>>>> I don’t believe this workaround needs to be part of the current series. It
> >>>>> can be added later when soft-ABMC is implemented.
> >>>>
> >>>> Agreed. What about the plans described in [2]? (Thanks to Peter for
> >>>> catching this!).
> >>>>
> >>>> It is important to keep track of requirements while working on a feature to
> >>>> ensure that the implementation supports the planned use cases. Re-reading that
> >>>> thread it is not clear to me how soft-ABMC's per-group assignment would look.
> >>>> Could you please share how you see it progress from this implementation?
> >>>> This includes the single event vs. multiple event assignment. I would like to
> >>>> highlight that this is not a request for this to be supported in this implementation
> >>>> but there needs to be a plan for how this can be supported on top of interfaces
> >>>> established by this work.
> >>>>
> >>>
> >>> Here’s my current understanding of soft-ABMC.
> >>> Peter may have a more in-depth perspective on this.
> >>>
> >>> Soft-ABMC:
> >>> a. num_mbm_cntrs: This is a software-defined limit based on the number of active RMIDs that can be supported. The value can be obtained using the code referenced in [4].
> >
> > I would call it a hardware-defined limit that can be probed by software.
> >
> > The main question is whether this file returns the exact number of
> > RMIDs hardware can track or double that number (mbm_total_bytes +
> > mbm_local_bytes) so that the value is always measured in events.
>
> tl;dr: I continue [3] to find it most intuitive for num_mbm_cntrs to be the exact
> number of "active" RMIDs that the system can support *and* changing the name of
> the modes to help user interpret num_mbm_cntrs: "mbm_cntr_event_assign" for ABMC,
> "mbm_cntr_group_assign" for soft-ABMC.
>
> details
> -------
>
> We are now back to the previous discussion about what user can expect from
> the interface. Let me try and re-cap that discussion so that we can all hopefully
> get back on the same page. Please add corrections/updates where needed.
>
> soft-ABMC
> ---------
> soft-ABMC manages "active" (term TBD) RMID assignment to monitor groups. When an
> "active" RMID is assigned to a monitor group then *all* MBM events (not LLC occupancy)
> in that monitor group are counted. "Active" RMID assignment can be done per domain.
>
> Requirement: resctrl should accurately reflect which events are counted. That is,
> we do not want resctrl to pretend to allow user to assign an "active" RMID to
> only one event in a monitor group while all events are actually counted.
>
> Caveat: To support rapid re-assignment of RMIDs to monitor groups, llc_occupancy
> event is disabled when soft-ABMC is enabled.
>
> ABMC
> ----
> ABMC manages (hardware) counter assignment to monitor group (RMID), event pairs.
> When a hardware counter is assigned to an RMID, event pair then only that
> RMID, event is counted.
> Hardware counter assignment can be done per domain.
>
>
> shared assignment
> -----------------
> A shared assignment applies to both soft-ABMC and ABMC. A user can designate a
> "counter" (could be hardware counter or "active" RMID) as shared and that means
> the counter within that domain is shared between different monitor groups and actual
> assignment is scheduled by resctrl.
>
>
> user interface
> --------------
>
> Next, consider the interface while keeping above definitions and requirements in mind.
>
> This series introduces (using implementation, not cover-letter):
>
> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> "num_mbm_cntrs":
>    The maximum number of monitoring counters (total of available and assigned
>    counters) in each domain when the system supports mbm_cntr_assign mode.
>
> /sys/fs/resctrl/mbm_L3_assignments
> "mbm_L3_assignments":
>    This interface file is created when the mbm_cntr_assign mode is supported
>    and shows the assignment status for each group.
>
> Consider "mbm_L3_assignments" first. The interface is documented for ABMC support
> where it is possible to manage individual event assignment within monitor group.
>
> For ABMC it is possible to assign just one event at a time and doing so consumes
> one counter in that domain:
>
> a) Starting state on system with 32 counters per domain, two events in default
>    resource group consumes two counters in that domain:
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=30;1=32
>    # cat /sys/fs/resctrl/mbm_L3_assignments
>    mbm_total_bytes:0=e;1=_
>    mbm_local_bytes:0=e;1=_
>
> b) Assign counter to mbm_local_bytes in domain 1:
>    # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
>    # cat /sys/fs/resctrl/mbm_L3_assignments
>    mbm_total_bytes:0=e;1=_
>    mbm_local_bytes:0=e;1=e
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=30;1=31
>
> The question is how this should look on soft-ABMC system.
> Let's say hypothetically
> that on a soft-ABMC system it is possible to have 32 "active" RMIDs.
>
> a) Starting state on system with 32 "active RMIDs" per domain, two events in default
>    resource group consumes one RMID in that domain:
>
>    # cat /sys/fs/resctrl/mbm_L3_assignments
>    mbm_total_bytes:0=e;1=_
>    mbm_local_bytes:0=e;1=_
>
>    What should num_mbm_cntrs display?
>
>    Option A (counters are RMIDs):
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=31;1=32
>
>    Option B (pretend RMIDs are events):
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=62;1=64
>
> b) Assign counter to mbm_local_bytes in domain 1:
>    # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
>    # cat /sys/fs/resctrl/mbm_L3_assignments
>    mbm_total_bytes:0=e;1=e
>    mbm_local_bytes:0=e;1=e
>
>    Note that even though user requested only mbm_local_bytes to be assigned, it
>    actually results in both mbm_total_bytes and mbm_local_bytes to be assigned. This
>    ensures accurate state representation to user space but this also creates an
>    inconsistent user interface between soft-ABMC and ABMC since user space intends
>    to use the same interface but "sometimes" assigning one event results in assign
>    of one event while "sometimes" it results in assign of multiple events.
>
>    wrt "num_mbm_cntrs"
>
>    Option A (counters are RMIDs):
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=31;1=31
>
>    Option B (pretend RMIDs are events):
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    0=62;1=62
>
> Neither option seems ideal to me since the interface cannot be consistent
> between ABMC and soft-ABMC.
> As I mentioned in [2] it is not possible to hide ABMC and soft-ABMC behind
> the same interface. When user space wants to monitor a particular monitor group
> then it should be clear how that can be accomplished.
> Not knowing if
> an assignment/unassignment to/from an event would impact one or all events
> and whether it will consume one or multiple counters does not sound like a good
> interface to me.
>
> As I understand current interface, user is required to know how ABMC and soft-ABMC
> is implemented to be able to configure the system. For example, if user has file like:
>    # cat /sys/fs/resctrl/mbm_L3_assignments
>    mbm_total_bytes:0=e;1=e
>    mbm_local_bytes:0=e;1=e
> user must know underlying implementation to be able to manage monitoring of
> events and assigning counters otherwise it will be a surprise to lose monitoring
> of all events when unassigning one event.
>
> This is why I proposed in [3] that the name of the mode reflects how user can interact
> with the system. Instead of one "mbm_cntr_assign" mode there can be "mbm_cntr_event_assign"
> that is used for ABMC and "mbm_cntr_group_assign" that is used for soft-ABMC. The mode should
> make it clear what the system is capable of wrt counter assignments.
>
> Considering this the interface should be clear:
> num_mbm_cntrs: reflects the number of counters in each domain that can be assigned. In
>   "mbm_cntr_event_assign" this will be the number of counters that can be assigned to
>   each event within a monitoring group, in "mbm_cntr_group_assign" this will be the number
>   of counters that can be assigned to entire monitoring groups impacting all MBM events.
>
> mbm_L3_assignments: manages the counter assignment in each group. When user knows the mode
>   is "mbm_cntr_event_assign"/"mbm_cntr_group_assign" then it should be clear to user space how the
>   interface behaves wrt assignment, no surprises of multiple events impacted when
>   assigning/unassigning single event.
>
> For soft-ABMC I thus find it most intuitive for num_mbm_cntrs to be the exact number
> of "active" RMIDs that the system can support *and* changing the name of the modes
> to help user interpret num_mbm_cntrs.
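
To make the counting in the two proposed modes concrete, here is a rough
userspace sketch (not resctrl code). The enum, macro and function names are
made up for illustration; only the two mode names and the 32-counter figure
come from the discussion above. It assumes "Option A", where one "active"
RMID counts as one counter:

```c
#include <assert.h>

/* Hypothetical names mirroring the two mode names proposed above. */
enum mbm_assign_mode {
	MBM_CNTR_EVENT_ASSIGN,	/* ABMC: one counter per (group, event) pair */
	MBM_CNTR_GROUP_ASSIGN,	/* soft-ABMC: one "active" RMID per group */
};

#define NUM_MBM_EVENTS 2u	/* mbm_total_bytes and mbm_local_bytes */

/* Counters one monitor group consumes in a domain for nr_events assigned events. */
static unsigned int cntrs_consumed(enum mbm_assign_mode mode,
				   unsigned int nr_events)
{
	assert(nr_events <= NUM_MBM_EVENTS);
	if (nr_events == 0)
		return 0;
	/* In group mode any assignment costs exactly one "active" RMID. */
	return mode == MBM_CNTR_EVENT_ASSIGN ? nr_events : 1u;
}

/* Counters still available in a domain that started with total counters. */
static unsigned int cntrs_available(enum mbm_assign_mode mode,
				    unsigned int total,
				    unsigned int nr_events)
{
	unsigned int used = cntrs_consumed(mode, nr_events);

	assert(used <= total);
	return total - used;
}
```

With 32 counters per domain this gives 30 and 31 for the two ABMC states
shown earlier, and 31 for a soft-ABMC group whether one or both events were
requested.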
> >
> > There's also the mongroup-RMID overcommit use case I described
> > above[1]. On Intel we can safely assume that there are counters to
> > back all RMIDs, so num_mbm_cntrs would be calculated directly from
> > num_rmids.
>
> This is about the:
>    There's now more interest in Google for allowing explicit control of
>    where RMIDs are assigned on Intel platforms. Even though the number of
>    RMIDs implemented by hardware tends to be roughly the number of
>    containers they want to support, they often still need to create
>    containers when all RMIDs have already been allocated, which is not
>    currently allowed. Once the container has been created and starts
>    running, it's no longer possible to move its threads into a monitoring
>    group whenever RMIDs should become available again, so it's important
>    for resctrl to maintain an accurate task list for a container even
>    when RMIDs are not available.
>
> I see a monitor group as a collection of tasks that need to be monitored together.
> The "task list" is the group of tasks that share a monitoring ID that
> is required to be a valid ID since when any of the tasks are scheduled that ID is
> written to the hardware. I intentionally tried to not use RMID since I believe
> this is required for all archs.
> I thus do not understand how a task can start running when it does not have
> a valid monitoring ID. The idea of "deferred assignment" is not clear to me,
> there can never be "unmonitored tasks", no? I think I am missing something here.

In the AMD/RMID implementation this might be achieved with something
extra in the task structure to denote whether a task is in a monitored
group or not. E.g. we add "task->rmid_valid" as well as "task->rmid".
Tasks in an unmonitored group retain their "task->rmid" (that's what
identifies them as a member of a group) but have task->rmid_valid set
to false. Context switch code would be updated to load "0" into the
IA32_PQR_ASSOC.RMID field for tasks without a valid RMID.
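
A minimal userspace sketch of that idea, not actual kernel code. The struct
and helper names are hypothetical; only "rmid" and "rmid_valid" come from
the text above. It assumes the MSR value is composed the way a
wrmsr(msr, lo, hi) call would, with the RMID in the low 32 bits and the
CLOSID in the high 32 bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the resctrl fields carried in the task structure. */
struct task_mon_state {
	uint32_t closid;
	uint32_t rmid;		/* always kept: identifies the monitor group */
	bool rmid_valid;	/* false while the group has no RMID backing it */
};

#define DEFAULT_GROUP_RMID 0u	/* RMID of the default resctrl group */

/* Value to load into IA32_PQR_ASSOC when switching to task t. */
static uint64_t pqr_assoc_value(const struct task_mon_state *t)
{
	uint32_t rmid;

	assert(t != NULL);
	/* Tasks without a valid RMID fall back to the default group's RMID. */
	rmid = t->rmid_valid ? t->rmid : DEFAULT_GROUP_RMID;
	return ((uint64_t)t->closid << 32) | rmid;
}
```

Note that clearing rmid_valid leaves t->rmid untouched, so group
membership survives while the RMID is withdrawn.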
So they would still be monitored, but activity would be bundled with all
tasks in the default resctrl group. Presumably something analogous could
be done for ARM/MPAM.

> > I realized this use case is more difficult to implement on MPAM,
> > because a PARTID is effectively a CLOSID+RMID, so deferring assigning
> > a unique PARTID to a group also results in it being in a different
> > allocation group. It will work if the unmonitored groups could find a
> > way to share PARTIDs, but this has consequences on allocation - but
> > hopefully no worse than sharing CLOSIDs on x86.
> >
> > There's a lot of interest in monitoring ID overcommit in Google, so I
> > think it's worth it for me to investigate the additional structural
> > changes needed in resctrl (i.e., breaking the FS-level association
> > between mongroups and HW monitoring IDs). Such a framework could be a
> > better fit for soft-ABMC. For example, if overcommit is allowed, we
> > would just report the number of simultaneous RMIDs we were able to
> > probe as num_rmids. I would want the same shared assignment scheduler
> > to be able to work with RMIDs and counters, though.
> >
> > Thanks,
> > -Peter
> >
> > [1] https://lore.kernel.org/lkml/CALPaoChSzzU5mzMZsdT6CeyEn0WD1qdT9fKCoNW_ty4tojtrkw@xxxxxxxxxxxxxx/
>
> Reinette
>
> [2] https://lore.kernel.org/lkml/b9e48e8f-3035-4a7e-a983-ce829bd9215a@xxxxxxxxx/
> [3] https://lore.kernel.org/lkml/b3babdac-da08-4dfd-9544-47db31d574f5@xxxxxxxxx/

-Tony