On Mon, Apr 21, 2025 at 7:21 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> Michal Clapinski wrote:
> > Currently, the user has to specify each memory region to be used with
> > nvdimm via the memmap parameter. Due to the character limit of the
> > command line, this makes it impossible to have a lot of pmem devices.
> > This new parameter solves this issue by allowing users to divide
> > one e820 entry into many nvdimm regions.
> >
> > This change is needed for the hypervisor live update. VMs' memory will
> > be backed by those emulated pmem devices. To support various VM shapes
> > I want to create devdax devices at 1GB granularity similar to hugetlb.
>
> This looks fairly straightforward, but if this moves forward I would
> explicitly call the parameter something like "split" instead of "pmem"
> to align it better with its usage.
>
> However, while this is expedient I wonder if you would be better
> served with ACPI table injection to get more control and configuration
> options...
>
> > It's also possible to expand this parameter in the future,
> > e.g. to specify the type of the device (fsdax/devdax).
>
> ...for example, if you injected or customized your BIOS to supply an
> ACPI NFIT table you could get to deeper degrees of customization without
> wrestling with command lines. Supply an ACPI NFIT that carves up a large
> memory-type range into an arbitrary number of regions. In the NFIT there
> is a natural place to specify whether the range gets sent to PMEM. See
> call to nvdimm_pmem_region_create() near NFIT_SPA_PM in
> acpi_nfit_register_region(), and "simply" pick a new guid to signify
> direct routing to device-dax. I say simply, but that implies new ACPI
> NFIT driver plumbing for the new mode.
>
> Another overlooked detail about NFIT is that there is an opportunity to
> determine cases where the platform might have changed the physical
> address map from one boot to the next. In other words, I cringe at the
> fragility of memmap=, but I understand that it has the benefit of being
> simple. See the "nd_set cookie" concept in
> acpi_nfit_init_interleave_set().

I also dislike the potential fragility of the memmap= parameter;
however, in our environment, kernel parameters are specifically crafted
for target machine configurations and supplied separately from the
kernel binary, giving us good control.

Regarding the ACPI NFIT suggestion: Our use case involves reusing the
same physical machines (with unchanged firmware) for various
configurations (similar to loaning them out). An advantage for us is
that switching the machine's role only requires changing the kernel
parameters. The ACPI approach, potentially requiring firmware changes,
would break this dynamic reconfiguration.

As I understand it, using ACPI table injection instead of a firmware
change does not eliminate the fragility concerns either. We would still
need to carefully reserve the specific physical range for a particular
machine configuration, and it also adds a dependency on managing and
packaging an external NFIT injection file and process. We already have
a process for kernel parameters, but managing an injection file
externally would complicate things for us.

Also, I might be missing something, but I haven't found a standard way
to automatically create devdax devices using NFIT injection.

Our current plan is to expand the proposed kernel parameter. We are
working on making it default to creating either fsdax or devdax type
regions, without requiring explicit labels, and ensuring these regions
remain stable across kexec as long as the kernel parameter itself
doesn't change (in a way, the kernel parameters take the role of the
labels).

Pasha