Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Thu, 28 Aug 2025 22:40:11 -0400

On Thu, Aug 28, 2025 at 8:48 PM Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
>
> Michał Cłapiński wrote:
> > On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
> > > >
> > > > Michal Clapinski wrote:
> > > > > This includes:
> > > > > 1. Splitting one e820 entry into many regions.
> > > > > 2. Conversion to devdax during boot.
> > > > >
> > > > > This change is needed for the hypervisor live update. VMs' memory will
> > > > > be backed by those emulated pmem devices. To support various VM shapes
> > > > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > > > Also detecting those devices as devdax during boot speeds up the whole
> > > > > process. Conversion in userspace would be much slower which is
> > > > > unacceptable while trying to minimize
> > > >
> > > > Did you explore the NFIT injection strategy which Dan suggested?[1]
> > > >
> > > > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@xxxxxxxxxxxxxxxxxxxxxxxxx.notmuch/
> > > >
> > > > If so why did it not work?
> > >
> > > I'm new to all this so I might be off on some/all of the things.
> > >
> > > My issues with NFIT:
> > > 1. I can either go with custom bios or acpi nfit injection. Custom
> > > bios sounds rather aggressive to me and I'd prefer to avoid this. The
> > > NFIT injection is done via initramfs, right? If a system doesn't use
> > > initramfs at the moment, that would introduce another step in the boot
> > > process. One of the requirements of the hypervisor live update project
> > > is that the boot process has to be blazing fast and I'm worried
> > > introducing initramfs would go against this requirement.
> > > 2. If I were to create an NFIT, it would have to contain thousands of
> > > entries. That would have to be parsed on every boot. Again, I'm
> > > worried about the performance.
> > >
> > > Do you think an NFIT solution could be as fast as the simple command
> > > line solution?
> >
> > Hello,
> > just a follow up email. I'd like to receive some feedback on this.
>
> Apologies.  I'm not keen on adding kernel parameters so I'm curious what
> you think about Mike's new driver?[1]

Hi Ira,

Mike's proposal and our use case are different.

What we're proposing is a way to automatically convert emulated PMEM
into DAX/FSDAX during boot and subdivide it into page-aligned chunks
(e.g., 1G/2M). We have a userspace agent that then manages these
devdax devices, similar to how HugeTLB pages are handled, allowing the
chunks to be used in a cloud environment to support guest memory for
live updates.

To be clear, we are not trying to make the carved-out PMEM region
scalable. The hypervisor's memory allocation stays the same, and these
PMEM/DAX devices are used exclusively for running VMs. This approach
isn't intended for the general-purpose, scalable persistent memory use
case that Mike's driver addresses.

Pasha