On 4/30/25 11:10 AM, Gregory Price wrote:
> Add documentation on how the CXL driver interacts with the DAX driver.
>
> Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
> ---
>  Documentation/driver-api/cxl/index.rst | 1 +
>  .../driver-api/cxl/linux/cxl-driver.rst | 115 ++++++++++++++++--
>  .../driver-api/cxl/linux/dax-driver.rst | 43 +++++++
>  3 files changed, 149 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/driver-api/cxl/linux/dax-driver.rst
>
> diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
> index 486baf8551aa..1a354ea1cda4 100644
> --- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
> +++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
> @@ -34,6 +34,32 @@ into a single memory region. The memory region has been converted to dax. ::
>      decoder1.0 decoder5.0 endpoint5 port1 region0
>      decoder2.0 decoder5.1 endpoint6 port2 root0
>
> +
> +.. kernel-render:: DOT
> +   :alt: Digraph of CXL fabric describing host-bridge interleaving
> +   :caption: Diagraph of CXL fabric with a host-bridge interleave memory region
> +
> +   digraph foo {
> +     "root0" -> "port1";
> +     "root0" -> "port3";
> +     "root0" -> "decoder0.0";
> +     "port1" -> "endpoint5";
> +     "port3" -> "endpoint6";
> +     "port1" -> "decoder1.0";
> +     "port3" -> "decoder3.0";
> +     "endpoint5" -> "decoder5.0";
> +     "endpoint6" -> "decoder6.0";
> +     "decoder0.0" -> "region0";
> +     "decoder0.0" -> "decoder1.0";
> +     "decoder0.0" -> "decoder3.0";
> +     "decoder1.0" -> "decoder5.0";
> +     "decoder3.0" -> "decoder6.0";
> +     "decoder5.0" -> "region0";
> +     "decoder6.0" -> "region0";
> +     "region0" -> "dax_region0";
> +     "dax_region0" -> "dax0.0";
> +   }
> +
>  For this section we'll explore the devices present in this configuration, but
>  we'll explore more configurations in-depth in example configurations below.
>
> @@ -41,7 +67,7 @@ Base Devices
>  ------------
>  Most devices in a CXL fabric are a `port` of some kind (because each
>  device mostly routes request from one device to the next, rather than
> -provide a manageable service).
> +provide a direct service).
>
>  Root
>  ~~~~
> @@ -53,6 +79,8 @@ The Root contains links to:
>
>  * `Host Bridge Ports` defined by ACPI CEDT CHBS.
>
> +* `Downstream Ports` typically connected to `Host Bridge Ports`

Add ending '.' for consistency.

> +
>  * `Root Decoders` defined by ACPI CEDT CFMWS.
>
>  ::
> @@ -150,6 +178,27 @@ device configuration data. ::
>      driver label_storage_size pmem serial
>      firmware numa_node ram subsystem
>
> +A Memory Device is a discrete base object that is not a port. While it the
> +physical device it belongs to may host an `endpoint`, this relationship is

I have some parsing trouble with the sentence above.
Maybe s/it the/the/.

> +not captured in sysfs.
> +
> +Port Relationships
> +~~~~~~~~~~~~~~~~~~
> +In our example described above, there are four host bridges attached to the
> +root, and two of the host bridges have one endpoint attached.
> +
> +.. kernel-render:: DOT
> +   :alt: Digraph of CXL fabric describing host-bridge interleaving
> +   :caption: Diagraph of CXL fabric with a host-bridge interleave memory region
> +
> +   digraph foo {
> +     "root0" -> "port1";
> +     "root0" -> "port2";
> +     "root0" -> "port3";
> +     "root0" -> "port4";
> +     "port1" -> "endpoint5";
> +     "port3" -> "endpoint6";
> +   }
>
>  Decoders
>  --------
> @@ -322,6 +371,29 @@ settings (granularity and ways must be the same).
>  Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
>  :code:`cxl_port` driver, and is created based on a PCI device's DVSEC registers.
>
> +Decoder Relationships
> +~~~~~~~~~~~~~~~~~~~~~
> +In our example described above, there is one root decoder which routes memory
> +accesses over two host bridges. Each host bridge has a decoder which routes
> +access to their singular endpoint targets. Each endpoint has an decoder which

a decoder

> +translates HPA to DPA and services the memory request.
> +
> +The driver validates relationships between ports by decoder programming, so
> +we can think of decoders being related in a similarly hierarchical fashion to
> +ports.
> +
> +.. kernel-render:: DOT
> +   :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
> +   :caption: Diagraph of CXL root, switch, and endpoint decoders.
> +
> +   digraph foo {
> +     "root0" -> "decoder0.0";
> +     "decoder0.0" -> "decoder1.0";
> +     "decoder0.0" -> "decoder3.0";
> +     "decoder1.0" -> "decoder5.0";
> +     "decoder3.0" -> "decoder6.0";
> +   }
> +
>  Regions
>  -------
>
> @@ -348,6 +420,17 @@ The interleave settings in a `Memory Region` describe the configuration of the
>  `Interleave Set` - and are what can be expected to be seen in the endpoint
>  interleave settings.
>
> +.. kernel-render:: DOT
> +   :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
> +   :caption: Regions are created based on root decoder configurations. Endpoint decoders
> +             must be programmed with the same interleave settings as the region.
> +
> +   digraph foo {
> +     "root0" -> "decoder0.0";
> +     "decoder0.0" -> "region0";
> +     "region0" -> "decoder5.0";
> +     "region0" -> "decoder6.0";
> +   }
>
>  DAX Region
>  ~~~~~~~~~~
> @@ -360,7 +443,6 @@ for more details. ::
>      dax0.0 devtype modalias uevent
>      dax_region driver subsystem
>
> -
>  Mailbox Interfaces
>  ------------------
>  A mailbox command interface for each device is exposed in ::
> @@ -418,17 +500,30 @@ the relationships between a decoder and it's parent.
>
>  For example, in a `Cross-Link First` interleave setup with 16 endpoints
>  attached to 4 host bridges, linux expects the following ways/granularity
> -across the root, host bridge, and endpoints respectively. ::
> +across the root, host bridge, and endpoints respectively.
> +
> +.. flat-table:: 4x4 cross-link first interleave settings
> +
> +   * - decoder
> +     - ways
> +     - granularity
>
> -                ways granularity
> -    root        4    256
> -    host bridge 4    1024
> -    endpoint    16   256
> +   * - root
> +     - 4
> +     - 256
> +
> +   * - host bridge
> +     - 4
> +     - 1024
> +
> +   * - endpoint
> +     - 16
> +     - 256
>
>  At the root, every a given access will be routed to the
>  :code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, every
> -:code:`((HPA / 1024) % 4)th` target endpoint. Each endpoint will translate
> -the access based on the entire 16 device interleave set.
> +:code:`((HPA / 1024) % 4)th` target endpoint. Each endpoint translates based
> +on the entire 16 device interleave set.
>
>  Unbalanced interleave sets are not supported - decoders at a similar point
>  in the hierarchy (e.g. all host bridge decoders) must have the same ways and
> @@ -467,7 +562,7 @@ In this example, the CFMWS defines two discrete non-interleaved 4GB regions
>  for each host bridge, and one interleaved 8GB region that targets both. This
>  would result in 3 root decoders presenting in the root. ::
>
> -  # ls /sys/bus/cxl/devices/root0
> +  # ls /sys/bus/cxl/devices/root0/decoder*
>      decoder0.0 decoder0.1 decoder0.2
>
>    # cat /sys/bus/cxl/devices/decoder0.0/target_list start size
> diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
> new file mode 100644
> index 000000000000..5063d2b675b4
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
> @@ -0,0 +1,43 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================
> +DAX Driver Operation
> +====================
> +The `Direct Access Device` driver was originally designed to provide a
> +memory-like access mechanism to memory-like block-devices. It was
> +extended to support CXL Memory Devices, which provide user-configured
> +memory devices.
>
> +The CXL subsystem depends on the DAX subsystem to generate either:

to either:

>
> +- A file-like interface to userland via :code:`/dev/daxN.Y`, or

- Generate a file-like interface ...

> +- Engaging the memory-hotplug interface to add CXL memory to page allocator.

- Engage the ...

> +
> +The DAX subsystem exposes this ability through the `cxl_dax_region` driver.
> +A `dax_region` provides the translation between a CXL `memory_region` and
> +a `DAX Device`.
> +
> +DAX Device
> +==========
> +A `DAX Device` is a file-like interface exposed in :code:`/dev/daxN.Y`. A
> +memory region exposed via dax device can be accessed via userland software
> +via the :code:`mmap()` system-call. The result is direct mappings to the
> +CXL capacity in the task's page tables.
> +
> +Users wishing to manually handle allocation of CXL memory should use this
> +interface.
> +
> +kmem conversion
> +===============
> +The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
> +memory blocks` managed by :code:`kernel/memory-hotplug.c`. This capacity
> +will be exposed to the kernel page allocator in the user-selected memory
> +zone.
> +
> +The :code:`memmap_on_memory` setting (both global and DAX device local) dictate

dictates

> +where the kernell will allocate the :code:`struct folio` descriptors for this

kernel

> +memory will come from. If :code:`memmap_on_memory` is set, memory hotplug
> +will set aside a portion of the memory block capacity to allocate folios. If
> +unset, the memory is allocated via a normal :code:`GFP_KERNEL` allocation -
> +and as a result will most likely land on the local NUM node of the cpu executing

s/cpu/CPU/ preferably.

> +the hotplug operation.

--
~Randy