On Tue, Jun 24, 2025 at 08:40:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> > On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > > Even with this new version you have to decide to return PUD_SIZE or
> > > bar_size in pci and your same reasoning that PUD_SIZE makes sense
> > > applies (though I would probably return bar_size and just let the core
> > > code cap it to PUD_SIZE)
> >
> > Yes.
> >
> > Today I went back to look at this, I was trying to introduce this for
> > file_operations:
> >
> > 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> >
> > It looks almost good, except that it so far has no way to return the
> > physical address for further calculation on the alignment.
> >
> > For THP, VA is always calculated against pgoff not physical address on the
> > alignment.  I think it's OK for THP, because every 2M THP folio will be
> > naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> > in the calculation of thp_get_unmapped_area_vmflags().
> >
> > Logically it should even also work for vfio-pci, as long as VFIO keeps
> > using the lower 40 bits of the device_fd to represent the bar offset,
> > meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> > aligned with bar sizes.
> >
> > But from an API POV, get_mapping_order() logically should return something
> > for further calculation of the alignment to get the VA.  pgoff here may not
> > always be the right thing to use to align to the VA: after all, pgtable
> > mapping is about VA -> PA, the only reasonable and reliable way is to align
> > VA to the PA to be mapped, and as an API we shouldn't assume pgoff is
> > always aligned to PA address space.
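
To make the concern above concrete, here is a rough userspace sketch (not
kernel code).  The helper names are made up for illustration, and it
assumes 4K base pages, byte-granular physical addresses and an order
counted in pages.  It picks a VA congruent to a given byte offset, then
checks whether that VA shares its low bits with the PA, which is what a
2^order mapping needs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages */

/* Bytes covered by one mapping of the given page order. */
static uint64_t order_size(unsigned int order)
{
	assert(order + PAGE_SHIFT < 64);
	return UINT64_C(1) << (order + PAGE_SHIFT);
}

/*
 * Hypothetical helper: lowest VA >= hint whose low bits equal those of
 * the byte offset 'off' (either pgoff << PAGE_SHIFT or a PA).
 */
static uint64_t va_congruent_to(uint64_t hint, uint64_t off, unsigned int order)
{
	uint64_t size = order_size(order), mask = size - 1;
	uint64_t va = (hint & ~mask) | (off & mask);

	return va < hint ? va + size : va;
}

/* Hypothetical check: VA and PA must agree modulo the mapping size. */
static bool va_matches_pa(uint64_t va, uint64_t phys, unsigned int order)
{
	return ((va ^ phys) & (order_size(order) - 1)) == 0;
}
```

With order 9 (2M) and pgoff=0, a PA of 0x80000000 yields the same VA
whether it is derived from pgoff or from the PA.  A PA of 0x80100000
does not: the pgoff-derived VA then fails va_matches_pa(), so the huge
mapping could not be installed there.
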
>
> My feeling, and the reason I used the phrase "pgoff aligned address",
> is that the owner of the file should already ensure that for the large
> PTEs/folios:
>  pgoff % 2**order == 0
>  physical % 2**order == 0

IMHO there shouldn't really be any hard requirement in mm that pgoff and
physical address space need to be aligned.. but I confess I don't have an
example driver in the Linux tree that doesn't do that.

>
> So, things like VFIO do need to hand out high alignment pgoffs to make
> this work - which it already does.
>
> To me this just keeps things simpler. I guess if someone comes up with
> a case where they really can't get a pgoff alignment and really need a
> high order mapping then maybe we can add a new return field of some
> kind (pgoff adjustment?) but that is so weird I'd leave it to the
> future person to come and justify it.

When looking more, I also found some special-cased get_unmapped_area()
implementations that may not be trivially converted into the new API even
for CONFIG_MMU, namely:

- io_uring_get_unmapped_area
- arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)

I'll need to take a closer look tomorrow.  If any of them cannot be 100%
safely converted to the new API, I'd also think we should not introduce the
new API, but reuse get_unmapped_area() until we know a way out.

-- 
Peter Xu