On Fri, Feb 28, 2025 at 1:07 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 27.02.25 23:12, Matthew Wilcox wrote:
> > On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
> >>> From the previous discussions that Matthew shared [7], it seems like
> >>> Dave proposed an alternative to moving the extents to the VFS layer:
> >>> inverting the IO read path operations [8]. Maybe this is a more
> >>> approachable solution, since there is precedent for the same in the
> >>> write path?
> >>>
> >>> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
> >>> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/
> >>
> >> Yes, if we are going to optimise away redundant zeros being stored
> >> in the page cache over holes, we need to know where the holes in the
> >> file are before the page cache is populated.
> >
> > Well, you shot that down when I started trying to flesh it out:
> > https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@xxxxxxxxxxxxxxxxxxx/
> >
> >> As for efficient hole tracking in the mapping tree, I suspect that
> >> we should be looking at using exceptional entries in the mapping
> >> tree for holes, not inserting multiple references to the zero folio.
> >> i.e. the important information for data storage optimisation is that
> >> the region covers a hole, not that it contains zeros.
> >
> > The xarray is very much optimised for storing power-of-two sized &
> > aligned objects. It makes no sense to try to track extents using the
> > mapping tree. Now, if we abandon the radix tree for the maple tree, we
> > could talk about storing zero extents in the same data structure.
> > But that's a big change with potentially significant downsides.
> > It's something I want to play with, but I'm a little busy right now.
> >
> >> For buffered reads, all that is required when such an exceptional
> >> entry is returned is a memset of the user buffer.
> >> For buffered
> >> writes, we simply treat it like a normal folio-allocating write and
> >> replace the exceptional entry with the allocated (and zeroed) folio.
> >
> > ... and unmap the zero page from any mappings.
> >
> >> For read page faults, the zero page gets mapped (and maybe
> >> accounted) via the vma rather than the mapping tree entry. For write
> >> faults, a folio gets allocated and the exceptional entry replaced
> >> before we call into ->page_mkwrite().
> >>
> >> Invalidation simply removes the exceptional entries.
> >
> > ... and unmap the zero page from any mappings.
>
> I'll add one detail for future reference; not sure about the priority
> this should have, but it's one of those nasty corner cases that are not
> obvious to spot when having the shared zeropage in MAP_SHARED mappings:
>
> Currently, only FS-DAX makes use of the shared zeropage in "ordinary
> MAP_SHARED" mappings. It doesn't use it for "holes" but for "logically
> zero" pages, to avoid allocating disk blocks (-> translating to actual
> DAX memory) on read-only access.
>
> There is one issue between gup(FOLL_LONGTERM | FOLL_PIN) and the shared
> zeropage in MAP_SHARED mappings. So far it does not apply to fsdax,
> because ... we don't support FOLL_LONGTERM for fsdax at all.
>
> I spelled out part of the issue in fce831c92092 ("mm/memory: cleanly
> support zeropage in vm_insert_page*(), vm_map_pages*() and
> vmf_insert_mixed()").
>
> In general, the problem is that gup(FOLL_LONGTERM | FOLL_PIN) will have
> to decide whether it is okay to longterm-pin the shared zeropage in a
> MAP_SHARED mapping (which might just be fine with a R/O file in some
> cases?), and if not, it would have to trigger FAULT_FLAG_UNSHARE, similar
> to how we break COW in MAP_PRIVATE mappings (shared zeropage ->
> anonymous folio).
>
> If gup(FOLL_LONGTERM | FOLL_PIN) would just always longterm-pin the
> shared zeropage, and somebody else would end up triggering replacement
> of the shared zeropage in the pagecache (e.g., a write() to the file
> offset, a write access to the VMA that triggers a write fault, etc.),
> you'd get a disconnect between what the GUP user sees and what the
> pagecache actually contains.
>
> The file system fault logic will have to be taught about
> FAULT_FLAG_UNSHARE and handle it accordingly (e.g., allocate/fill the
> file hole, allocate disk space, allocate an actual folio ...).
>
> Things like memfd_pin_folios() might require similar care -- that one in
> particular should likely never return the shared zeropage.
>
> Likely, gup(FOLL_LONGTERM | FOLL_PIN) users like RDMA or VFIO will be
> able to trigger it.
>
> Not using the shared zeropage but instead some "hole" PTE marker could
> avoid this problem. Of course, that would not allow reading through the
> shared zeropage there, but maybe that's not strictly required?

Link to slides for the talk:
https://drive.google.com/file/d/1MOJu5FZurV4XaCLrQhM9S5ubN7H_jEA8/view?usp=drive_link

Thanks,
Kalesh

> --
> Cheers,
>
> David / dhildenb
>