> On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> > >> dma_addr_t/len array now that the new DMA API supporting that has
> > >> been merged. Is there any chance the dma-buf maintainers could
> > >> start to kick this off? I'm of course happy to assist.
> >
> > Work on that is already underway for some time.
> >
> > Most GPU drivers already do sg_table -> DMA array conversion, I need
> > to push on the remaining to clean up.
>
> Do you have a pointer?
>
> > >> Yes, that's really puzzling and should be addressed first.
> > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages)
> > > overhead is relatively low (observed in 3GHz tests).
> >
> > Even on a low-end CPU, walking the page tables and grabbing
> > references shouldn't be that much of an overhead.
>
> Yes.
>
> > There must be some reason why you see so much CPU overhead. E.g.
> > compound pages are broken up or similar, which should not happen in
> > the first place.
>
> pin_user_pages unfortunately outputs an array of PAGE_SIZE struct
> pages (modulo offset and a shorter last length). The block direct I/O
> code has fairly recently grown code to reassemble folios from them,
> which did speed up some workloads.
>
> Is this test using the block device or iomap direct I/O code? What
> kernel version is it run on?

Here is my analysis, on Linux 6.6 with F2FS using the iomap direct I/O
path. I compared an O_DIRECT read into a udmabuf+memfd buffer against a
direct copy_file_range() into a dmabuf:

Systrace: on a high-end 3 GHz CPU, the former occupies >80% of the
runtime vs <20% for the latter. On a low-end 1 GHz CPU, the former
becomes CPU-bound.

Perf: for the former, bio_iov_iter_get_pages()/get_user_pages()
dominates the latency. The latter avoids this via lightweight bvec
assignments:

- 13.03% __arm64_sys_read
   - 13.03% f2fs_file_read_iter
      - 13.03% __iomap_dio_rw
         - 12.95% iomap_dio_bio_iter
            - 10.69% bio_iov_iter_get_pages
               - 10.53% iov_iter_extract_pages
                  - 10.53% pin_user_pages_fast
                     - 10.53% internal_get_user_pages_fast
                        - 10.23% __gup_longterm_locked
                           - 8.85% __get_user_pages
                              - 6.26% handle_mm_fault
            - 1.91% iomap_dio_submit_bio
               - 1.64% submit_bio
- 1.13% __arm64_sys_copy_file_range
   - 1.13% vfs_copy_file_range
      - 1.13% dma_buf_copy_file_range
         - 1.13% system_heap_dma_buf_rw_file
            - 1.13% f2fs_file_read_iter
               - 1.13% __iomap_dio_rw
                  - 1.13% iomap_dio_bio_iter
                     - 1.13% iomap_dio_submit_bio
                        - 1.08% submit_bio

Large folios can reduce the GUP overhead, but the result is still
significantly slower than the direct dmabuf-to-bio_vec conversion.

Regards,
Wangtao.
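
P.S. For reference, the comparison boils down to the two userspace
paths sketched below. This is a minimal sketch, not the exact test
code: path B depends on the proposed dma-buf copy_file_range support
(it fails on a mainline kernel), and the udmabuf wrapping of the memfd
is omitted since it does not affect the I/O path being measured.

/*
 * Minimal sketch of the two compared paths; error handling trimmed.
 * Path B only works with the proposed dma-buf copy_file_range
 * support, so it is illustrative rather than mainline behavior.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

#define SIZE (64UL << 20)	/* 64 MiB, O_DIRECT-aligned */

int main(void)
{
	int file_fd = open("data", O_RDONLY | O_DIRECT);

	/* Path A: O_DIRECT read into a memfd mapping (the memfd is
	 * what a udmabuf would wrap). Every read() pins the user pages
	 * via pin_user_pages_fast() -- the GUP cost in the perf tree. */
	int memfd = memfd_create("buf", 0);
	ftruncate(memfd, SIZE);
	void *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, memfd, 0);
	read(file_fd, buf, SIZE);

	/* Path B (proposed): allocate from the system dma-buf heap and
	 * copy_file_range() straight into the dma-buf fd. The heap's
	 * pages are already pinned, so there is no GUP on this path. */
	struct dma_heap_allocation_data alloc = {
		.len = SIZE,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	int heap_fd = open("/dev/dma_heap/system", O_RDWR);
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	lseek(file_fd, 0, SEEK_SET);
	copy_file_range(file_fd, NULL, alloc.fd, NULL, SIZE, 0);

	return 0;
}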
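
The "lightweight bvec assignments" are, roughly, the conversion
sketched below: build an ITER_BVEC iov_iter straight from the heap's
sg_table. The helper name is hypothetical, not the actual system_heap
code; a real version also has to respect segment offsets, queue limits
and the buffer's pinning lifetime.

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/uio.h>

/* Hypothetical sketch: one bvec per (possibly multi-page) sg segment,
 * instead of one pinned struct page per PAGE_SIZE as with GUP. */
static int dmabuf_sgt_to_iter(struct sg_table *sgt, size_t len,
			      struct iov_iter *iter,
			      struct bio_vec **bvecp)
{
	struct scatterlist *sg;
	struct bio_vec *bvec;
	unsigned int i;

	bvec = kvmalloc_array(sgt->orig_nents, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	for_each_sgtable_sg(sgt, sg, i) {
		bvec[i].bv_page   = sg_page(sg);
		bvec[i].bv_len    = sg->length;
		bvec[i].bv_offset = sg->offset;
	}

	/* ITER_DEST: the dmabuf is the destination of the file read. */
	iov_iter_bvec(iter, ITER_DEST, bvec, sgt->orig_nents, len);
	*bvecp = bvec;	/* caller kvfree()s after the I/O completes */
	return 0;
}

With such an iterator the read is just a kiocb plus
f2fs_file_read_iter(), which is the copy_file_range branch of the perf
tree above.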