> On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> > >> dma_addr_t/len array now that the new DMA API supporting that has
> > >> been merged. Is there any chance the dma-buf maintainers could
> > >> start to kick this off? I'm of course happy to assist.
> >
> > Work on that is already underway for some time.
> >
> > Most GPU drivers already do sg_table -> DMA array conversion, I need
> > to push on the remaining to clean up.
>
> Do you have a pointer?
>
> > >> Yes, that's really puzzling and should be addressed first.
> > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages)
> > > overhead is relatively low (observed in 3GHz tests).
> >
> > Even on a low-end CPU, walking the page tables and grabbing
> > references shouldn't be that much of an overhead.
>
> Yes.
>
> > There must be some reason why you see so much CPU overhead. E.g.
> > compound pages are broken up or similar, which should not happen in
> > the first place.
>
> pin_user_pages unfortunately outputs an array of PAGE_SIZE struct
> pages (modulo offset and a shorter last length). The block direct I/O
> code has fairly recently grown code to reassemble folios from them,
> which did speed up some workloads.
>
> Is this test using the block device or iomap direct I/O code? What
> kernel version is it run on?

Here is my analysis, on Linux 6.6 with F2FS using the iomap direct I/O
path. I compared an O_DIRECT read into a udmabuf+memfd buffer against a
direct copy_file_range() into a dmabuf:

Systrace: on a high-end 3 GHz CPU, the former occupies >80% of the
runtime vs <20% for the latter. On a low-end 1 GHz CPU, the former
becomes CPU-bound.

Perf: for the former, bio_iov_iter_get_pages()/get_user_pages()
dominates the latency. The latter avoids this via lightweight bvec
assignments:

- 13.03% __arm64_sys_read
   - 13.03% f2fs_file_read_iter
      - 13.03% __iomap_dio_rw
         - 12.95% iomap_dio_bio_iter
            - 10.69% bio_iov_iter_get_pages
               - 10.53% iov_iter_extract_pages
                  - 10.53% pin_user_pages_fast
                     - 10.53% internal_get_user_pages_fast
                        - 10.23% __gup_longterm_locked
                           - 8.85% __get_user_pages
                              - 6.26% handle_mm_fault
            - 1.91% iomap_dio_submit_bio
               - 1.64% submit_bio
- 1.13% __arm64_sys_copy_file_range
   - 1.13% vfs_copy_file_range
      - 1.13% dma_buf_copy_file_range
         - 1.13% system_heap_dma_buf_rw_file
            - 1.13% f2fs_file_read_iter
               - 1.13% __iomap_dio_rw
                  - 1.13% iomap_dio_bio_iter
                     - 1.13% iomap_dio_submit_bio
                        - 1.08% submit_bio

Large folios can reduce the GUP overhead, but the result is still
significantly slower than the direct dmabuf-to-bio_vec conversion.

Regards,
Wangtao.
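
P.S. For reference, the comparison boils down to the two userspace
paths sketched below. This is a minimal sketch, not the exact test
code: path B depends on the proposed dma-buf copy_file_range support
(it fails on a mainline kernel), and the udmabuf wrapping of the memfd
is omitted since it does not affect the I/O path being measured.

/*
 * Minimal sketch of the two compared paths; error handling trimmed.
 * Path B only works with the proposed dma-buf copy_file_range
 * support, so it is illustrative rather than mainline behavior.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

#define SIZE (64UL << 20)	/* 64 MiB, O_DIRECT-aligned */

int main(void)
{
	int file_fd = open("data", O_RDONLY | O_DIRECT);

	/* Path A: O_DIRECT read into a memfd mapping (the memfd is
	 * what a udmabuf would wrap). Every read() pins the user pages
	 * via pin_user_pages_fast() -- the GUP cost in the perf tree. */
	int memfd = memfd_create("buf", 0);
	ftruncate(memfd, SIZE);
	void *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, memfd, 0);
	read(file_fd, buf, SIZE);

	/* Path B (proposed): allocate from the system dma-buf heap and
	 * copy_file_range() straight into the dma-buf fd. The heap's
	 * pages are already pinned, so there is no GUP on this path. */
	struct dma_heap_allocation_data alloc = {
		.len = SIZE,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	int heap_fd = open("/dev/dma_heap/system", O_RDWR);
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	lseek(file_fd, 0, SEEK_SET);
	copy_file_range(file_fd, NULL, alloc.fd, NULL, SIZE, 0);

	return 0;
}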
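
The "lightweight bvec assignments" are, roughly, the conversion
sketched below: build an ITER_BVEC iov_iter straight from the heap's
sg_table. The helper name is hypothetical, not the actual system_heap
code; a real version also has to respect segment offsets, queue limits
and the buffer's pinning lifetime.

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/uio.h>

/* Hypothetical sketch: one bvec per (possibly multi-page) sg segment,
 * instead of one pinned struct page per PAGE_SIZE as with GUP. */
static int dmabuf_sgt_to_iter(struct sg_table *sgt, size_t len,
			      struct iov_iter *iter,
			      struct bio_vec **bvecp)
{
	struct scatterlist *sg;
	struct bio_vec *bvec;
	unsigned int i;

	bvec = kvmalloc_array(sgt->orig_nents, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	for_each_sgtable_sg(sgt, sg, i) {
		bvec[i].bv_page   = sg_page(sg);
		bvec[i].bv_len    = sg->length;
		bvec[i].bv_offset = sg->offset;
	}

	/* ITER_DEST: the dmabuf is the destination of the file read. */
	iov_iter_bvec(iter, ITER_DEST, bvec, sgt->orig_nents, len);
	*bvecp = bvec;	/* caller kvfree()s after the I/O completes */
	return 0;
}

With such an iterator the read is just a kiocb plus
f2fs_file_read_iter(), which is the copy_file_range branch of the perf
tree above.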