Dear experts,
I was trying to improve the performance of loading a big file to memory,
inspired by this patchset[1], i used direct IO to read this file to the
CMA memory i reserved, in order to reduce the overhead of memory copy
from page cache to its destination.
below is detail:
1, i used cma=6G in cmdline to reserve memory
2, allocate memory with "addr = dma_alloc_coherent"
3, open file with O_RIRECT
int fd = open(FILE_PATH, O_RDONLY | O_DIRECT);
4, read file
while (copied < fsz) {
ssize_t bytes;
ssize_t wanted = fsz - copied;
bytes = kernel_read(file, addr + copied, wanted, &pos);
if (bytes <= 0) {
printk("%s, error , bytes:%ld, wanted:%ld, cpoied:%ld\n",
__FUNCTION__, bytes, wanted, copied);
return bytes;
}
copied += bytes;
}
ftrace log shows every single pair of block_rq_issue and
block_rq_complete spends 0.07s.
test-2991 0..... 119.255974: block_rq_issue: 8,0 R 1310720 ()
505083904 + 2560 0x0,0,0 [test]
test-2991 0..... 119.255974: scsi_dispatch_cmd_start: host_no=0
channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0 driver_tag=19
scheduler_tag=40 cmnd=(READ_10 lba=505083904 txlen=2560 protect=0 raw=28
00 1e 1a f8 00 00 0a 00 00)
test-2991 0d..1. 119.255977: ata_qc_issue: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=19 cmd=96 dev=64 lbal=0 lbam=248 lbah=26
nsect=152 feature=0 hob_lbal=30 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=10 ctl=0 proto=6 flags=0x0
test-2991 0..... 119.255979: block_rq_issue: 8,0 R
1310720 () 505086464 + 2560 0x0,0,0 [test]
test-2991 0..... 119.255994: scsi_dispatch_cmd_start:
host_no=0 channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0
driver_tag=20 scheduler_tag=41 cmnd=(READ_10 lba=505086464 txlen=2560
protect=0 raw=28 00 1e 1b 02 00 00 0a 00 00)
test-2991 0d..1. 119.255997: ata_qc_issue: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=20 cmd=96 dev=64 lbal=0 lbam=2 lbah=27
nsect=160 feature=0 hob_lbal=30 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=10 ctl=0 proto=6 flags=0x0
......
<idle>-0 4d.h2. 119.323158: ata_qc_complete_done: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=19 status=64 dev=0 lbal=0 lbam=0 lbah=0
nsect=1 error=0 hob_lbal=0 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=0 ctl=0 flags=0xf
<idle>-0 4d.h2. 119.323166: scsi_dispatch_cmd_done: host_no=0
channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0 driver_tag=19
scheduler_tag=40 cmnd=(READ_10 lba=505083904 txlen=2560 protect=0 raw=28
00 1e 1a f8 00 00 0a 00 00) result=(driver=DRIVER_OK host=0x0
message=COMMAND_COMPLETE status=0x0) sense=(key=0 asc=0 ascq=0)
<idle>-0 4..s1. 119.323172: block_rq_complete: 8,0 R ()
505083904 + 2560 0x0,0,0 [0]
sorry for the mess, this request applies to read 2560 sectors from
sector 505083904, it takes 119.323172 - 119.255974 = 0.067s.
however, i also used buffered IO to compare, instead of opening with
O_DIRECT, i only open it with O_RDONLY. Turns out, it only takes 0.003
second for each block_rq_issue and block_rq_complete pair(read 2560
sectors from sector 505083904). below is ftrace log:
test-3136 2..... 284.024657: block_rq_issue: 8,0 RA
1310720 () 505083904 + 2560 0x0,0,0 [test]
test-3136 2..... 284.024659: scsi_dispatch_cmd_start:
host_no=0 channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0
driver_tag=10 scheduler_tag=12 cmnd=(READ_10 lba=505083904 txlen=2560
protect=0 raw=28 00 1e 1a f8 00 00 0a 00 00)
test-3136 2d..1. 284.024672: ata_qc_issue: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=10 cmd=96 dev=64 lbal=0 lbam=248 lbah=26
nsect=80 feature=0 hob_lbal=30 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=10 ctl=0 proto=6 flags=0x0
<idle>-0 4d.h2. 284.025808: ata_qc_complete_done: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=9 status=64 dev=0 lbal=0 lbam=0 lbah=0
nsect=0 error=0 hob_lbal=0 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=0 ctl=0 flags=0xf
<idle>-0 4d.h2. 284.025819: scsi_dispatch_cmd_done: host_no=0
channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0 driver_tag=9
scheduler_tag=11 cmnd=(READ_10 lba=505081344 txlen=2560 protect=0 raw=28
00 1e 1a ee 00 00 0a 00 00) result=(driver=DRIVER_OK host=0x0
message=COMMAND_COMPLETE status=0x0) sense=(key=0 asc=0 ascq=0)
<idle>-0 4..s1. 284.025825: block_rq_complete: 8,0 RA ()
505081344 + 2560 0x0,0,0 [0]
test-3136 2..... 284.026912: block_rq_issue: 8,0 RA
1310720 () 505086464 + 2560 0x0,0,0 [test]
test-3136 2..... 284.026914: scsi_dispatch_cmd_start:
host_no=0 channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0
driver_tag=11 scheduler_tag=13 cmnd=(READ_10 lba=505086464 txlen=2560
protect=0 raw=28 00 1e 1b 02 00 00 0a 00 00)
test-3136 2d..1. 284.026927: ata_qc_issue: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=11 cmd=96 dev=64 lbal=0 lbam=2 lbah=27
nsect=88 feature=0 hob_lbal=30 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=10 ctl=0 proto=6 flags=0x0
<idle>-0 4d.h2. 284.028022: ata_qc_complete_done: [FAILED TO
PARSE] ata_port=1 ata_dev=0 tag=10 status=64 dev=0 lbal=0 lbam=0 lbah=0
nsect=0 error=0 hob_lbal=0 hob_lbam=0 hob_lbah=0 hob_nsect=0
hob_feature=0 ctl=0 flags=0xf
<idle>-0 4d.h2. 284.028033: scsi_dispatch_cmd_done: host_no=0
channel=0 id=0 lun=0 data_sgl=20 prot_sgl=0 prot_op=0x0 driver_tag=10
scheduler_tag=12 cmnd=(READ_10 lba=505083904 txlen=2560 protect=0 raw=28
00 1e 1a f8 00 00 0a 00 00) result=(driver=DRIVER_OK host=0x0
message=COMMAND_COMPLETE status=0x0) sense=(key=0 asc=0 ascq=0)
<idle>-0 4..s1. 284.028039: block_rq_complete: 8,0 RA ()
505083904 + 2560 0x0,0,0 [0]
284.028039 - 284.024657 = 0.003382.
It seems the ssd returns data immediately but i don't know what is the
problem of direct IO read. I would appreciate it if anyone of you gives
me some advice about this case.
My OS is ubuntu 24.04, kernel is 6.14, from upstream no change, with
CONFIG_CMA and CONFIG_DMA_CMA. Intel i5, 16G DDR, 500G SATA SSD(Kingston).
/Song
[1]:https://lore.kernel.org/all/20240730075755.10941-1-link@xxxxxxxx/