Hi David,
On 2025/5/27 19:17, David Sterba wrote:
> On Tue, May 27, 2025 at 10:32:00AM +0800, Gao Xiang wrote:
>> On 2025/5/8 12:19, Eric Biggers wrote:
>>> ...
>>> BTW, I also have to wonder why this patchset is proposing accelerating
>>> zlib instead of Zstandard. Zstandard is a much more modern algorithm.
>>
>> I think it's simply because QAT doesn't support native Zstandard offload.
>> At least on Intel Xeon Sapphire Rapids processors (which seem to have
>> built-in QAT 4xxx devices), only LZ4 and the deflate family are natively
>> supported.
>>
>> I've confirmed that SPR QAT deflate hardware decompression already
>> surpasses LZ4 software decompression on our cloud server setup, which is
>> useful since it greatly improves decompression performance (even compared
>> to software LZ4) and completely offloads the CPU overhead.
>
> Does this measure the overall time of decompression (including the setup
> steps, like the scatter/gather or similar, allocating requests, waiting,
> etc.)? Comparing that to the library calls plus the input page iteration.
>
> I haven't found any public benchmarks with QAT-enabled compression.
> I'm interested in how it's benchmarked because we've had people pointing
> out that LZ4 itself is very fast, but when the overhead is taken into
> account it reduces the overall performance. Thanks.
Yes, EROFS already supports QAT end-to-end as of the ongoing Linux 6.16 cycle (a sketch of the full test procedure follows the numbers):
Processor: Intel(R) Xeon(R) Platinum 8475B (192 cores)
Memory: 512 GiB
Dataset: enwik9
Test command: fio --filename=enwik9 -rw=read -readonly -bs=4k -ioengine=psync -name=job1
1) $ mkfs.erofs -zdeflate -C1048576 enwik9.dfl enwik9
   $ echo qat_deflate > /sys/fs/erofs/accel
   READ: bw=662MiB/s (694MB/s), 662MiB/s-662MiB/s (694MB/s-694MB/s), io=954MiB (1000MB), run=1440-1440msec
   $ echo > /sys/fs/erofs/accel
   READ: bw=381MiB/s (400MB/s), 381MiB/s-381MiB/s (400MB/s-400MB/s), io=954MiB (1000MB), run=2500-2500msec
2) $ mkfs.erofs -zlz4hc -C1048576 enwik9.lz4 enwik9
   READ: bw=541MiB/s (568MB/s), 541MiB/s-541MiB/s (568MB/s-568MB/s), io=954MiB (1000MB), run=1762-1762msec
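
For reference, here is how such a run can be reproduced end-to-end. The
loop device, mount point, and cache-dropping steps below are my assumptions
about the surrounding setup rather than part of the results above:

  # Build the deflate image (as above), attach it, and mount it as EROFS:
  $ mkfs.erofs -zdeflate -C1048576 enwik9.dfl enwik9
  $ losetup /dev/loop0 enwik9.dfl
  $ mount -t erofs /dev/loop0 /mnt

  # Select the QAT deflate accelerator (an empty write reverts to the
  # software path, as in the second run above):
  $ echo qat_deflate > /sys/fs/erofs/accel

  # Drop page caches so fio measures cold reads, then run the workload:
  $ echo 3 > /proc/sys/vm/drop_caches
  $ fio --filename=/mnt/enwik9 -rw=read -readonly -bs=4k -ioengine=psync -name=job1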
However, in my current test case the cloud disk is slow (I use the cheapest
cloud disk setup because it will be used for rootfs and container images),
so the overall end-to-end workload is I/O bound rather than CPU bound. In
that case, since deflate compresses better (and thus saves disk I/O), it
can surpass LZ4: even though LZ4 decompresses faster, it costs more I/O
because of the larger image. If the storage/CPU combination were CPU bound
instead, the results could well be different.
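
As a rough illustration of that point (the bandwidth and compressed sizes
below are assumed for the sake of argument, not measured from the runs
above): when the disk is the bottleneck, end-to-end read time roughly
tracks the compressed image size, so the better-compressing codec wins even
if its decompressor is slower per byte:

  $ awk 'BEGIN { bw = 200;               # assumed disk bandwidth, MB/s
                 dfl = 330; lz4 = 450;   # assumed compressed sizes, MB
                 printf "deflate: ~%.2fs  lz4hc: ~%.2fs of raw disk reads\n",
                        dfl / bw, lz4 / bw }'
  deflate: ~1.65s  lz4hc: ~2.25s of raw disk reads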
Thanks,
Gao Xiang