> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Thursday, May 8, 2025 2:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosry.ahmed@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> senozhatsky@xxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx;
> surenb@xxxxxxxxxx; Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RESEND PATCH v9 00/19] zswap compression batching
>
> On Thu, May 8, 2025 at 1:55 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > On Thu, May 8, 2025 at 12:41 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > >
> > > Compression Batching:
> > > =====================
> > >
> > > This patch-series introduces batch compression of pages in large
> > > folios to improve zswap swapout latency. It preserves the existing
> > > zswap protocols for non-batching software compressors by calling
> > > crypto_acomp sequentially per page in the batch. Additionally, in
> > > support of hardware accelerators that can process a batch as an
> > > integral unit, the patch-series creates generic batching
> > > interfaces in crypto_acomp, and calls the
> > > crypto_acomp_batch_compress() interface in zswap_compress() for
> > > compressors that intrinsically support batching.
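To make the dispatch concrete: conceptually, zswap_compress() ends up
doing something along these lines. This is a simplified sketch for
illustration only; zswap_batch_compress(), zswap_compress_one() and
pool_supports_batching() are illustrative helper names, not the exact
code in the series:

    static bool zswap_compress_folio(struct folio *folio,
                                     struct zswap_entry *entries[],
                                     struct zswap_pool *pool)
    {
            long nr_pages = folio_nr_pages(folio);
            long i;

            /*
             * Hardware that can process the batch as an integral
             * unit: one asynchronous submission for the whole batch,
             * then poll for completion statuses.
             */
            if (pool_supports_batching(pool))
                    return zswap_batch_compress(folio, entries,
                                                nr_pages, pool);

            /*
             * Existing protocol for software compressors: one
             * synchronous crypto_acomp call per page of the batch.
             */
            for (i = 0; i < nr_pages; i++) {
                    if (!zswap_compress_one(folio_page(folio, i),
                                            entries[i], pool))
                            return false;
            }
            return true;
    }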
> > > The patch series provides a proof point by using the Intel
> > > Analytics Accelerator (IAA) for implementing the
> > > compress/decompress batching API using hardware parallelism in
> > > the iaa_crypto driver, and another proof point with a sequential
> > > software compressor, zstd.
> >
> > Any plan on doing hardware accelerated/offloaded/parallelized zstd? :)
> >
> > >
> > > SUMMARY:
> > > ========
> > >
> > > The first proof point is to test with IAA using a sequential call
> > > (fully synchronous, compress one page at a time) vs. a batching
> > > call (fully asynchronous, submit a batch to IAA for parallel
> > > compression, then poll for completion statuses).
> > >
> > > The performance testing data with usemem (30 processes) and a
> > > kernel compilation test using 32 threads show 67%-77% throughput
> > > gains and 28%-32% sys time reduction (usemem30), and 2%-3% sys
> > > time reduction (kernel compilation), when zswap_store() compresses
> > > large folios using IAA batching as compared to IAA sequential.
> > >
> > > The second proof point is to make sure that software algorithms
> > > such as zstd do not regress. The data indicates that, for
> > > sequential software algorithms, a performance gain is achieved.
> > >
> > > With the performance optimizations implemented in patches 18 and
> > > 19 of v9, zstd usemem30 throughput increases by 1%, along with a
> > > 6%-8% sys time reduction. With kernel compilation using zstd, we
> > > get a 0.4%-3.2% reduction in sys time. These optimizations
> > > pertain to common code paths: removing redundant
> > > branches/computations, using prefetchw() on the zswap entry
> > > before it is written, and selectively annotating branches with
> > > likely()/unlikely() compiler directives to minimize the branch
> > > mis-prediction penalty. Additionally, using the batching code for
> > > non-batching compressors to sequentially compress/store batches
> > > of up to ZSWAP_MAX_BATCH_SIZE (8) pages seems to help, most
> > > likely due to cache locality of working-set structures such as
> > > the array of zswap_entry-s for the batch.
> >
> > Nice!
> >
> > >
> > > Our internal validation of zstd with the batching interface vs.
> > > IAA with the batching interface on Emerald Rapids has shown that
> > > IAA compress/decompress batching gives 21.3% more memory savings
> > > as compared to zstd, for a 5% performance loss as compared to the
> > > baseline without any memory pressure. IAA batching demonstrates
> > > more than 2X the memory savings obtained by zstd at this 95%
> > > performance KPI. The compression ratio with IAA is 2.23, and with
> > > zstd 2.96. Even with this compression ratio deficit for IAA,
> > > batching is extremely
> >
> > I'm confused. How does IAA give more memory savings while having a
> > worse compression ratio? How do you define memory savings here?
> >
> > > beneficial. As we improve the compression ratio of the IAA
> > > accelerator, we expect to see even better memory savings with IAA
> > > as compared to software compressors.
> > >
> > >
> > > Batching Roadmap:
> > > =================
> > >
> > > 1) Compression batching within large folios (this series).
> > >
> > > 2) Reclaim batching of hybrid folios:
> > >
> > > We can expect to see even more significant performance and
> > > throughput improvements if we use the parallelism offered by IAA
> > > to do reclaim batching of 4K/large folios (really, any-order
> > > folios), using the zswap_store() high-throughput compression
> > > pipeline to batch-compress the pages comprising these folios, not
> > > just batching within large folios. This is the reclaim batching
> > > patch 13 in v1, which we expect to submit in a separate
> > > patch-series.
> >
> > Are you aware of the current kcompressd work:
> >
> > https://lore.kernel.org/all/20250430082651.3152444-1-qun-wei.lin@xxxxxxxxxxxx/
> >
> > It basically offloads compression work into a separate kernel
> > thread (kcompressd), for kswapd reclaim.
> >
> > This might provide you with a more natural place to perform batch
> > compression - instead of compressing one page at a time from the
> > worker thread's queue, you can grab a batch worth of pages and feed
> > it to IAA.
> >
> > Downside is it only applies to indirect reclaim. Proactive and
> > direct reclaimers are not covered, unfortunately.
> >
> > >
> > > 3) Decompression batching:
> > >
> > > We have developed a zswap load batching interface for IAA to be
> > > used for parallel decompression batching, using
> > > swapin_readahead().
> > >
> > > These capabilities are architected so as to be useful to zswap
> > > and zram. We are actively working on integrating these components
> > > with zram.
> >
> > Yeah, the problem with readahead is you can potentially get
> > different backends in the batch, and modifying readahead code is
> > pretty ugly :) But we'll see...
>
> Another place where you can do decompression batching is for zswap
> writeback :) Right now, we are decompressing the pages and writing
> them back one page at a time. You can, however, grab a batch worth
> of them, feed it to IAA for processing, before submitting them all
> for IO :)

Thanks Nhat, great idea!
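If I understand correctly, the shape of what you are describing would
be roughly this. A rough sketch only; zswap_batch_decompress() and
zswap_submit_writeback_io() are hypothetical placeholder names, not
your prototype's actual API:

    /*
     * Sketch: decompress a batch of stored entries in one parallel
     * IAA submission, then issue the IO for all of them, instead of
     * the current decompress-one/write-one loop.
     */
    static int zswap_writeback_batch(struct zswap_entry *entries[],
                                     int nr)
    {
            struct folio *folios[ZSWAP_MAX_BATCH_SIZE];
            int i;

            /*
             * One asynchronous submission covering the whole batch,
             * polling for completion statuses.
             */
            if (zswap_batch_decompress(entries, folios, nr))
                    return -EIO;

            /*
             * With the whole batch decompressed, the writeback IO
             * for all nr pages can be submitted back to back.
             */
            for (i = 0; i < nr; i++)
                    zswap_submit_writeback_io(folios[i]);

            return 0;
    }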
> I have a prototype that performs batch writeback (mostly for IO
> efficiency purposes) - lmk if you want to play with it. Problem, as
> usual, is benchmarking :)

Sure. Please do share the patch implementing the batch writeback and I
can give this a try.

Thanks,
Kanchana