> -----Original Message-----
> From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Sent: Thursday, August 14, 2025 10:28 PM
> To: Nhat Pham <nphamcs@xxxxxxxxx>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosry.ahmed@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> senozhatsky@xxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> >
> > Can we get some comments from crypto tree maintainers as well? I feel
> > like this patch series is more crypto patch than zswap patch, at this
> > point.
> >
> > Can we land any zswap parts without the crypto API change? Grasping at
> > straws here, in case we can parallelize the reviewing and merging
> > process.
>
> My preference is for a unified interface that caters to both
> software compression as well as parallel hardware compression.
>
> The reason is that there is a clear advantage in passing a large
> batch of pages to the Crypto API even for software compression:
> at the least we could pack the compressed results together and
> avoid the unnecessary copying of the compressed output that is
> currently done in zswap.
>
> However, since you guys are both happy with this patch-set,
> I'm not going to stand in the way.
>
> But I do want some changes made to the proposed Crypto API interface
> so that it can be reused for IPComp.
>
> In particular, instead of passing an opaque pointer (kernel_data)
> to magically turn on batching, please add a new helper that enables
> batching.
>
> I don't think we need any extra fields in struct acomp_req apart
> from a new field called unit_size. This would be 4096 for zswap;
> it could be the MTU for IPsec.
>
> So add something like this and document that it must be called
> after acomp_request_set_callback() (which should set unit_size to 0):
>
> static inline void acomp_request_set_unit_size(struct acomp_req *req,
>                                                unsigned int du)
> {
>         req->unit = du;
> }
>
> static inline void acomp_request_set_callback(struct acomp_req *req, ...)
> {
>         ...
> +       req->unit = 0;
> }
>
> For the source, nothing needs to be done because the folio can be
> passed in as is.
>
> For the destination, construct an SG list for the pages and pass that
> in. The rule should be that the SG list must contain a sufficient
> number of pages for the compression output based on the given unit
> size.
>
> For the output lengths, just set the lengths in the destination
> SG list after compression. If a page is incompressible (including
> an error), just set the length to a negative value (-ENOSPC could
> be used for incompressible input, as we already do). Even though
> struct scatterlist->length is unsigned, there should be no issue
> with storing a negative value there.

Hi Herbert, Nhat,

Thanks Herbert for these suggestions! I have implemented the new crypto
API and the SG list suggestion.
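To confirm my understanding of the destination-length convention: after
crypto_acomp_compress() returns, the per-page lengths are read back from
the destination SG list, treating scatterlist->length as signed so that
a stored negative errno is recoverable. A minimal sketch of that
read-back (read_batch_dlens() is illustrative only, not a helper in the
patch):

#include <linux/scatterlist.h>

/*
 * Sketch: recover per-unit output lengths (or negative errnos) from
 * the destination SG list after a batch compression.
 */
static void read_batch_dlens(struct scatterlist *dst_sgl,
			     unsigned int nr_comps, int *dlens)
{
	struct scatterlist *sg;
	unsigned int k;

	/*
	 * sg->length is unsigned int, but per the proposed convention a
	 * negative errno (e.g. -ENOSPC for an incompressible unit) may
	 * have been stored there; reading it back as int recovers it.
	 */
	for_each_sg(dst_sgl, sg, nr_comps, k)
		dlens[k] = (int)sg->length;	/* dlens[k] < 0 => error */
}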
While doing so, I was also able to consolidate the new scatterlist-based
zswap_compress() implementation for software and hardware (i.e.,
batching) compressors, within the constraint of not changing anything
below the crypto layer for software compressors.

I wanted to provide some additional details so that you can review the
overall approach and let me know if things look OK. I will rebase the
code to the latest mm-unstable and start working on v12 in the meantime.

1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
   inputs/outputs, with nents set to pool->compr_batch_size (1 for
   software compressors). This adds per-CPU memory overhead; however,
   it is memory that would otherwise have to be allocated on the stack
   in zswap_compress(), and it is less than that stack cost because we
   know exactly how many sg_table scatterlists to allocate for the
   given pool (assuming we don't kmalloc in zswap_compress()). I will
   make sure to quantify the overhead in v12's commit logs.

2) I added new sg_alloc_table_node() and sg_kmalloc_node() helpers to
   facilitate this (a rough sketch of these appears after this list).

3) I added the acomp_request_set_unit_size() helper for batching, and
   initialized the unit_size to 0 in acomp_request_set_callback().
   zswap_cpu_comp_prepare() sets the unit_size to PAGE_SIZE after the
   call to acomp_request_set_callback().

4) Unified code in zswap_compress() for software and hardware
   compressors to use the per-CPU SG lists. There are some unavoidable
   changes for the software path, which uses acomp_req->dlen instead of
   the SG list output length.

5) A trade-off I had to make in the iaa_crypto driver to adhere to the
   new SG list based batching architecture: currently, all calls to
   dma_map_sg() in iaa_crypto_main.c use sg_nents(req->src[dst]). This
   requires the termination marker (sg_init_marker()) to be set
   correctly based on the number of pages in the batch, which in turn
   requires sg_unmark_end() to be called to clear the termination
   marker before returning. All of this adds latency to
   zswap_compress() (i.e., per batch compress call) with the new
   approach, and causes a regression wrt v11.

   To make the new approach functional and performant, I have changed
   all the calls to dma_map_sg() to use an nents of 1. This should not
   be a concern, since it eliminates redundant computes to scan an SG
   list with only one scatterlist for existing kernel users, i.e. zswap
   with the zswap_compress() modifications described in (4). This will
   continue to hold true with the zram IAA batching support I am
   developing. There are no kernel use cases of the iaa_crypto driver
   that would break this assumption.

6) "For the source, nothing needs to be done because the folio could
   be passed in as is.": As far as I know, this cannot be accomplished
   without modifications to the crypto API for software compressors,
   because compressed buffers need to be stored in the zswap/zram
   zs_pools at PAGE_SIZE granularity.
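As mentioned in (2), here is a rough sketch of the two new scatterlist
helpers. This is only to illustrate the intent; the actual v12
implementation may differ (for instance, by sharing __sg_alloc_table()
with sg_alloc_table() instead of open-coding the single-allocation case,
which here assumes nents <= SG_MAX_SINGLE_ALLOC):

#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Sketch: NUMA-aware analogue of the sg_kmalloc() allocator; places
 * the scatterlist array on the requested node.
 */
static struct scatterlist *sg_kmalloc_node(unsigned int nents,
					   gfp_t gfp_mask, int node)
{
	return kmalloc_array_node(nents, sizeof(struct scatterlist),
				  gfp_mask, node);
}

/*
 * Sketch: NUMA-aware analogue of sg_alloc_table(); allocates and
 * initializes the table's scatterlist array on the requested node.
 */
int sg_alloc_table_node(struct sg_table *table, unsigned int nents,
			gfp_t gfp_mask, int node)
{
	struct scatterlist *sgl;

	sgl = sg_kmalloc_node(nents, gfp_mask, node);
	if (!sgl)
		return -ENOMEM;

	sg_init_table(sgl, nents);	/* zeroes entries, sets end marker */
	table->sgl = sgl;
	table->nents = table->orig_nents = nents;

	return 0;
}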
I have validated the above v12 changes applied over the v11
patch-series, on Sapphire Rapids:

1) usemem30: Both zstd and deflate-iaa have performance comparable to
   v11.

2) kernel_compilation test: Mostly better performance than the
   baseline, but worse than v11. Slightly worse sys time than the
   baseline for zstd/PMD.

usemem30 with 64K folios:
=========================

zswap shrinker_enabled = N.

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                   deflate-iaa  deflate-iaa  deflate-iaa
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,153,359   10,856,388   10,714,236
 Avg throughput (KB/s)                  238,445      361,879      357,141
 elapsed time (sec)                       92.61        70.50        68.87
 sys time (sec)                        2,193.59     1,675.32     1,613.11
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                          zstd         zstd         zstd
 -----------------------------------------------------------------------
 Total throughput (KB/s)              6,866,411    6,874,244    6,922,818
 Avg throughput (KB/s)                  228,880      229,141      230,760
 elapsed time (sec)                       96.45        89.05        87.75
 sys time (sec)                        2,410.72     2,150.63     2,090.86
 -----------------------------------------------------------------------

usemem30 with 2M folios:
========================

zswap shrinker_enabled = N.

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                   deflate-iaa  deflate-iaa  deflate-iaa
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,268,929   11,312,195   10,943,491
 Avg throughput (KB/s)                  242,297      377,073      364,783
 elapsed time (sec)                       80.40        68.73        69.19
 sys time (sec)                        1,856.54     1,599.25     1,618.08
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                          zstd         zstd         zstd
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,560,441    7,627,155    7,600,588
 Avg throughput (KB/s)                  252,014      254,238      253,352
 elapsed time (sec)                       88.89        83.22        87.55
 sys time (sec)                        2,132.05     1,952.98     2,079.26
 -----------------------------------------------------------------------

kernel_compilation with 64K folios:
===================================

zswap shrinker_enabled = Y.

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                      deflate-iaa  deflate-iaa  deflate-iaa
 --------------------------------------------------------------------------
 real_sec                                   901.81       840.60       895.94
 sys_sec                                  2,672.93     2,171.17     2,584.04
 zswpout                                34,700,692   24,076,095   37,725,671
 zswap_written_back_pages                2,612,474    1,451,961    3,050,557
 --------------------------------------------------------------------------

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                             zstd         zstd         zstd
 --------------------------------------------------------------------------
 real_sec                                   882.67       837.21       872.98
 sys_sec                                  3,573.31     2,593.94     3,301.67
 zswpout                                42,768,967   22,660,215   36,810,396
 zswap_written_back_pages                2,109,739      727,919    1,475,480
 --------------------------------------------------------------------------
kernel_compilation with PMD folios:
===================================

zswap shrinker_enabled = Y.

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                      deflate-iaa  deflate-iaa  deflate-iaa
 --------------------------------------------------------------------------
 real_sec                                   838.76       804.83       826.05
 sys_sec                                  3,173.57     2,422.63     3,128.11
 zswpout                                59,544,198   38,093,995   60,072,119
 zswap_written_back_pages                2,726,367      929,614    2,324,707
 --------------------------------------------------------------------------

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                             zstd         zstd         zstd
 --------------------------------------------------------------------------
 real_sec                                   831.09       813.40       827.84
 sys_sec                                  4,251.11     3,053.95     4,406.65
 zswpout                                59,452,638   35,832,407   63,459,471
 zswap_written_back_pages                1,041,721      423,334    1,162,913
 --------------------------------------------------------------------------

I am still in the process of verifying whether modifying
zswap_decompress() to use the per-CPU SG lists improves
kernel_compilation (a sketch of that change is below), but thought this
would be a good sync point to get your thoughts. I would greatly
appreciate your comments on the approach and trade-offs, and guidance
on how to proceed.
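For concreteness, this is roughly the zswap_decompress() variant under
evaluation, sketched with error handling and the zpool read path elided
(src, page, entry, ret and acomp_ctx follow the existing
zswap_decompress() locals):

	/*
	 * Sketch: reuse entry 0 of the per-CPU SG tables instead of
	 * on-stack struct scatterlist. As with the compress path in
	 * (5), this relies on the driver mapping exactly one
	 * scatterlist (nents == 1) rather than walking to a
	 * termination marker.
	 */
	struct scatterlist *input = acomp_ctx->sg_inputs->sgl;
	struct scatterlist *output = acomp_ctx->sg_outputs->sgl;

	sg_set_buf(input, src, entry->length);
	sg_set_page(output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, input, output,
				 entry->length, PAGE_SIZE);

	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			      &acomp_ctx->wait);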
"v12" zswap.c diff wrt v11:
===========================

diff --git a/mm/zswap.c b/mm/zswap.c
index c30c1f325f57..58ad257e87e8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,6 +152,8 @@ struct crypto_acomp_ctx {
 	struct acomp_req *req;
 	struct crypto_wait wait;
 	u8 **buffers;
+	struct sg_table *sg_inputs;
+	struct sg_table *sg_outputs;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -282,6 +284,16 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 			kfree(acomp_ctx->buffers[i]);
 		kfree(acomp_ctx->buffers);
 	}
+
+	if (acomp_ctx->sg_inputs) {
+		sg_free_table(acomp_ctx->sg_inputs);
+		acomp_ctx->sg_inputs = NULL;
+	}
+
+	if (acomp_ctx->sg_outputs) {
+		sg_free_table(acomp_ctx->sg_outputs);
+		acomp_ctx->sg_outputs = NULL;
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -922,6 +934,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	int cpu_node = cpu_to_node(cpu);
 	int ret = -ENOMEM;
 	u8 i;
@@ -936,7 +949,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		return 0;
 
-	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_node);
 	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 		       pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -960,13 +973,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		 crypto_acomp_batch_size(acomp_ctx->acomp));
 
 	acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
-					  GFP_KERNEL, cpu_to_node(cpu));
+					  GFP_KERNEL, cpu_node);
 	if (!acomp_ctx->buffers)
 		goto fail;
 
 	for (i = 0; i < pool->compr_batch_size; ++i) {
 		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
-						     cpu_to_node(cpu));
+						     cpu_node);
 		if (!acomp_ctx->buffers[i])
 			goto fail;
 	}
@@ -981,6 +994,26 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
+	acomp_ctx->sg_inputs = kmalloc_node(sizeof(*acomp_ctx->sg_inputs),
+					    GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_inputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_inputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
+	acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
+					     GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_outputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_outputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
 	mutex_init(&acomp_ctx->mutex);
 
 	return 0;
@@ -1027,17 +1060,14 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			   struct zswap_entry *entries[], struct zswap_pool *pool,
 			   int node_id)
 {
+	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
 	struct crypto_acomp_ctx *acomp_ctx;
-	struct scatterlist input, output;
 	struct zpool *zpool = pool->zpool;
-
-	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
-	int errors[ZSWAP_MAX_BATCH_SIZE];
-
-	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
-	unsigned int i, j;
-	int err;
+	struct scatterlist *sg;
+	unsigned int i, j, k;
 	gfp_t gfp;
+	int err;
 
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1045,59 +1075,58 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 	mutex_lock(&acomp_ctx->mutex);
 
+	prefetchw(acomp_ctx->sg_inputs->sgl);
+	prefetchw(acomp_ctx->sg_outputs->sgl);
+
 	/*
 	 * Note:
 	 * [i] refers to the incoming batch space and is used to
-	 * index into the folio pages, @entries and @errors.
+	 * index into the folio pages and @entries.
+	 *
+	 * [k] refers to the @acomp_ctx space, as determined by
+	 * @pool->compr_batch_size, and is used to index into
+	 * @acomp_ctx->buffers and @dlens.
 	 */
 	for (i = 0; i < nr_pages; i += nr_comps) {
-		if (nr_comps == 1) {
-			sg_init_table(&input, 1);
-			sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
+		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
+			sg_set_page(sg, folio_page(folio, start + k + i), PAGE_SIZE, 0);
 
-			/*
-			 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
-			 * and hardware-accelerators may won't check the dst buffer size, so
-			 * giving the dst buffer with enough length to avoid buffer overflow.
-			 */
-			sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
-			acomp_request_set_params(acomp_ctx->req, &input,
-						 &output, PAGE_SIZE, PAGE_SIZE);
-
-			errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
-						    &acomp_ctx->wait);
-			if (unlikely(errors[i]))
-				goto compress_error;
-
-			dlens[i] = acomp_ctx->req->dlen;
-		} else {
-			struct page *pages[ZSWAP_MAX_BATCH_SIZE];
-			unsigned int k;
-
-			for (k = 0; k < nr_pages; ++k)
-				pages[k] = folio_page(folio, start + k);
-
-			struct swap_batch_comp_data batch_comp_data = {
-				.pages = pages,
-				.dsts = acomp_ctx->buffers,
-				.dlens = dlens,
-				.errors = errors,
-				.nr_comps = nr_pages,
-			};
-
-			acomp_ctx->req->kernel_data = &batch_comp_data;
-
-			if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
-				goto compress_error;
+		/*
+		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+		 * and hardware-accelerators may won't check the dst buffer size, so
+		 * giving the dst buffer with enough length to avoid buffer overflow.
+		 */
+		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
+
+		acomp_request_set_params(acomp_ctx->req,
+					 acomp_ctx->sg_inputs->sgl,
+					 acomp_ctx->sg_outputs->sgl,
+					 nr_comps * PAGE_SIZE,
+					 nr_comps * PAGE_SIZE);
+
+		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+				      &acomp_ctx->wait);
+
+		if (unlikely(err)) {
+			if (nr_comps == 1)
+				dlens[0] = err;
+			goto compress_error;
 		}
 
+		if (nr_comps == 1)
+			dlens[0] = acomp_ctx->req->dlen;
+		else
+			for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+				dlens[k] = sg->length;
+
 		/*
 		 * All @nr_comps pages were successfully compressed.
 		 * Store the pages in zpool.
 		 *
 		 * Note:
 		 * [j] refers to the incoming batch space and is used to
-		 * index into the folio pages, @entries, @dlens and @errors.
+		 * index into the folio pages and @entries.
 		 * [k] refers to the @acomp_ctx space, as determined by
 		 * @pool->compr_batch_size, and is used to index into
 		 * @acomp_ctx->buffers.
@@ -1113,7 +1142,7 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			 * non-batching software compressors.
 			 */
 			prefetchw(entries[j]);
-			err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+			err = zpool_malloc(zpool, dlens[k], gfp, &handle, node_id);
 
 			if (unlikely(err)) {
 				if (err == -ENOSPC)
@@ -1124,9 +1153,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 				goto err_unlock;
 			}
 
-			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[k]);
 			entries[j]->handle = handle;
-			entries[j]->length = dlens[j];
+			entries[j]->length = dlens[k];
 		}
 	} /* finished compress and store nr_pages. */
@@ -1134,9 +1163,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 	return true;
 
 compress_error:
-	for (j = i; j < i + nr_comps; ++j) {
-		if (errors[j]) {
-			if (errors[j] == -ENOSPC)
+	for (k = 0; k < nr_comps; ++k) {
+		if (dlens[k] < 0) {
+			if (dlens[k] == -ENOSPC)
 				zswap_reject_compress_poor++;
 			else
 				zswap_reject_compress_fail++;

Thanks,
Kanchana

>
> Cheers,
> --
> Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt