> -----Original Message-----
> From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Sent: Thursday, August 14, 2025 10:28 PM
> To: Nhat Pham <nphamcs@xxxxxxxxx>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosry.ahmed@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> senozhatsky@xxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> >
> > Can we get some comments from crypto tree maintainers as well? I feel
> > like this patch series is more crypto patch than zswap patch, at this
> > point.
> >
> > Can we land any zswap parts without the crypto API change? Grasping at
> > straws here, in case we can parallelize the reviewing and merging
> > process.
>
> My preference is for a unified interface that caters to both
> software compression as well as parallel hardware compression.
>
> The reason is that there is a clear advantage in passing a large
> batch of pages to the Crypto API even for software compression:
> at the least we could pack the compressed results together and
> avoid the unnecessary copying of the compressed output that is
> currently done in zswap.
>
> However, since you guys are both happy with this patch-set,
> I'm not going to stand in the way.
>
> But I do want some changes made to the proposed Crypto API interface
> so that it can be reused for IPComp.
>
> In particular, instead of passing an opaque pointer (kernel_data)
> to magically turn on batching, please add a new helper that enables
> batching.
>
> I don't think we need any extra fields in struct acomp_req apart
> from a new field called unit_size. This would be 4096 for zswap;
> it could be the MTU for IPsec.
>
> So add something like this and document that it must be called
> after acomp_request_set_callback() (which should set unit_size to 0):
>
> static inline void acomp_request_set_unit_size(struct acomp_req *req,
>                                                unsigned int du)
> {
>         req->unit = du;
> }
>
> static inline void acomp_request_set_callback(struct acomp_req *req, ...)
> {
>         ...
> +       req->unit = 0;
> }
>
> For the source, nothing needs to be done because the folio can be
> passed in as is.
>
> For the destination, construct an SG list for the pages and pass that
> in. The rule should be that the SG list must contain a sufficient
> number of pages for the compression output based on the given unit
> size.
>
> For the output lengths, just set the lengths in the destination
> SG list after compression. If a page is incompressible (including
> an error), just set the length to a negative value (-ENOSPC could
> be used for incompressible input, as we already do). Even though
> struct scatterlist->length is unsigned, there should be no issue
> with storing a negative value there.

Hi Herbert, Nhat,

Thanks Herbert for these suggestions! I have implemented the new crypto
API and the SG list suggestion.
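To confirm my understanding of the destination-length convention: after
crypto_acomp_compress() returns, the per-page lengths are read back from
the destination SG list, treating scatterlist->length as signed so that
a stored negative errno is recoverable. A minimal sketch of that
read-back (read_batch_dlens() is illustrative only, not a helper in the
patch):

#include <linux/scatterlist.h>

/*
 * Sketch: recover per-unit output lengths (or negative errnos) from
 * the destination SG list after a batch compression.
 */
static void read_batch_dlens(struct scatterlist *dst_sgl,
			     unsigned int nr_comps, int *dlens)
{
	struct scatterlist *sg;
	unsigned int k;

	/*
	 * sg->length is unsigned int, but per the proposed convention a
	 * negative errno (e.g. -ENOSPC for an incompressible unit) may
	 * have been stored there; reading it back as int recovers it.
	 */
	for_each_sg(dst_sgl, sg, nr_comps, k)
		dlens[k] = (int)sg->length;	/* dlens[k] < 0 => error */
}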
While doing so, I was also able to consolidate the new scatterlist-based
zswap_compress() implementation for software and hardware (i.e.,
batching) compressors, within the constraint of not changing anything
below the crypto layer for software compressors.

I wanted to provide some additional details so that you can review the
overall approach and let me know if things look OK. I will rebase the
code to the latest mm-unstable and start working on v12 in the meantime.

1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
   inputs/outputs, with nents set to pool->compr_batch_size (1 for
   software compressors). This adds per-CPU memory overhead; however,
   it is memory that would otherwise have to be allocated on the stack
   in zswap_compress(), and it is less than that stack cost because we
   know exactly how many sg_table scatterlists to allocate for the
   given pool (assuming we don't kmalloc in zswap_compress()). I will
   make sure to quantify the overhead in v12's commit logs.

2) I added new sg_alloc_table_node() and sg_kmalloc_node() helpers to
   facilitate this (a rough sketch of these appears after this list).

3) I added the acomp_request_set_unit_size() helper for batching, and
   initialized the unit_size to 0 in acomp_request_set_callback().
   zswap_cpu_comp_prepare() sets the unit_size to PAGE_SIZE after the
   call to acomp_request_set_callback().

4) Unified code in zswap_compress() for software and hardware
   compressors to use the per-CPU SG lists. There are some unavoidable
   changes for the software path, which uses acomp_req->dlen instead of
   the SG list output length.

5) A trade-off I had to make in the iaa_crypto driver to adhere to the
   new SG list based batching architecture: currently, all calls to
   dma_map_sg() in iaa_crypto_main.c use sg_nents(req->src[dst]). This
   requires the termination marker (sg_init_marker()) to be set
   correctly based on the number of pages in the batch, which in turn
   requires sg_unmark_end() to be called to clear the termination
   marker before returning. All of this adds latency to
   zswap_compress() (i.e., per batch compress call) with the new
   approach, and causes a regression wrt v11.

   To make the new approach functional and performant, I have changed
   all the calls to dma_map_sg() to use an nents of 1. This should not
   be a concern, since it eliminates redundant computes to scan an SG
   list with only one scatterlist for existing kernel users, i.e. zswap
   with the zswap_compress() modifications described in (4). This will
   continue to hold true with the zram IAA batching support I am
   developing. There are no kernel use cases of the iaa_crypto driver
   that would break this assumption.

6) "For the source, nothing needs to be done because the folio could
   be passed in as is.": As far as I know, this cannot be accomplished
   without modifications to the crypto API for software compressors,
   because compressed buffers need to be stored in the zswap/zram
   zs_pools at PAGE_SIZE granularity.
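As mentioned in (2), here is a rough sketch of the two new scatterlist
helpers. This is only to illustrate the intent; the actual v12
implementation may differ (for instance, by sharing __sg_alloc_table()
with sg_alloc_table() instead of open-coding the single-allocation case,
which here assumes nents <= SG_MAX_SINGLE_ALLOC):

#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Sketch: NUMA-aware analogue of the sg_kmalloc() allocator; places
 * the scatterlist array on the requested node.
 */
static struct scatterlist *sg_kmalloc_node(unsigned int nents,
					   gfp_t gfp_mask, int node)
{
	return kmalloc_array_node(nents, sizeof(struct scatterlist),
				  gfp_mask, node);
}

/*
 * Sketch: NUMA-aware analogue of sg_alloc_table(); allocates and
 * initializes the table's scatterlist array on the requested node.
 */
int sg_alloc_table_node(struct sg_table *table, unsigned int nents,
			gfp_t gfp_mask, int node)
{
	struct scatterlist *sgl;

	sgl = sg_kmalloc_node(nents, gfp_mask, node);
	if (!sgl)
		return -ENOMEM;

	sg_init_table(sgl, nents);	/* zeroes entries, sets end marker */
	table->sgl = sgl;
	table->nents = table->orig_nents = nents;

	return 0;
}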
I have validated the above v12 changes applied over the v11
patch-series, on Sapphire Rapids:

1) usemem30: Both zstd and deflate-iaa have performance comparable to
   v11.

2) kernel_compilation test: Mostly better performance than the
   baseline, but worse than v11. Slightly worse sys time than the
   baseline for zstd/PMD.

usemem30 with 64K folios:
=========================

zswap shrinker_enabled = N.

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                   deflate-iaa  deflate-iaa  deflate-iaa
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,153,359   10,856,388   10,714,236
 Avg throughput (KB/s)                  238,445      361,879      357,141
 elapsed time (sec)                       92.61        70.50        68.87
 sys time (sec)                        2,193.59     1,675.32     1,613.11
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                          zstd         zstd         zstd
 -----------------------------------------------------------------------
 Total throughput (KB/s)              6,866,411    6,874,244    6,922,818
 Avg throughput (KB/s)                  228,880      229,141      230,760
 elapsed time (sec)                       96.45        89.05        87.75
 sys time (sec)                        2,410.72     2,150.63     2,090.86
 -----------------------------------------------------------------------

usemem30 with 2M folios:
========================

zswap shrinker_enabled = N.

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                   deflate-iaa  deflate-iaa  deflate-iaa
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,268,929   11,312,195   10,943,491
 Avg throughput (KB/s)                  242,297      377,073      364,783
 elapsed time (sec)                       80.40        68.73        69.19
 sys time (sec)                        1,856.54     1,599.25     1,618.08
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
                          mm-unstable-7-30-2025         v11          v12
 -----------------------------------------------------------------------
 zswap compressor                          zstd         zstd         zstd
 -----------------------------------------------------------------------
 Total throughput (KB/s)              7,560,441    7,627,155    7,600,588
 Avg throughput (KB/s)                  252,014      254,238      253,352
 elapsed time (sec)                       88.89        83.22        87.55
 sys time (sec)                        2,132.05     1,952.98     2,079.26
 -----------------------------------------------------------------------

kernel_compilation with 64K folios:
===================================

zswap shrinker_enabled = Y.

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                      deflate-iaa  deflate-iaa  deflate-iaa
 --------------------------------------------------------------------------
 real_sec                                   901.81       840.60       895.94
 sys_sec                                  2,672.93     2,171.17     2,584.04
 zswpout                                34,700,692   24,076,095   37,725,671
 zswap_written_back_pages                2,612,474    1,451,961    3,050,557
 --------------------------------------------------------------------------

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                             zstd         zstd         zstd
 --------------------------------------------------------------------------
 real_sec                                   882.67       837.21       872.98
 sys_sec                                  3,573.31     2,593.94     3,301.67
 zswpout                                42,768,967   22,660,215   36,810,396
 zswap_written_back_pages                2,109,739      727,919    1,475,480
 --------------------------------------------------------------------------
kernel_compilation with PMD folios:
===================================

zswap shrinker_enabled = Y.

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                      deflate-iaa  deflate-iaa  deflate-iaa
 --------------------------------------------------------------------------
 real_sec                                   838.76       804.83       826.05
 sys_sec                                  3,173.57     2,422.63     3,128.11
 zswpout                                59,544,198   38,093,995   60,072,119
 zswap_written_back_pages                2,726,367      929,614    2,324,707
 --------------------------------------------------------------------------

 --------------------------------------------------------------------------
                             mm-unstable-7-30-2025          v11          v12
 --------------------------------------------------------------------------
 zswap compressor                             zstd         zstd         zstd
 --------------------------------------------------------------------------
 real_sec                                   831.09       813.40       827.84
 sys_sec                                  4,251.11     3,053.95     4,406.65
 zswpout                                59,452,638   35,832,407   63,459,471
 zswap_written_back_pages                1,041,721      423,334    1,162,913
 --------------------------------------------------------------------------

I am still in the process of verifying whether modifying
zswap_decompress() to use the per-CPU SG lists improves
kernel_compilation (a sketch of that change is below), but thought this
would be a good sync point to get your thoughts. I would greatly
appreciate your comments on the approach and trade-offs, and guidance
on how to proceed.
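For concreteness, this is roughly the zswap_decompress() variant under
evaluation, sketched with error handling and the zpool read path elided
(src, page, entry, ret and acomp_ctx follow the existing
zswap_decompress() locals):

	/*
	 * Sketch: reuse entry 0 of the per-CPU SG tables instead of
	 * on-stack struct scatterlist. As with the compress path in
	 * (5), this relies on the driver mapping exactly one
	 * scatterlist (nents == 1) rather than walking to a
	 * termination marker.
	 */
	struct scatterlist *input = acomp_ctx->sg_inputs->sgl;
	struct scatterlist *output = acomp_ctx->sg_outputs->sgl;

	sg_set_buf(input, src, entry->length);
	sg_set_page(output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, input, output,
				 entry->length, PAGE_SIZE);

	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			      &acomp_ctx->wait);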
"v12" zswap.c diff wrt v11:
===========================

diff --git a/mm/zswap.c b/mm/zswap.c
index c30c1f325f57..58ad257e87e8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,6 +152,8 @@ struct crypto_acomp_ctx {
 	struct acomp_req *req;
 	struct crypto_wait wait;
 	u8 **buffers;
+	struct sg_table *sg_inputs;
+	struct sg_table *sg_outputs;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -282,6 +284,16 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 			kfree(acomp_ctx->buffers[i]);
 		kfree(acomp_ctx->buffers);
 	}
+
+	if (acomp_ctx->sg_inputs) {
+		sg_free_table(acomp_ctx->sg_inputs);
+		acomp_ctx->sg_inputs = NULL;
+	}
+
+	if (acomp_ctx->sg_outputs) {
+		sg_free_table(acomp_ctx->sg_outputs);
+		acomp_ctx->sg_outputs = NULL;
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -922,6 +934,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	int cpu_node = cpu_to_node(cpu);
 	int ret = -ENOMEM;
 	u8 i;
@@ -936,7 +949,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		return 0;
 
-	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_node);
 	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 		       pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -960,13 +973,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		 crypto_acomp_batch_size(acomp_ctx->acomp));
 
 	acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
-					  GFP_KERNEL, cpu_to_node(cpu));
+					  GFP_KERNEL, cpu_node);
 	if (!acomp_ctx->buffers)
 		goto fail;
 
 	for (i = 0; i < pool->compr_batch_size; ++i) {
 		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
-						     cpu_to_node(cpu));
+						     cpu_node);
 		if (!acomp_ctx->buffers[i])
 			goto fail;
 	}
@@ -981,6 +994,26 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
+	acomp_ctx->sg_inputs = kmalloc_node(sizeof(*acomp_ctx->sg_inputs),
+					    GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_inputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_inputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
+	acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
+					     GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_outputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_outputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
 	mutex_init(&acomp_ctx->mutex);
 
 	return 0;
@@ -1027,17 +1060,14 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			   struct zswap_entry *entries[], struct zswap_pool *pool,
 			   int node_id)
 {
+	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
 	struct crypto_acomp_ctx *acomp_ctx;
-	struct scatterlist input, output;
 	struct zpool *zpool = pool->zpool;
-
-	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
-	int errors[ZSWAP_MAX_BATCH_SIZE];
-
-	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
-	unsigned int i, j;
-	int err;
+	struct scatterlist *sg;
+	unsigned int i, j, k;
 	gfp_t gfp;
+	int err;
 
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1045,59 +1075,58 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 	mutex_lock(&acomp_ctx->mutex);
 
+	prefetchw(acomp_ctx->sg_inputs->sgl);
+	prefetchw(acomp_ctx->sg_outputs->sgl);
+
 	/*
 	 * Note:
 	 * [i] refers to the incoming batch space and is used to
-	 * index into the folio pages, @entries and @errors.
+	 * index into the folio pages and @entries.
+	 *
+	 * [k] refers to the @acomp_ctx space, as determined by
+	 * @pool->compr_batch_size, and is used to index into
+	 * @acomp_ctx->buffers and @dlens.
 	 */
 	for (i = 0; i < nr_pages; i += nr_comps) {
-		if (nr_comps == 1) {
-			sg_init_table(&input, 1);
-			sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
+		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
+			sg_set_page(sg, folio_page(folio, start + k + i), PAGE_SIZE, 0);
 
-			/*
-			 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
-			 * and hardware-accelerators may won't check the dst buffer size, so
-			 * giving the dst buffer with enough length to avoid buffer overflow.
-			 */
-			sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
-			acomp_request_set_params(acomp_ctx->req, &input,
-						 &output, PAGE_SIZE, PAGE_SIZE);
-
-			errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
-						    &acomp_ctx->wait);
-			if (unlikely(errors[i]))
-				goto compress_error;
-
-			dlens[i] = acomp_ctx->req->dlen;
-		} else {
-			struct page *pages[ZSWAP_MAX_BATCH_SIZE];
-			unsigned int k;
-
-			for (k = 0; k < nr_pages; ++k)
-				pages[k] = folio_page(folio, start + k);
-
-			struct swap_batch_comp_data batch_comp_data = {
-				.pages = pages,
-				.dsts = acomp_ctx->buffers,
-				.dlens = dlens,
-				.errors = errors,
-				.nr_comps = nr_pages,
-			};
-
-			acomp_ctx->req->kernel_data = &batch_comp_data;
-
-			if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
-				goto compress_error;
+		/*
+		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+		 * and hardware-accelerators may won't check the dst buffer size, so
+		 * giving the dst buffer with enough length to avoid buffer overflow.
+		 */
+		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
+
+		acomp_request_set_params(acomp_ctx->req,
+					 acomp_ctx->sg_inputs->sgl,
+					 acomp_ctx->sg_outputs->sgl,
+					 nr_comps * PAGE_SIZE,
+					 nr_comps * PAGE_SIZE);
+
+		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+				      &acomp_ctx->wait);
+
+		if (unlikely(err)) {
+			if (nr_comps == 1)
+				dlens[0] = err;
+			goto compress_error;
 		}
 
+		if (nr_comps == 1)
+			dlens[0] = acomp_ctx->req->dlen;
+		else
+			for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+				dlens[k] = sg->length;
+
 		/*
 		 * All @nr_comps pages were successfully compressed.
 		 * Store the pages in zpool.
 		 *
 		 * Note:
 		 * [j] refers to the incoming batch space and is used to
-		 * index into the folio pages, @entries, @dlens and @errors.
+		 * index into the folio pages and @entries.
 		 * [k] refers to the @acomp_ctx space, as determined by
 		 * @pool->compr_batch_size, and is used to index into
 		 * @acomp_ctx->buffers.
@@ -1113,7 +1142,7 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			 * non-batching software compressors.
 			 */
 			prefetchw(entries[j]);
-			err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+			err = zpool_malloc(zpool, dlens[k], gfp, &handle, node_id);
 
 			if (unlikely(err)) {
 				if (err == -ENOSPC)
@@ -1124,9 +1153,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 				goto err_unlock;
 			}
 
-			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[k]);
 			entries[j]->handle = handle;
-			entries[j]->length = dlens[j];
+			entries[j]->length = dlens[k];
 		}
 	} /* finished compress and store nr_pages. */
@@ -1134,9 +1163,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 	return true;
 
 compress_error:
-	for (j = i; j < i + nr_comps; ++j) {
-		if (errors[j]) {
-			if (errors[j] == -ENOSPC)
+	for (k = 0; k < nr_comps; ++k) {
+		if (dlens[k] < 0) {
+			if (dlens[k] == -ENOSPC)
 				zswap_reject_compress_poor++;
 			else
 				zswap_reject_compress_fail++;

Thanks,
Kanchana

>
> Cheers,
> --
> Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt