RE: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Thu, 14 Aug 2025 22:17:58 +0000

> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Thursday, August 14, 2025 2:15 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosry.ahmed@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> senozhatsky@xxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx;
> surenb@xxxxxxxxxx; Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > This patch introduces a new unified implementation of zswap_compress()
> > for compressors that do and do not support batching. This eliminates
> > code duplication and facilitates maintainability of the code with the
> > introduction of compress batching.
> >
> > The vectorized implementation of calling the earlier zswap_compress()
> > sequentially, one page at a time in zswap_store_pages(), is replaced
> > with this new version of zswap_compress() that accepts multiple pages to
> > compress as a batch.
> >
> > If the compressor does not support batching, each page in the batch is
> > compressed and stored sequentially.
> >
> > If the compressor supports batching, for e.g., 'deflate-iaa', the Intel
> > IAA hardware accelerator, the batch is compressed in parallel in
> > hardware by setting the acomp_ctx->req->kernel_data to contain the
> > necessary batching data before calling crypto_acomp_compress(). If all
> > requests in the batch are compressed without errors, the compressed
> > buffers are then stored in zpool.
> >
> > Another important change this patch makes is with the acomp_ctx mutex
> > locking in zswap_compress(). Earlier, the mutex was held per page's
> > compression. With the new code, [un]locking the mutex per page caused
> > regressions for software compressors when testing with usemem
> > (30 processes) and also kernel compilation with 'allmod' config. The
> > regressions were more eggregious when PMD folios were stored. The
> > implementation in this commit locks/unlocks the mutex once per batch,
> > that resolves the regression.
> >
> > The use of prefetchw() for zswap entries and likely()/unlikely()
> > annotations prevent regressions with software compressors like zstd, and
> > generally improve non-batching compressors' performance with the
> > batching code by ~3%.
> >
> > Architectural considerations for the zswap batching framework:
> >
> ==============================================================
> > We have designed the zswap batching framework to be
> > hardware-agnostic. It has no dependencies on Intel-specific features and
> > can be leveraged by any hardware accelerator or software-based
> > compressor. In other words, the framework is open and inclusive by
> > design.
> >
> > Other ongoing work that can use batching:
> > =========================================
> > This patch-series demonstrates the performance benefits of compress
> > batching when used in zswap_store() of large folios. shrink_folio_list()
> > "reclaim batching" of any-order folios is the major next work that uses
> > the zswap compress batching framework: our testing of kernel_compilation
> > with writeback and the zswap shrinker indicates 10X fewer pages get
> > written back when we reclaim 32 folios as a batch, as compared to one
> > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > submit a patch-series with this data and the resulting performance
> > improvements shortly. Reclaim batching relieves memory pressure faster
> > than reclaiming one folio at a time, hence alleviates the need to scan
> > slab memory for writeback.
> >
> > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > as well as beneficially using decompression batching & block IO batching
> > to improve zswap writeback efficiency.
> >
> > Experiments that combine zswap compress batching, reclaim batching,
> > swapin_readahead() decompression batching of prefetched pages, and
> > writeback batching show that 0 pages are written back with deflate-iaa
> > and zstd. For comparison, the baselines for these compressors see
> > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> >
> > To summarize, these are future clients of the batching framework:
> >
> >    - shrink_folio_list() reclaim batching of multiple folios:
> >        Implemented, will submit patch-series.
> >    - zswap writeback with decompress batching:
> >        Implemented, will submit patch-series.
> >    - zram:
> >        Implemented, will submit patch-series.
> >    - kcompressd:
> >        Not yet implemented.
> >    - file systems:
> >        Not yet implemented.
> >    - swapin_readahead() decompression batching of prefetched pages:
> >        Implemented, will submit patch-series.
> >
> > Additionally, any place we have folios that need to be compressed, can
> > potentially be parallelized.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >  mm/swap.h  |  23 ++++++
> >  mm/zswap.c | 201 ++++++++++++++++++++++++++++++++++++++----------
> -----
> >  2 files changed, 168 insertions(+), 56 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 911ad5ff0f89f..2afbf00f59fea 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -11,6 +11,29 @@ extern int page_cluster;
> >  #include <linux/swapops.h> /* for swp_offset */
> >  #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +/* linux/mm/zswap.c */
> > +/*
> > + * A compression algorithm that wants to batch
> compressions/decompressions
> > + * must define its own internal data structures that exactly mirror
> > + * @struct swap_batch_comp_data and @struct
> swap_batch_decomp_data.
> > + */
> > +struct swap_batch_comp_data {
> > +       struct page **pages;
> > +       u8 **dsts;
> > +       unsigned int *dlens;
> > +       int *errors;
> > +       u8 nr_comps;
> > +};
> > +
> > +struct swap_batch_decomp_data {
> > +       u8 **srcs;
> > +       struct page **pages;
> > +       unsigned int *slens;
> > +       unsigned int *dlens;
> > +       int *errors;
> > +       u8 nr_decomps;
> > +};
> 
> This struct is not being used yet right? I assume this is used for
> batch zswap load and writeback etc.

Yes, you are right :)

> 
> Can we introduce them when those series are sent out? Just to limit
> the amount of reviewing here :)

Sure,  I can make this change in v12.

> 
> > +
> >  /* linux/mm/page_io.c */
> >  int sio_pool_init(void);
> >  struct swap_iocb;
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8ca69c3f30df2..c30c1f325f573 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -35,6 +35,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/workqueue.h>
> >  #include <linux/list_lru.h>
> > +#include <linux/prefetch.h>
> >
> >  #include "swap.h"
> >  #include "internal.h"
> > @@ -988,71 +989,163 @@ static int zswap_cpu_comp_prepare(unsigned
> int cpu, struct hlist_node *node)
> >         return ret;
> >  }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > -                          struct zswap_pool *pool)
> > +/*
> > + * Unified code path for compressors that do and do not support batching.
> This
> > + * procedure will compress multiple @nr_pages in @folio starting from the
> > + * @start index.
> > + *
> > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> zswap_store() makes
> > + * sure of this by design.
> 
> Maybe add a VM_WARN_ON_ONCE(nr_pages <= ZSWAP_MAX_BATCH_SIZE);
> in
> zswap_store_pages() to codify this design choice?
> 
> > + *
> > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> compressor does not
> > + * support batching.
> > + *
> > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > + *
> > + * If @pool->compr_batch_size is > 1, compression batching is invoked,
> except if
> > + * @nr_pages is 1: if so, we call the fully synchronous non-batching
> > + * crypto_acomp API.
> > + *
> > + * In both cases, if all compressions are successful, the compressed buffers
> > + * are stored in zpool.
> > + *
> > + * A few important changes made to not regress and in fact improve
> > + * compression performance with non-batching software compressors,
> using this
> > + * new/batching code:
> > + *
> > + * 1) acomp_ctx mutex locking:
> > + *    Earlier, the mutex was held per page compression. With the new code,
> > + *    [un]locking the mutex per page caused regressions for software
> > + *    compressors. We now lock the mutex once per batch, which resolves
> the
> > + *    regression.
> 
> Makes sense, yeah.

Thanks!

> 
> > + *
> > + * 2) The prefetchw() and likely()/unlikely() annotations prevent
> > + *    regressions with software compressors like zstd, and generally improve
> > + *    non-batching compressors' performance with the batching code by
> ~3%.
> > + */
> > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> nr_pages,
> > +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> > +                          int node_id)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > -       int comp_ret = 0, alloc_ret = 0;
> > -       unsigned int dlen = PAGE_SIZE;
> > -       unsigned long handle;
> > -       struct zpool *zpool;
> > +       struct zpool *zpool = pool->zpool;
> > +
> > +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > +       int errors[ZSWAP_MAX_BATCH_SIZE];
> > +
> > +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > +       unsigned int i, j;
> > +       int err;
> >         gfp_t gfp;
> > -       u8 *dst;
> > +
> > +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> >
> >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffers[0];
> > -       sg_init_table(&input, 1);
> > -       sg_set_page(&input, page, PAGE_SIZE, 0);
> > -
> >         /*
> > -        * We need PAGE_SIZE * 2 here since there maybe over-compression
> case,
> > -        * and hardware-accelerators may won't check the dst buffer size, so
> > -        * giving the dst buffer with enough length to avoid buffer overflow.
> > +        * Note:
> > +        * [i] refers to the incoming batch space and is used to
> > +        *     index into the folio pages, @entries and @errors.
> >          */
> > -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +       for (i = 0; i < nr_pages; i += nr_comps) {
> > +               if (nr_comps == 1) {
> > +                       sg_init_table(&input, 1);
> > +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
> >
> > -       /*
> > -        * it maybe looks a little bit silly that we send an asynchronous request,
> > -        * then wait for its completion synchronously. This makes the process
> look
> > -        * synchronous in fact.
> > -        * Theoretically, acomp supports users send multiple acomp requests in
> one
> > -        * acomp instance, then get those requests done simultaneously. but in
> this
> > -        * case, zswap actually does store and load page by page, there is no
> > -        * existing method to send the second page before the first page is
> done
> > -        * in one thread doing zwap.
> > -        * but in different threads running on different cpu, we have different
> > -        * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > -        */
> > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -       dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +                       /*
> > +                        * We need PAGE_SIZE * 2 here since there maybe over-
> compression case,
> > +                        * and hardware-accelerators may won't check the dst buffer
> size, so
> > +                        * giving the dst buffer with enough length to avoid buffer
> overflow.
> > +                        */
> > +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> > +                       acomp_request_set_params(acomp_ctx->req, &input,
> > +                                                &output, PAGE_SIZE, PAGE_SIZE);
> > +
> > +                       errors[i] =
> crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > +                                                   &acomp_ctx->wait);
> > +                       if (unlikely(errors[i]))
> > +                               goto compress_error;
> > +
> > +                       dlens[i] = acomp_ctx->req->dlen;
> > +               } else {
> > +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > +                       unsigned int k;
> > +
> > +                       for (k = 0; k < nr_pages; ++k)
> > +                               pages[k] = folio_page(folio, start + k);
> > +
> > +                       struct swap_batch_comp_data batch_comp_data = {
> > +                               .pages = pages,
> > +                               .dsts = acomp_ctx->buffers,
> > +                               .dlens = dlens,
> > +                               .errors = errors,
> > +                               .nr_comps = nr_pages,
> > +                       };
> > +
> > +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> > +
> > +                       if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
> > +                               goto compress_error;
> 
> I assume this is a new crypto API?

Not exactly a new API, rather a new "void *kernel_data" member added
to the existing "struct acomp_req".

> 
> I'll let Herbert decide whether this makes sense :)

Definitely.

Thanks,
Kanchana

> 
> > +               }
> >
> > -       zpool = pool->zpool;
> > -       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > -       alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
> > -       if (alloc_ret)
> > -               goto unlock;
> > -
> > -       zpool_obj_write(zpool, handle, dst, dlen);
> > -       entry->handle = handle;
> > -       entry->length = dlen;
> > -
> > -unlock:
> > -       if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > -               zswap_reject_compress_poor++;
> > -       else if (comp_ret)
> > -               zswap_reject_compress_fail++;
> > -       else if (alloc_ret)
> > -               zswap_reject_alloc_fail++;
> > +               /*
> > +                * All @nr_comps pages were successfully compressed.
> > +                * Store the pages in zpool.
> > +                *
> > +                * Note:
> > +                * [j] refers to the incoming batch space and is used to
> > +                *     index into the folio pages, @entries, @dlens and @errors.
> > +                * [k] refers to the @acomp_ctx space, as determined by
> > +                *     @pool->compr_batch_size, and is used to index into
> > +                *     @acomp_ctx->buffers.
> > +                */
> > +               for (j = i; j < i + nr_comps; ++j) {
> > +                       unsigned int k = j - i;
> > +                       unsigned long handle;
> > +
> > +                       /*
> > +                        * prefetchw() minimizes cache-miss latency by
> > +                        * moving the zswap entry to the cache before it
> > +                        * is written to; reducing sys time by ~1.5% for
> > +                        * non-batching software compressors.
> > +                        */
> > +                       prefetchw(entries[j]);
> > +                       err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
> > +
> > +                       if (unlikely(err)) {
> > +                               if (err == -ENOSPC)
> > +                                       zswap_reject_compress_poor++;
> > +                               else
> > +                                       zswap_reject_alloc_fail++;
> > +
> > +                               goto err_unlock;
> > +                       }
> > +
> > +                       zpool_obj_write(zpool, handle, acomp_ctx->buffers[k],
> dlens[j]);
> > +                       entries[j]->handle = handle;
> > +                       entries[j]->length = dlens[j];
> > +               }
> > +       } /* finished compress and store nr_pages. */
> >
> >         mutex_unlock(&acomp_ctx->mutex);
> > -       return comp_ret == 0 && alloc_ret == 0;
> > +       return true;
> > +
> > +compress_error:
> > +       for (j = i; j < i + nr_comps; ++j) {
> > +               if (errors[j]) {
> > +                       if (errors[j] == -ENOSPC)
> > +                               zswap_reject_compress_poor++;
> > +                       else
> > +                               zswap_reject_compress_fail++;
> > +               }
> > +       }
> > +
> > +err_unlock:
> > +       mutex_unlock(&acomp_ctx->mutex);
> > +       return false;
> >  }
> >
> >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> *folio)
> > @@ -1590,12 +1683,8 @@ static bool zswap_store_pages(struct folio
> *folio,
> >                 INIT_LIST_HEAD(&entries[i]->lru);
> >         }
> >
> > -       for (i = 0; i < nr_pages; ++i) {
> > -               struct page *page = folio_page(folio, start + i);
> > -
> > -               if (!zswap_compress(page, entries[i], pool))
> > -                       goto store_pages_failed;
> > -       }
> > +       if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> node_id)))
> > +               goto store_pages_failed;
> >
> >         for (i = 0; i < nr_pages; ++i) {
> >                 struct zswap_entry *old, *entry = entries[i];
> > --
> > 2.27.0
> >