> -----Original Message-----
> From: Barry Song <21cnbao@xxxxxxxxx>
> Sent: Thursday, August 28, 2025 4:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosry.ahmed@xxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> akpm@linux-foundation.org; senozhatsky@xxxxxxxxxxxx;
> linux-crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx;
> Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Gomes, Vinicius <vinicius.gomes@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
>
> > > If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
> > > compression is done with a step size of 1. If the hardware step size is 4,
> > > compression occurs in two steps. If the hardware step size is 6, the first
> > > compression uses a step size of 6, and the second uses a step size of 2.
> > > Do you think this will work?
> >
> > Hi Barry,
> >
> > This would be non-optimal from code simplicity and latency perspectives.
> > One of the benefits of using the hardware accelerator's "batch parallelism"
> > is cost amortization across the batch. We might lose this benefit if we make
> > multiple calls to zswap_compress() to ask the hardware accelerator to
> > batch compress in smaller batches. Compression throughput would also
> > be sub-optimal.
>
> I guess it wouldn’t be an issue if both ZSWAP_MAX_BATCH_SIZE and
> pool->compr_batch_size are powers of two. As you mentioned, we still
> gain improvement with ZSWAP_MAX_BATCH_SIZE batching even when
> pool->compr_batch_size == 1, by compressing pages one by one but
> batching other work such as zswap_entries_cache_alloc_batch() ?
>
> > In my patch-series, the rule is simple: if an algorithm has specified a
> > batch-size, carve out the folio by that "batch-size" # of pages to be
> > compressed as a batch in zswap_compress(). This custom batch-size
> > is capped at ZSWAP_MAX_BATCH_SIZE.
> >
> > If an algorithm has not specified a batch-size, the default batch-size
> > is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
> > # of pages to be compressed as a batch in zswap_compress().
>
> Yes, I understand your rule. However, having two global variables is still
> somewhat confusing. It might be clearer to use a single variable with a
> comment, since one variable can clearly determine the value of the other.
>
> Can we get the batch_size at runtime based on pool->compr_batch_size?
>
> /*
>  * If hardware compression supports batching, we use the hardware step size.
>  * Otherwise, we use ZSWAP_MAX_BATCH_SIZE for batching, but still compress
>  * one page at a time.
>  */
> batch_size = pool->compr_batch_size > 1 ? pool->compr_batch_size :
>              ZSWAP_MAX_BATCH_SIZE;
>
> We probably don’t need this if both pool->compr_batch_size and
> ZSWAP_MAX_BATCH_SIZE are powers of two?

I am not sure I understand this rationale, but I want to reiterate that the
patch-set implements a simple set of rules/design choices to provide a
batching framework for software and hardware compressors. This framework has
shown good performance improvements with both software and hardware
compressors, while unifying the zswap_store()/zswap_compress() code paths.
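To make the design concrete, the shape is roughly as follows (a simplified
sketch of the intent, not the actual patch code; member names other than
compr_batch_size are illustrative here):

struct zswap_pool {
        /* ... existing members ... */

        /* Batch-size the compressor supports natively, capped; 1 if none. */
        u8 compr_batch_size;

        /* Unit by which zswap_store() carves up a folio for compression. */
        u8 batch_size;
};

/* Both values are computed once, at pool creation: */
pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
                             crypto_acomp_batch_size(acomp_ctx->acomp));
pool->batch_size = (pool->compr_batch_size > 1) ?
                   pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;

With that, zswap_store() simply reads pool->batch_size, instead of deriving
it from pool->compr_batch_size on every store.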
As explained before, keeping the two variables as distinct u8 members of
struct zswap_pool is a design choice with these benefits:

1) It saves compute by avoiding this computation in the performance-critical
   zswap_store() code. I have verified that dynamically computing the
   batch_size based on pool->compr_batch_size impacts latency.

2) The memory overhead is minimal: there is at most one zswap_pool active
   at any given time, other than at compressor transitions. The additional
   overhead is one u8, i.e., 1 byte for the single runtime struct.

> > >
> > > I don’t quite understand why you want to save
> > > ZSWAP_MAX_BATCH_SIZE - X resources, since even without hardware
> > > batching you are still allocating all ZSWAP_MAX_BATCH_SIZE resources.
> > > This is the case for all platforms except yours.
> >
> > Not sure I understand.. Just to clarify, this is not done to save on
> > resources, rather for the reasons stated above.
> >
> > We are already saving on resources by only allocating
> > "pool->compr_batch_size" number of resources
> > (*not* ZSWAP_MAX_BATCH_SIZE resources):
> >
> > pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> >                              crypto_acomp_batch_size(acomp_ctx->acomp));
> >
> > For non-Intel platforms, this means only 1 dst buffer is allocated, as
> > explained in the commit log for this patch.
>
> you are right. I misunderstood your code :-)
>
> > "A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> > Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
> > compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
> > proceeds to allocate the necessary compression dst buffers in the
> > per-CPU acomp_ctx."
>
> This is fine, but it still doesn’t provide a strong reason for having
> two global variables when one can fully determine the value of the other.

Hopefully the above provides clarity.

Thanks,
Kanchana

>
> Thanks
> Barry