On Mon, Aug 4, 2025 at 5:30 PM SeongJae Park <sj@xxxxxxxxxx> wrote: > > When zswap writeback is enabled and it fails compressing a given page, > the page is swapped out to the backing swap device. This behavior > breaks the zswap's writeback LRU order, and hence users can experience > unexpected latency spikes. If the page is compressed without failure, > but results in a size of PAGE_SIZE, the LRU order is kept, but the > decompression overhead for loading the page back on the later access is > unnecessary. > > Keep the LRU order and optimize unnecessary decompression overheads in > the cases, by storing the original content in zpool as-is. The length > field of zswap_entry will be set appropriately, as PAGE_SIZE, Hence > whether it is saved as-is or not (whether decompression is unnecessary) > is identified by 'zswap_entry->length == PAGE_SIZE'. > > So this change is not increasing per zswap entry metadata overhead. But > as the number of incompressible pages increases, total zswap metadata > overhead is proportionally increased. The overhead should not be > problematic in usual cases, since the zswap metadata for single zswap > entry is much smaller than PAGE_SIZE, and in common zswap use cases > there should be a sufficient amount of compressible pages. Also it can > be mitigated by the zswap writeback. > > When a severe memory pressure comes from memcg's memory.high, storing > incompressible pages as-is may result in reducing accounted memory > footprint slower, since the footprint will be reduced only after the > zswap writeback kicks in. This can incur higher penalty_jiffies and > degrade the performance. Arguably this is just a wrong setup, but we > don't want to introduce unnecessary surprises. Add a parameter, namely > 'save_incompressible_pages', to turn the feature on/off as users want. > It is turned off by default. > > When the writeback is disabled, the additional overhead could be > problematic. For the case, keep the current behavior that just returns > the failure and let swap_writeout() put the page back to the active LRU > list in the case. It is known to be suboptimal when the incompressible > pages are cold, since the incompressible pages will continuously be > tried to be zswapped out, and burn CPU cycles for compression attempts > that will anyway fails. One imaginable solution for the problem is > reusing the swapped-out page and its struct page to store in the zswap > pool. But that's out of the scope of this patch. > > Tests > ----- > > I tested this patch using a simple self-written microbenchmark that is > available at GitHub[1]. You can reproduce the test I did by executing > run_tests.sh of the repo on your system. Note that the repo's > documentation is not good as of this writing, so you may need to read > and use the code. > > The basic test scenario is simple. Run a test program making artificial > accesses to memory having artificial content under memory.high-set > memory limit and measure how many accesses were made in given time. > > The test program repeatedly and randomly access three anonymous memory > regions. The regions are all 500 MiB size, and accessed in the same > probability. Two of those are filled up with a simple content that can > easily be compressed, while the remaining one is filled up with a > content that read from /dev/urandom, which is easy to fail at > compressing to <PAGE_SIZE size. The program runs for two minutes and > prints out the number of accesses made every five seconds. > > The test script runs the program under below seven configurations. > > - 0: memory.high is set to 2 GiB, zswap is disabled. > - 1-1: memory.high is set to 1350 MiB, zswap is disabled. > - 1-2: Same to 1-1, but zswap is enabled. > - 1-3: Same to 1-2, but save_incompressible_pages is turned on. > - 2-1: memory.high is set to 1200 MiB, zswap is disabled. > - 2-2: Same to 2-1, but zswap is enabled. > - 2-3: Same to 2-2, but save_incompressible_pages is turned on. > > For all zswap enabled case, zswap shrinker is enabled. > > Configuration '0' is for showing the original memory performance. > Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap, > zswap, and this patch under a level of memory pressure (~10% of working > set). > > Configurations 2-1, 2-2 and 2-3 are similar to 1-1, 1-2 and 1-3 but to > show those under a severe level of memory pressure (~20% of the working > set). > > Because the per-5 seconds performance is not very reliable, I measured > the average of that for the last one minute period of the test program > run. I also measured a few vmstat counters including zswpin, zswpout, > zswpwb, pswpin and pswpout during the test runs. > > The measurement results are as below. To save space, I show performance > numbers that are normalized to that of the configuration '0' (no memory > pressure), only. The averaged accesses per 5 seconds of configuration > '0' was 36493417.75. > > config 0 1-1 1-2 1-3 2-1 2-2 2-3 > perf_normalized 1.0000 0.0057 0.0235 0.0367 0.0031 0.0122 0.0077 > perf_stdev_ratio 0.0582 0.0652 0.0167 0.0346 0.0404 0.0145 0.0613 > zswpin 0 0 3548424 1999335 0 2912972 1612517 > zswpout 0 0 3588817 2361689 0 2996588 2029884 > zswpwb 0 0 10214 340270 0 34625 382117 > pswpin 0 485806 772038 340967 540476 874909 790418 > pswpout 0 649543 144773 340270 692666 275178 382117 > > 'perf_normalized' is the performance metric, normalized to that of > configuration '0' (no pressure). 'perf_stdev_ratio' is the standard > deviation of the averaged data points, as a ratio to the averaged metric > value. For example, configuration '0' performance was showing 5.8% > stdev. Configurations 1-1 and 1-3 were having about 6.5% and 6.1% > stdev. Also the results were highly variable between multiple runs. So > this result is not very stable but just showing ball park figures. > Please keep this in your mind when reading these results. > > Under about 10% of working set memory pressure, the performance was > dropped to about 0.57% of no-pressure one, when the normal swap is used > (1-1). Actually ~10% working set pressure is not a mild one, at least > on this test setup. > > By turning zswap on (1-2), the performance was improved about 4x, > resulting in about 2.35% of no-pressure one. Because of the > incompressible pages in the third memory region, a significant amount of > (non-zswap) swap I/O operations were made, though. > > By enabling the incompressible pages handling feature that is introduced > by this patch (1-3), about 56% performance improvement was made, > resulting in about 3.67% of no-pressure one. Reduced pswpin of 1-3 > compared to 1-2 let us see where this improvement came from. > > Under about 20% of working set memory pressure, which could be extreme, > the performance drops down to 0.31% of no-pressure one when only the > normal swap is used (2-1). Enabling zswap significantly improves it, up > to 1.22%, though again showing a significant number of (non-zswap) swap > I/O due to incompressible pages. > > Enabling the incompressible pages handling feature of this patch (2-3) > didn't reduce non-zswap swap I/O, because the memory pressure is too > severe to let nearly all zswap pages including the incompressible pages > written back by zswap shrinker. And because the memory usage is not > dropped as soon as incompressible pages are swapped out but only after > those are written back by shrinker, memory.high apparently applied more > penalty_jiffies. As a result, the performance became even worse than > 2-2 about 36.88%, resulting in 0.07% of the no-pressure one. > > 20% of working set memory pressure is pretty extreme, but anyway the > incompressible pages handling feature could make it worse in certain > setups. Hence add the parameter for turning the feature on/off as > needed, and disable it by default. > > Related Works > ------------- > > This is not an entirely new attempt. Nhat Pham and Takero Funaki tried > very similar approaches in October 2023[2] and April 2024[3], > respectively. The two approaches didn't get merged mainly due to the > metadata overhead concern. I described why I think that shouldn't be a > problem for this change, which is automatically disabled when writeback > is disabled, at the beginning of this changelog. > > This patch is not particularly different from those, and actually built > upon those. I wrote this from scratch again, though. Hence adding > Suggested-by tags for them. Actually Nhat first suggested this to me > offlist. > > [1] https://github.com/sjp38/eval_zswap/blob/master/run.sh > [2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@xxxxxxxxx > [3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@xxxxxxxxx > > Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx> > Suggested-by: Takero Funaki <flintglass@xxxxxxxxx> > Signed-off-by: SeongJae Park <sj@xxxxxxxxxx> > --- > Changes from RFC v1 > (https://lore.kernel.org/20250730234059.4603-1-sj@xxxxxxxxxx) > - Consider PAGE_SIZE-resulting compression successes as failures. > - Use zpool for storing incompressible pages. > - Test with zswap shrinker enabled. > - Wordsmith changelog and comments. > - Add documentation of save_incompressible_pages parameter. > > Documentation/admin-guide/mm/zswap.rst | 9 +++++ > mm/zswap.c | 53 +++++++++++++++++++++++++- > 2 files changed, 61 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > index c2806d051b92..20eae0734491 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -142,6 +142,15 @@ User can enable it as follows:: > This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is > selected. > > +If a page cannot be compressed into a size smaller than PAGE_SIZE, it can be > +beneficial to save the content as is without compression, to keep the LRU > +order. Users can enable this behavior, as follows:: > + > + echo Y > /sys/module/zswap/parameters/save_incompressible_pages > + > +This is disabled by default, and doesn't change behavior of zswap writeback > +disabled case. > + > A debugfs interface is provided for various statistic about pool size, number > of pages stored, same-value filled pages and various counters for the reasons > pages are rejected. > diff --git a/mm/zswap.c b/mm/zswap.c > index 7e02c760955f..6e196c9a4dba 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -129,6 +129,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED( > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); > > +/* Enable/disable incompressible pages storing */ > +static bool zswap_save_incompressible_pages; > +module_param_named(save_incompressible_pages, zswap_save_incompressible_pages, > + bool, 0644); > + > bool zswap_is_enabled(void) > { > return zswap_enabled; > @@ -937,6 +942,29 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx) > mutex_unlock(&acomp_ctx->mutex); > } > > +/* > + * Determine whether to save given page as-is. > + * > + * If a page cannot be compressed into a size smaller than PAGE_SIZE, it can be > + * beneficial to saving the content as is without compression, to keep the LRU > + * order. This can increase memory overhead from metadata, but in common zswap > + * use cases where there are sufficient amount of compressible pages, the > + * overhead should be not critical, and can be mitigated by the writeback. > + * Also, the decompression overhead is optimized. > + * > + * When the writeback is disabled, however, the additional overhead could be > + * problematic. For the case, just return the failure. swap_writeout() will > + * put the page back to the active LRU list in the case. > + */ > +static bool zswap_save_as_is(int comp_ret, unsigned int dlen, > + struct page *page) > +{ > + return zswap_save_incompressible_pages && > + (comp_ret || dlen == PAGE_SIZE) && > + mem_cgroup_zswap_writeback_enabled( > + folio_memcg(page_folio(page))); > +} > + > static bool zswap_compress(struct page *page, struct zswap_entry *entry, > struct zswap_pool *pool) > { > @@ -976,8 +1004,13 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > */ > comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait); > dlen = acomp_ctx->req->dlen; > - if (comp_ret) > + if (zswap_save_as_is(comp_ret, dlen, page)) { > + comp_ret = 0; > + dlen = PAGE_SIZE; > + memcpy_from_page(dst, page, 0, dlen); > + } else if (comp_ret) { > goto unlock; > + } > > zpool = pool->zpool; > gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE; > @@ -1001,6 +1034,17 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > return comp_ret == 0 && alloc_ret == 0; > } > > +/* > + * If save_incompressible_pages is set and writeback is enabled, incompressible > + * pages are saved as is without compression. For more details, refer to the > + * comments of zswap_save_as_is(). > + */ > +static bool zswap_saved_as_is(struct zswap_entry *entry, struct folio *folio) > +{ > + return entry->length == PAGE_SIZE && zswap_save_incompressible_pages && > + mem_cgroup_zswap_writeback_enabled(folio_memcg(folio)); > +} Actually, this might not be safe either :( What if we have the following sequence: 1. Initially, the cgroup is writeback enabled. We encounter an incompressible page, and store it as-is in the zswap pool. 2. Some userspace agent (systemd or whatever) runs, and disables zswap writeback on the cgroup. 3. At fault time, zswap_saved_as_is() returns false, so we'll treat the page-sized stored object as compressed, and attempt to decompress it. This is a memory corruption. I think you can trigger a similar bug, if you enable zswap_save_incompressible_pages initially, then disable it later on. I think you have to do the following: 1. At store time, if comp_ret or dlen == PAGE_SIZE, treat it as compression failure. This means: saving as-is when writeback enabled, and rejecting when writeback disabled. Basically: if (!comp_ret || dlen == PAGE_SIZE) { if (zswap_save_incompressible_pages && mem_cgroup_zswap_writeback_enabled(folio_memcg(page_folio(folio)))) { /* save as-is */ } else { /* rejects */ } } 2. At load time, just check that dlen == PAGE_SIZE. We NEVER store PAGE_SIZE "compressed" page, so we can safely assume that it is the original, uncompressed data.