On Mon, Apr 28, 2025 at 9:19 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Mon, Apr 14, 2025, James Houghton wrote: > > By using MGLRU's debugfs for invoking test_young() and clear_young(), we > > avoid page_idle's incompatibility with MGLRU, and we can mark pages as > > idle (clear_young()) much faster. > > > > The ability to use page_idle is left in, as it is useful for kernels > > that do not have MGLRU built in. If MGLRU is enabled but is not usable > > (e.g. we can't access the debugfs mount), the test will fail, as > > page_idle is not compatible with MGLRU. > > > > cgroup utility functions have been borrowed so that, when running with > > MGLRU, we can create a memcg in which to run our test. > > > > Other MGLRU-debugfs-specific parsing code has been added to > > lru_gen_util.{c,h}. > > This is not a proper changelog, at least not by upstream KVM standards. Please > rewrite it describe the changes being made, using imperative mood/tone. From > Documentation/process/maintainer-kvm-x86.rst: > > Changelog > ~~~~~~~~~ > Most importantly, write changelogs using imperative mood and avoid pronouns. Right. I'll rewrite the changelog properly. > > > @@ -354,7 +459,12 @@ static int access_tracking_unreliable(void) > > puts("Skipping idle page count sanity check, because NUMA balancing is enabled"); > > return 1; > > } > > + return 0; > > +} > > > > +int run_test_in_cg(const char *cgroup, void *arg) > > static Will change. > > > +{ > > + for_each_guest_mode(run_test, arg); > > Having "separate" flows for MGLRU vs. page_idle is unnecessary. Give the helper > a more common name and use it for both: > > static int run_test_for_each_guest_mode(const char *cgroup, void *arg) > { > for_each_guest_mode(run_test, arg); > return 0; > } Applied for my next version, thanks. > > > return 0; > > } > > > > @@ -372,7 +482,7 @@ static void help(char *name) > > printf(" -v: specify the number of vCPUs to run.\n"); > > printf(" -o: Overlap guest memory accesses instead of partitioning\n" > > " them into a separate region of memory for each vCPU.\n"); > > - printf(" -w: Control whether the test warns or fails if more than 10%\n" > > + printf(" -w: Control whether the test warns or fails if more than 10%%\n" > > " of pages are still seen as idle/old after accessing guest\n" > > " memory. >0 == warn only, 0 == fail, <0 == auto. For auto\n" > > " mode, the test fails by default, but switches to warn only\n" > > @@ -383,6 +493,12 @@ static void help(char *name) > > exit(0); > > } > > > > +void destroy_cgroup(char *cg) > > static. But this is a pointless wrapper, just delete it. Will do. > Using MGLRU on my home box fails. It's full cgroup v2, and has both > CONFIG_IDLE_PAGE_TRACKING=y and MGLRU enabled. > > ==== Test Assertion Failure ==== > access_tracking_perf_test.c:244: false > pid=114670 tid=114670 errno=17 - File exists > 1 0x00000000004032a9: find_generation at access_tracking_perf_test.c:244 > 2 0x00000000004032da: lru_gen_mark_memory_idle at access_tracking_perf_test.c:272 > 3 0x00000000004034e4: mark_memory_idle at access_tracking_perf_test.c:391 > 4 (inlined by) run_test at access_tracking_perf_test.c:431 > 5 0x0000000000403d84: for_each_guest_mode at guest_modes.c:96 > 6 0x0000000000402c61: run_test_for_each_guest_mode at access_tracking_perf_test.c:492 > 7 0x000000000041d8e2: cg_run at cgroup_util.c:382 > 8 0x00000000004027fa: main at access_tracking_perf_test.c:572 > 9 0x00007fa1cb629d8f: ?? ??:0 > 10 0x00007fa1cb629e3f: ?? ??:0 > 11 0x00000000004029d4: _start at ??:? > Could not find a generation with 90% of guest memory (235929 pages). > > Interestingly, if I force the test to use /sys/kernel/mm/page_idle/bitmap, it > passes. > > Please try to reproduce the failure (assuming you haven't already tested that > exact combination of cgroups v2, MGLRU=y, and CONFIG_IDLE_PAGE_TRACKING=y). I > don't have bandwidth to dig any further at this time. Sorry... please see the bottom of this message for a diff that should fix this. It fixes these bugs: 1. Tracking generation numbers without hardware Accessed bit management. (This is addition of lru_gen_last_gen.) 1.5 It does an initial aging pass so that pages always move to newer generations in (or before) the subsequent aging passes. This probably isn't needed given the change I made for (1). 2. Fixes the expected number of pages for guest page sizes > PAGE_SIZE. (This is the move of test_pages. test_pages has also been renamed to avoid shadowing.) 3. Fixes an off-by-one error when looking for the generation with the most pages. Previously it failed to check the youngest generation, which I think is the bug you ran into. (This is the change to lru_gen_util.c.) It might take a couple tweaks to compile in your tree. (It is just a WIP diff from when I was applying changes from your feedback, so it contains partial changes you asked for. > > + if (cg_find_unified_root(cgroup_root, sizeof(cgroup_root), NULL)) > > + ksft_exit_skip("cgroup v2 isn't mounted\n"); > > + > > + new_cg = cg_name(cgroup_root, TEST_MEMCG_NAME); > > + printf("Creating cgroup: %s\n", new_cg); > > + if (cg_create(new_cg) && errno != EEXIST) > > + ksft_exit_skip("could not create new cgroup: %s\n", new_cg); > > + > > + use_lru_gen = true; > > + } else { > > + page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); > > + __TEST_REQUIRE(page_idle_fd >= 0, > > + "Couldn't open /sys/kernel/mm/page_idle/bitmap. " > > + "Is CONFIG_IDLE_PAGE_TRACKING enabled?"); > > + > > + close(page_idle_fd); > > + } > > Splitting the "check" and "execute" into separate if-else statements results in > some compilers complaining about new_cg possibly being unused. The compiler is > probably being a bit stupid, but the code is just as must to blame. There's zero > reason to split this in two, just do everything after the idle_pages_warn_only > and total_pages processing. Code at the bottom (note, you'll have to rebase on > my not-yet-posted series, or undo the use of __open_path_or_exit()). I have applied the below suggestion for the next version of the series. Thanks. > static int run_test_for_each_guest_mode(const char *cgroup, void *arg) > { > for_each_guest_mode(run_test, arg); > return 0; > } > > int main(int argc, char *argv[]) > { > struct test_params params = { > .backing_src = DEFAULT_VM_MEM_SRC, > .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE, > .nr_vcpus = 1, > }; > int page_idle_fd; > int opt; > > guest_modes_append_default(); > > while ((opt = getopt(argc, argv, "hm:b:v:os:w:")) != -1) { > switch (opt) { > case 'm': > guest_modes_cmdline(optarg); > break; > case 'b': > params.vcpu_memory_bytes = parse_size(optarg); > break; > case 'v': > params.nr_vcpus = atoi_positive("Number of vCPUs", optarg); > break; > case 'o': > overlap_memory_access = true; > break; > case 's': > params.backing_src = parse_backing_src_type(optarg); > break; > case 'w': > idle_pages_warn_only = > atoi_non_negative("Idle pages warning", > optarg); > break; > case 'h': > default: > help(argv[0]); > break; > } > } > > if (idle_pages_warn_only == -1) > idle_pages_warn_only = access_tracking_unreliable(); > > /* > * If guest_page_size is larger than the host's page size, the > * guest (memstress) will only fault in a subset of the host's pages. > */ > total_pages = params.nr_vcpus * params.vcpu_memory_bytes / > max_t(uint64_t, memstress_args.guest_page_size, getpagesize()); > > if (lru_gen_usable()) { > bool cg_created = true; > char *test_cg = NULL; > int ret; > > puts("Using lru_gen for aging"); > use_lru_gen = true; > > if (cg_find_controller_root(cgroup_root, sizeof(cgroup_root), "memory")) > ksft_exit_skip("Cannot find memory group controller\n"); > > test_cg = cg_name(cgroup_root, TEST_MEMCG_NAME); > printf("Creating cgroup: %s\n", test_cg); > if (cg_create(test_cg)) { > if (errno == EEXIST) > cg_created = false; > else > ksft_exit_skip("could not create new cgroup: %s\n", test_cg); > } > > /* > * This will fork off a new process to run the test within > * a new memcg, so we need to properly propagate the return > * value up. > */ > ret = cg_run(test_cg, &run_test_for_each_guest_mode, ¶ms); > if (cg_created) > cg_destroy(test_cg); > return ret; > } > > puts("Using page_idle for aging"); > page_idle_fd = __open_path_or_exit("/sys/kernel/mm/page_idle/bitmap", O_RDWR, > "Is CONFIG_IDLE_PAGE_TRACKING enabled?"); > close(page_idle_fd); > run_test_for_each_guest_mode(NULL, ¶ms); > return 0; > } And here is the diff that make the test start working for you: diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c index d4ef201b67055..d4ae29c7dfe35 100644 --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c @@ -90,7 +90,10 @@ static int idle_pages_warn_only = -1; static bool use_lru_gen; /* Total number of pages to expect in the memcg after touching everything */ -static long total_pages; +static long test_pages; + +/* Last generation we found the pages in */ +static int lru_gen_last_gen = -1; struct test_params { /* The backing source for the region of memory. */ @@ -265,11 +268,7 @@ static void lru_gen_mark_memory_idle(struct kvm_vm *vm) struct timespec ts_start; struct timespec ts_elapsed; struct memcg_stats stats; - int found_gens[2]; - - /* Find current generation the pages lie in. */ - lru_gen_read_memcg_stats(&stats, TEST_MEMCG_NAME); - found_gens[0] = find_generation(&stats, total_pages); + int new_gen; /* Make a new generation */ clock_gettime(CLOCK_MONOTONIC, &ts_start); @@ -277,23 +276,24 @@ static void lru_gen_mark_memory_idle(struct kvm_vm *vm) ts_elapsed = timespec_elapsed(ts_start); /* Check the generation again */ - found_gens[1] = find_generation(&stats, total_pages); + new_gen = find_generation(&stats, test_pages); /* * This function should only be invoked with newly-accessed pages, * so pages should always move to a newer generation. */ - if (found_gens[0] >= found_gens[1]) { + if (new_gen <= lru_gen_last_gen) { /* We did not move to a newer generation. */ - long idle_pages = lru_gen_sum_memcg_stats_for_gen(found_gens[1], + long idle_pages = lru_gen_sum_memcg_stats_for_gen(lru_gen_last_gen, &stats); - too_many_idle_pages(min_t(long, idle_pages, total_pages), - total_pages, -1); + too_many_idle_pages(min_t(long, idle_pages, test_pages), + test_pages, -1); } pr_info("%-30s: %ld.%09lds\n", "Mark memory idle (lru_gen)", ts_elapsed.tv_sec, ts_elapsed.tv_nsec); + lru_gen_last_gen = new_gen; } static void assert_ucall(struct kvm_vcpu *vcpu, uint64_t expected_ucall) @@ -410,6 +410,14 @@ static void run_test(enum vm_guest_mode mode, void *arg) vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, params->backing_src, !overlap_memory_access); + /* + * If guest_page_size is larger than the host's page size, the + * guest (memstress) will only fault in a subset of the host's pages. + */ + test_pages = params->nr_vcpus * params->vcpu_memory_bytes / + max(memstress_args.guest_page_size, + (uint64_t)getpagesize()); + memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main); pr_info("\n"); @@ -418,9 +426,18 @@ static void run_test(enum vm_guest_mode mode, void *arg) if (use_lru_gen) { struct memcg_stats stats; - lru_gen_read_memcg_stats(&stats, TEST_MEMCG_NAME); - TEST_ASSERT(lru_gen_sum_memcg_stats(&stats) >= total_pages, - "Not all pages accounted for. Was the memcg set up correctly?"); + /* + * Do a page table scan now. After initial population, aging + * may not cause the pages to move to a newer generation. Do + * an aging pass now so that future aging passes always move + * pages to a newer generation. + */ + printf("Initial aging pass (lru_gen)\n"); + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + TEST_ASSERT(lru_gen_sum_memcg_stats(&stats) >= test_pages, + "Not all pages accounted for (looking for %ld). " + "Was the memcg set up correctly?", test_pages); + access_memory(vm, nr_vcpus, ACCESS_WRITE, "Re-populating memory"); } /* As a control, read and write to the populated memory first. */ @@ -496,7 +513,6 @@ static void help(char *name) void destroy_cgroup(char *cg) { printf("Destroying cgroup: %s\n", cg); - cg_destroy(cg); } int main(int argc, char *argv[]) @@ -541,50 +557,48 @@ int main(int argc, char *argv[]) } } - if (lru_gen_usable()) { - if (cg_find_unified_root(cgroup_root, sizeof(cgroup_root), NULL)) - ksft_exit_skip("cgroup v2 isn't mounted\n"); - - new_cg = cg_name(cgroup_root, TEST_MEMCG_NAME); - printf("Creating cgroup: %s\n", new_cg); - if (cg_create(new_cg) && errno != EEXIST) - ksft_exit_skip("could not create new cgroup: %s\n", new_cg); - - use_lru_gen = true; - } else { - page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); - __TEST_REQUIRE(page_idle_fd >= 0, - "Couldn't open /sys/kernel/mm/page_idle/bitmap. " - "Is CONFIG_IDLE_PAGE_TRACKING enabled?"); - - close(page_idle_fd); - } - if (idle_pages_warn_only == -1) idle_pages_warn_only = access_tracking_unreliable(); - /* - * If guest_page_size is larger than the host's page size, the - * guest (memstress) will only fault in a subset of the host's pages. - */ - total_pages = params.nr_vcpus * params.vcpu_memory_bytes / - max(memstress_args.guest_page_size, - (uint64_t)getpagesize()); - - if (use_lru_gen) { + if (lru_gen_usable()) { + bool cg_created = true; int ret; puts("Using lru_gen for aging"); + use_lru_gen = true; + + if (cg_find_controller_root(cgroup_root, sizeof(cgroup_root), "memory")) + ksft_exit_skip("Cannot find memory cgroup controller\n"); + + new_cg = cg_name(cgroup_root, TEST_MEMCG_NAME); + printf("Creating cgroup: %s\n", new_cg); + if (cg_create(new_cg)) { + if (errno == EEXIST) { + printf("Found existing cgroup"); + cg_created = false; + } + else + ksft_exit_skip("could not create new cgroup: %s\n", new_cg); + } + /* * This will fork off a new process to run the test within * a new memcg, so we need to properly propagate the return * value up. */ ret = cg_run(new_cg, &run_test_in_cg, ¶ms); - destroy_cgroup(new_cg); + if (cg_created) + cg_destroy(new_cg); if (ret) return ret; } else { + page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); + __TEST_REQUIRE(page_idle_fd >= 0, + "Couldn't open /sys/kernel/mm/page_idle/bitmap. " + "Is CONFIG_IDLE_PAGE_TRACKING enabled?"); + + close(page_idle_fd); + puts("Using page_idle for aging"); for_each_guest_mode(run_test, ¶ms); } diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c index 783a1f1028a26..cab54935b160a 100644 --- a/tools/testing/selftests/kvm/lib/lru_gen_util.c +++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c @@ -341,7 +341,7 @@ int lru_gen_find_generation(const struct memcg_stats *stats, min_gen = gen < min_gen ? gen : min_gen; } - for (gen = min_gen; gen < max_gen; ++gen) + for (gen = min_gen; gen <= max_gen; ++gen) /* See if this generation has enough pages. */ if (lru_gen_sum_memcg_stats_for_gen(gen, stats) > pages) return gen;