Hi SJ, I'm happy to discuss this, and reply below, but I _think_ replying in this thread is really not optimal, as we're digressing quite a bit from the proposal/issue at hand and the cc's now not quite aligned, potentially creating confusion and noise. I know you're in good faith based on your (excellent :) series in this area, so I presume it was just to provide context -as to why you're raising it- more than anything else. This is more of a 'email development is sucky' comment, but I _think_ this would be better as a [DISCUSSION] thread maybe linking this original one back to it or something. But anyway, getting to the point - my answer is simple so not much discussion _really_ required here - you're right, I'm wrong! :) I go into detail inline below: On Sat, May 17, 2025 at 09:20:48AM -0700, SeongJae Park wrote: > Hi Lorenzo, > > On Fri, 16 May 2025 13:57:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > [...] > > Right now madvise() has limited utility because: > [...] > > - While you can perform multiple operations at once via process_madvise(), > > even to the current process (after my changes to extend it), it's limited > > to a single advice over 8 ranges. > > I'm bit confused by the last part, since I'm understanding your point as 'vlen' > parameter of process_madvise() is limited to 8, but my test code below succeeds > with 'vlen' parameter value 512. Could you please enlighten me? Let's keep this simple - I'm just wrong here :) apologies, entirely my fault. We have discussed this a few times before, where I suspect my incorrect assertion on this has led to you also assuming the wrong thing (again, apologies!). But it does raise the important point - we need to re-examine your changes (see [0]) where this assumption reduced the urgency of considering contention issues. Let's take a look at that again on Monday. Though I do strongly suspect it's fine honestly. We just need to take a look...! [0]: https://lore.kernel.org/linux-mm/5fc4e100-70d3-44c1-99f7-f8a5a6a0ba65@lucifer.local/ Anyway, let's dig into the code to get things right: SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, size_t, vlen, int, behavior, unsigned int, flags) { ... struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; struct iov_iter iter; ... ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter); if (ret < 0) goto out; ... } My mistake was assuming that UIO_FASTIOV was the hard limit on size. This is not the case, it's just an optimisation - if the iovec is small enough to fit we use it, otherwise we allocate. We can see this by examining the comment from import_iovec(): /* * ... * If the array pointed to by *@iov is large enough to hold all @nr_segs, * then this function places %NULL in *@iov on return. Otherwise, a new * array will be allocated and the result placed in *@iov. This means that * the caller may call kfree() on *@iov regardless of whether the small * on-stack array was used or not (and regardless of whether this function * returns an error or not). * ... */ ssize_t import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs, unsigned fast_segs, struct iovec **iovp, struct iov_iter *i) { return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i, in_compat_syscall()); } Where nr_segs == vlen, fast_segs == UIO_FASTIOV (8), iovp is &iov, and I think iov referred to by the comment is *iovp (only sensible conclusion, really). Looking into the code further we see: ssize_t __import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs, unsigned fast_segs, struct iovec **iovp, struct iov_iter *i, bool compat) { ... iov = iovec_from_user(uvec, nr_segs, fast_segs, *iovp, compat); ... } struct iovec *iovec_from_user(const struct iovec __user *uvec, unsigned long nr_segs, unsigned long fast_segs, struct iovec *fast_iov, bool compat) { ... if (nr_segs > UIO_MAXIOV) return ERR_PTR(-EINVAL); if (nr_segs > fast_segs) { iov = kmalloc_array(nr_segs, sizeof(struct iovec), GFP_KERNEL); if (!iov) return ERR_PTR(-ENOMEM); } ... } So - this confirms it - we're fine, it just tries to use the stack-based array if it can - otherwise it kmalloc()'s. Of course, UIO_MAXIOV remains the _actual_ hard limit (hardcoded to 1,024 in include/uapi/linux/uio.h). The other points I made about the proposed interface remain, but I won't go into more detail as we are obviously lacking that context here. Thanks for bringing this up and correcting my misinterpretation, as well as providing the below repro code, and let's revisit your old series... but on Monday :) I should really not be looking at work mail on a Saturday (mea culpa, once again... :) One small nit in the repro code below (hey I'm a kernel dev, can't help myself... ;) Cheers, Lorenzo > > Attaching my test code below. You could simply run it as below. > > gcc test.c && ./a.out > > ==== Attachment 0 (test.c) ==== > #define _GNU_SOURCE > > #include <stdio.h> > #include <stdlib.h> > #include <sys/mman.h> > #include <sys/syscall.h> > #include <sys/uio.h> > #include <unistd.h> > > #define SZ_PAGE (4096) > #define NR_PAGES (512) > #define MMAP_SZ (SZ_PAGE * NR_PAGES) > > int main(void) > { > char *buf; > unsigned int i; > int ret; > pid_t pid = getpid(); > int pidfd = syscall(SYS_pidfd_open, pid, 0); > struct iovec *vec; > > buf = mmap(NULL, MMAP_SZ, PROT_READ | PROT_WRITE, MAP_PRIVATE | > MAP_ANON, -1, 0); > if (buf == MAP_FAILED) { > printf("mmap fail\n"); > return -1; > } > > for (i = 0; i < MMAP_SZ; i++) > buf[i] = 123; > > vec = malloc(sizeof(*vec) * NR_PAGES); > for (i = 0; i < NR_PAGES; i++) { > vec[i].iov_base = &buf[i * SZ_PAGE]; > vec[i].iov_len = SZ_PAGE; > } > > ret = syscall(SYS_process_madvise, pidfd, vec, NR_PAGES, > MADV_DONTNEED, 0); > if (ret != MMAP_SZ) { > printf("process_madvise fail\n"); > return -1; > } To be pedantic, you are really only checking to see if an error was returned, in theory no error might have been returned but the operation might have not proceeded, so a more proper check here would be to populated the anon memory with non-zero data, then check afterwards that it's zeroed. Given this outcome would probably imply iovec issues, it's not likely, but to really assert the point you'd probably want to do that! > > ret = munmap(buf, MMAP_SZ); > if (ret) { > printf("munmap failed\n"); > return -1; > } > > close(pidfd); > printf("good\n"); > return 0; > }