On Wed, 2025-08-20 at 12:18 -0400, Derek Barbosa wrote:
> When ssdd is invoked with nforks > 100 && niters == 10000 on a tuned,
> realtime kernel, the following error messages can be seen:
>
> forktest#4/8719: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
> forktest#1/8716: EXITING, ERROR: wait on PTRACE_SINGLESTEP #398: no SIGCHLD seen (signal count == 0), signo 5
> forktest#6/8721: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
> forktest#10/8725: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
> forktest#11/8726: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
> forktest#12/8727: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
> forktest#14/8729: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
> forktest#15/8730: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
>
> This behavior is caused by ptrace_stop() being unable to sleep after taking
> tasklist_lock().
>
> As forktest() generates "niter" PTRACE_SINGLESTEP's for nforks, in the event
> where nforks >= 100, the sporadic test failures caused by missing SIGCHLDs
> indicates that the tracees are unable to effectively wait for their asynchronous
> signals to arrive --as denoted in the previous sleeps for check_sigchld().
>
> Therefore, by performing an addtional sleep() in check_sigchld(), we give the
> tracee enough CPU time to call do_notify_parent_cldstop()->send_signal_locked().
>
> The observed behavior after appling this patch mitigates the aforementioned
> issue in scenarios with a high number of nforks.
>
> Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> Signed-off-by: Derek Barbosa <debarbos@xxxxxxxxxx>
> ---
>  src/ssdd/ssdd.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/src/ssdd/ssdd.c b/src/ssdd/ssdd.c
> index 50f7424..7fdb039 100644
> --- a/src/ssdd/ssdd.c
> +++ b/src/ssdd/ssdd.c
> @@ -145,6 +145,15 @@ static int check_sigchld(void)
>  	for (i = 0; i < 10 && !got_sigchld; i++)
>  		usleep(16000); /* 160 + 150 = 310 msecs */
>  
> +	/*
> +	 * In the _worst case scenario_ where the signal still
> +	 * has not arrived: the tracee is starved or
> +	 * preempted, and needs more CPU time.
> +	 */
> +	if(!got_sigchld){
> +		sleep(1);
> +	}

And then down the road we'll hit a load high enough that an extra second
isn't enough...

How about replacing this whole thing with a call to sigtimedwait()?
Especially if the goal is to do the steps "as fast as possible".

-Crystal
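
For reference, a rough sketch of what a sigtimedwait()-based wait could look
like. This is not taken from ssdd: the helper names block_sigchld() and
wait_sigchld() and the millisecond timeout parameter are made up for
illustration, and it assumes the test could keep SIGCHLD blocked and consume it
synchronously instead of setting got_sigchld from a handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not in ssdd): block SIGCHLD so it stays pending
 * rather than being delivered to a handler. This has to happen before
 * the event that raises the signal, otherwise the signal can be missed.
 */
static int block_sigchld(void)
{
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGCHLD);
	return sigprocmask(SIG_BLOCK, &set, NULL);
}

/*
 * Hypothetical helper (not in ssdd): wait up to timeout_ms for SIGCHLD.
 * Returns 1 if the signal was consumed, 0 on timeout, -1 on other errors.
 */
static int wait_sigchld(long timeout_ms)
{
	sigset_t set;
	struct timespec ts;
	int ret;

	assert(timeout_ms >= 0);

	sigemptyset(&set);
	sigaddset(&set, SIGCHLD);
	ts.tv_sec = timeout_ms / 1000;
	ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

	/*
	 * Retry if a handler for some unrelated signal interrupts the wait.
	 * Simplification: each retry restarts the full timeout.
	 */
	do {
		ret = sigtimedwait(&set, NULL, &ts);
	} while (ret < 0 && errno == EINTR);

	if (ret == SIGCHLD)
		return 1;
	return (errno == EAGAIN) ? 0 : -1;
}
```

With something along these lines, the usleep() loop and the extra sleep(1) in
check_sigchld() could collapse into a single wait_sigchld() call with a
generous upper bound. The call returns as soon as the signal is queued, so a
large timeout only costs time in the failure case.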