When ssdd is invoked with nforks > 100 && niters == 10000 on a tuned,
realtime kernel, the following error messages can be seen:

  forktest#4/8719: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
  forktest#1/8716: EXITING, ERROR: wait on PTRACE_SINGLESTEP #398: no SIGCHLD seen (signal count == 0), signo 5
  forktest#6/8721: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
  forktest#10/8725: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
  forktest#11/8726: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
  forktest#12/8727: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
  forktest#14/8729: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
  forktest#15/8730: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5

This behavior is caused by ptrace_stop() being unable to sleep after
taking tasklist_lock.

forktest() generates niters PTRACE_SINGLESTEPs for each of the nforks
children. When nforks >= 100, the sporadic test failures caused by
missing SIGCHLDs indicate that the tracees do not get enough time to
deliver their asynchronous signals, even with the existing sleeps in
check_sigchld().

Therefore, by performing an additional sleep() in check_sigchld() when
the signal has still not arrived, we give the tracee enough CPU time to
call do_notify_parent_cldstop()->send_signal_locked().

With this patch applied, the issue is mitigated in scenarios with a
high number of nforks.

Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Derek Barbosa <debarbos@xxxxxxxxxx>
---
 src/ssdd/ssdd.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/ssdd/ssdd.c b/src/ssdd/ssdd.c
index 50f7424..7fdb039 100644
--- a/src/ssdd/ssdd.c
+++ b/src/ssdd/ssdd.c
@@ -145,6 +145,15 @@ static int check_sigchld(void)
 	for (i = 0; i < 10 && !got_sigchld; i++)
 		usleep(16000);		/* 160 + 150 = 310 msecs */
 
+	/*
+	 * In the _worst case scenario_ where the signal still
+	 * has not arrived: the tracee is starved or
+	 * preempted, and needs more CPU time.
+	 */
+	if (!got_sigchld) {
+		sleep(1);
+	}
+
 	return got_sigchld;
 }
 
-- 
2.50.0