[PATCH] ssdd: mitigate tracee starvation

Derek Barbosa <debarbos@xxxxxxxxxx> · Wed, 20 Aug 2025 12:18:20 -0400

When ssdd is invoked with nforks > 100 && niters == 10000 on a tuned,
realtime kernel, the following error messages can be seen:

forktest#4/8719: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
forktest#1/8716: EXITING, ERROR: wait on PTRACE_SINGLESTEP #398: no SIGCHLD seen (signal count == 0), signo 5
forktest#6/8721: EXITING, ERROR: wait on PTRACE_SINGLESTEP #385: no SIGCHLD seen (signal count == 0), signo 5
forktest#10/8725: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
forktest#11/8726: EXITING, ERROR: wait on PTRACE_SINGLESTEP #388: no SIGCHLD seen (signal count == 0), signo 5
forktest#12/8727: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
forktest#14/8729: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5
forktest#15/8730: EXITING, ERROR: wait on PTRACE_SINGLESTEP #389: no SIGCHLD seen (signal count == 0), signo 5

This behavior is caused by ptrace_stop() being unable to sleep after taking
tasklist_lock.

As forktest() generates "niter" PTRACE_SINGLESTEPs for nforks, in the event
where nforks >= 100, the sporadic test failures caused by missing SIGCHLDs
indicate that the tracees are unable to effectively wait for their asynchronous
signals to arrive, as denoted by the previous sleeps in check_sigchld().

Therefore, by performing an additional sleep() in check_sigchld(), we give the
tracee enough CPU time to call do_notify_parent_cldstop()->send_signal_locked().

After applying this patch, the aforementioned issue is observed to be
mitigated in scenarios with a high nforks value.

Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Derek Barbosa <debarbos@xxxxxxxxxx>
---
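Note for reviewers: the poll-then-fall-back pattern described in the commit
message can be sketched in isolation as below. This is only a minimal
illustration, not the code in ssdd.c. The names sigchld_handler(),
install_sigchld_handler() and wait_for_sigchld() are hypothetical, and the
sketch assumes got_sigchld is a flag set from a SIGCHLD handler.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for usleep() on glibc */
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Assumption: the flag is set asynchronously by a SIGCHLD handler. */
static volatile sig_atomic_t got_sigchld;

static void sigchld_handler(int signo)
{
	(void)signo;
	got_sigchld = 1;
}

/* Hypothetical helper: install the handler, returns 0 on success. */
static int install_sigchld_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);
	sa.sa_handler = sigchld_handler;
	return sigaction(SIGCHLD, &sa, NULL);
}

/* Hypothetical stand-in for the tail of check_sigchld(). */
static int wait_for_sigchld(void)
{
	int i;

	/* Short polls first: up to 10 * 16 ms. */
	for (i = 0; i < 10 && !got_sigchld; i++)
		usleep(16000);

	/* Worst case: yield one more second so a starved tracee can run. */
	if (!got_sigchld)
		sleep(1);

	return got_sigchld;
}
```

Both usleep() and sleep() return early once a signal handler has run, so the
one-second fallback should only cost its full duration when the signal is
genuinely late or missing.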
 src/ssdd/ssdd.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/ssdd/ssdd.c b/src/ssdd/ssdd.c
index 50f7424..7fdb039 100644
--- a/src/ssdd/ssdd.c
+++ b/src/ssdd/ssdd.c
@@ -145,6 +145,15 @@ static int check_sigchld(void)
 	for (i = 0; i < 10 && !got_sigchld; i++)
 		usleep(16000); /* 160 + 150 = 310 msecs */
 
+	/*
+	 * In the _worst case scenario_ where the signal still
+	 * has not arrived: the tracee is starved or
+	 * preempted, and needs more CPU time.
+	 */
+	if (!got_sigchld) {
+		sleep(1);
+	}
+
 	return got_sigchld;
 }
 
-- 
2.50.0