On 08/09/2025 11:36, John Ogness wrote:
> There are still reasons why CLOCK_MONOTONIC_RAW might be
> interesting. For example, if you want a very stable time source to
> compare intervals, but do not care so much about the real world time
> lengths of those intervals (i.e. where is the greatest latency vs. what
> is the value of the greatest latency). Although even here, I doubt
> CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.

In fact, I'm just trying to compare the run-time of two minor variations
of the same program (testing micro-optimizations). Absolute run-time is
not really important; what I really want to know is: does v2 run faster
or slower than v1?

This is the framework I'm using at this point:

#include <stdio.h>
#include <time.h>

extern void my_code(void);

static long loop(int log2)
{
	int n = 1 << log2;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < n; ++i)
		my_code();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
	long t = d >> log2;	/* average ns per call */
	return t;
}

int main(void)
{
	long t, min = loop(4);	/* warm up the caches (16 calls) */

	for (int i = 0; i < 20; ++i)
		if ((t = loop(8)) < min)
			min = t;

	printf("MIN=%ld\n", min);
	return 0;
}

Basically:
- warm up the caches
- run my_code() 256 times && compute the average run-time
- repeat 20 times to find the MINIMUM average run-time

When my_code() is a trivial computational loop such as:

	mov	$(1<<12), %eax
1:	dec	%ecx
	dec	%ecx
	dec	%eax
	jnz	1b
	ret

then running the benchmark 1000 times returns the same value 1000 times:
MIN=2737

Obviously, the program I'm working on is a bit more complex, but barely:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1

$ size a.out
   text	   data	    bss	    dec	    hex	filename
   8549	    632	   1072	  10253	   280d	a.out

When I run the benchmark 1000 times, there are some large outliers:
MIN_MIN=2502
MAX_MIN=2774

NOTE: 95% of the results are below 2536, but the worst 1% (the 10 worst
runs) are really bad (2646-2774).

How do I get repeatable results? Random ~10% outliers break my ability
to measure the impact of micro-optimizations expected to provide 0-3%
improvements :(

For reference, the script launching the benchmark does:

echo -1 > /proc/sys/kernel/sched_rt_runtime_us

for I in 0 1 2 3; do
	echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done
sleep 0.25

for I in 0 1 2 3; do
	echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed
done
sleep 0.25

for I in $(seq 1 1000); do
	chrt -f 99 taskset -c 2 ./a.out < $1
done

for I in 0 1 2 3; do
	echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done

echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

I've run out of ideas to identify other sources of variance.
(I ran everything in single-user mode with nothing else running.)

Perhaps with perf I could identify the source of the stalls or bubbles?

Hoping someone can point me in the right direction.

Regards
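
P.S. To be concrete about the perf idea: this is roughly the kind of run
I had in mind, e.g. replacing the a.out line inside the 1000-iteration
loop above. Just a rough sketch; I'm assuming the usual hardware events
are available on this machine:

perf stat -e cycles,instructions,branch-misses,cache-misses,context-switches \
	chrt -f 99 taskset -c 2 ./a.out < $1

My thinking is: if the slow runs show extra cycles with roughly the same
instruction count, that would point at stalls rather than extra work,
and a non-zero context-switch count would point at the scheduler. Does
that sound like a reasonable way to narrow it down?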