On Mon, 2025-03-17 at 14:46 -0700, Alexei Starovoitov wrote:

[...]

> > Looks like there are only two options left:
> > a. propagate read/precision marks in state graph until fixed point is
> >    reached;
> > b. switch to CFG based DFA analysis (will need slot/register type
> >    propagation to identify which scalars can be precise).
> >
> > (a) is more in line with what we do currently and probably less code.
> > But I expect that it would be harder to think about in case if
> > something goes wrong (e.g. we don't print/draw states graph at the
> > moment, I have a debug patch for this that I just cherry-pick).
>
> if (a) is not a ton of code it's worth giving it a shot,
> but while(!converged) propagate_precision()
> will look scary.
> We definitely need to do (b) at the end.

I tried a simpler approach we talked about on Wednesday.
Here is an implementation using strongly connected components:
https://github.com/eddyz87/bpf/tree/scc-instead-of-loop-entry

The idea is as follows:
- Compute strongly connected components on the control flow graph;
- The number of such components is not big for selftests and sched_ext;
- For each component track the following:
  - A flag indicating whether a states loop had ever been formed by
    states originating in this component (a states loop is
    states_equal(... RANGE_WITHIN)). When this flag is set, states
    within the component need to be compared using RANGE_WITHIN logic.
  - A counter of "yet to be explored / currently explored" states
    originating from the component, plus an epoch counter. The epoch
    counter is incremented each time the "yet to be explored" counter
    reaches zero. Verifier states are extended to keep an epoch counter
    if the state's insn_idx is within the component. The epoch counter
    is needed to know whether two verifier states are within the same
    states sub-tree.
- In clean_live_states() mark all state registers/stack slots as read
  and precise if:
  - state->insn_idx is within a strongly connected component;
  - state loops had been formed for states within this component;
  - the epoch recorded in the state differs from the component's
    current epoch, meaning that the sub-tree of child states had
    already been explored and there is no need to apply widening
    within this component.
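To make the bookkeeping a bit more concrete, here is a minimal standalone
sketch of the per-component data and of the condition under which
clean_live_states() would force the marks. All names here (scc_info,
live_states, scc_epoch, should_force_marks, ...) are made up for
illustration and do not match the actual fields/helpers in the branch
above.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-SCC bookkeeping mirroring the list above. */
struct scc_info {
        /* a states loop (states_equal(... RANGE_WITHIN)) was formed by
         * states originating in this component */
        bool states_loop_seen;
        /* "yet to be explored / currently explored" states originating
         * from this component */
        uint32_t live_states;
        /* incremented each time live_states drops back to zero */
        uint32_t epoch;
};

/* The slice of a verifier state this sketch needs: the component epoch
 * recorded when the state was created inside the component. */
struct state {
        int insn_idx;
        uint32_t scc_epoch;
};

/* clean_live_states() would mark all registers/stack slots of the cached
 * state as read and precise when this predicate holds, i.e. when the
 * sub-tree of child states is already fully explored and widening is no
 * longer needed within the component. */
static bool should_force_marks(const struct scc_info *scc, const struct state *st)
{
        return scc                          /* st->insn_idx is inside an SCC       */
            && scc->states_loop_seen        /* states loops were formed in the SCC */
            && st->scc_epoch != scc->epoch; /* child sub-tree already explored     */
}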
This almost works. It falls short for a single selftest:

int cond_break1(const void *ctx)
{
        unsigned long i;
        unsigned int sum = 0;

        for (i = zero; i < ARR_SZ && can_loop; i++)  // Loop A
                sum += i;
        for (i = zero; i < ARR_SZ; i++) {            // Loop B
                barrier_var(i);
                sum += i + arr[i];
                cond_break;
        }
        return sum;
}

The test is still verified, but the number of explored states goes from
10 to 8K. The problem here is that loop B converges first and
clean_live_states() marks `sum` as precise at the entry of loop B.
Then, when a branch originating from loop A reaches the entry of loop B,
it hits a checkpoint and propagates `sum` to the `may_goto` checkpoint
at the beginning of loop A. Hence, the widening logic for `sum` does not
kick in.

There is also a problem with one sched_ext program (lavd_dispatch),
where eager precision marks applied to a cached state confuse
mark_chain_precision, as it hits a bug assert when trying to mark r1
precise while backtracking a function call. This is something to debug
further.

---

I also tried a more complex approach, where:
- state loop backedges are accumulated for each strongly connected
  component;
- when a component is done exploring (using the same branch/epoch logic
  as above), propagate_{liveness,precision} is applied along the
  backedges while precision marks keep changing (a rough sketch of this
  loop is at the end of this message).

It is here:
https://github.com/eddyz87/bpf/tree/read-marks-stead-state-per-scc

The test cases at hand pass (absent_mark_in_the_middle_state{,2}), but
it hits memory protection errors on a complete test_progs run and
requires further debugging.

Overall, I don't like the level of complexity this approach involves.
It appears that it would be better to invest this complexity into a
compiler-style pass computing a conservative approximation of precise
registers / stack slots.
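For completeness, the promised rough sketch of the backedge fixed-point
propagation. Again, every name (backedge, scc_info, propagate_marks,
scc_propagate_backedges) is illustrative; propagate_marks() merely
stands in for the propagate_{liveness,precision} pair and reports
whether any mark changed.

#include <stdbool.h>

struct state;                       /* stand-in for a verifier state    */

struct backedge {
        struct state *from;         /* looping state                    */
        struct state *to;           /* the equal state it loops back to */
        struct backedge *next;
};

struct scc_info {
        struct backedge *backedges; /* accumulated state loop backedges */
};

/* Stand-in for propagate_{liveness,precision}(); returns true if any
 * read/precision mark on 'to' changed. */
extern bool propagate_marks(struct state *to, struct state *from);

/* Called once the component is done exploring (same branch/epoch logic
 * as in the simpler approach): re-walk the backedges until the marks
 * reach a fixed point. */
static void scc_propagate_backedges(struct scc_info *scc)
{
        bool changed;

        do {
                changed = false;
                for (struct backedge *e = scc->backedges; e; e = e->next)
                        changed |= propagate_marks(e->to, e->from);
        } while (changed);
}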