On Mon, 2025-03-17 at 14:46 -0700, Alexei Starovoitov wrote:

[...]

> > Looks like there are only two options left:
> > a. propagate read/precision marks in state graph until fixed point is
> >    reached;
> > b. switch to CFG based DFA analysis (will need slot/register type
> >    propagation to identify which scalars can be precise).
> >
> > (a) is more in line with what we do currently and probably less code.
> > But I expect that it would be harder to think about in case if
> > something goes wrong (e.g. we don't print/draw states graph at the
> > moment, I have a debug patch for this that I just cherry-pick).
>
> if (a) is not a ton of code it's worth giving it a shot,
> but while(!converged) propagate_precision()
> will look scary.
> We definitely need to do (b) at the end.

I tried a simpler approach we talked about on Wednesday.
Here is an implementation using strongly connected components:
https://github.com/eddyz87/bpf/tree/scc-instead-of-loop-entry

The idea is as follows:
- Compute strongly connected components on the control flow graph;
- The number of such components is not big for selftests and sched_ext;
- For each component track the following:
  - A flag indicating whether a states loop had ever been formed by
    states originating in this component (a states loop is
    states_equal(... RANGE_WITHIN)). When this flag is set, states
    within the component need to be compared using RANGE_WITHIN logic.
  - A counter of "yet to be explored / currently explored" states
    originating from the component, plus an epoch counter. The epoch
    counter is incremented each time the "yet to be explored" counter
    reaches zero. Verifier states are extended to keep an epoch counter
    if the state's insn_idx is within the component. The epoch counter
    is needed to know whether two verifier states are within the same
    states sub-tree.
- In clean_live_states() mark all state registers/stack slots as read
  and precise if:
  - state->insn_idx is within a strongly connected component;
  - state loops had been formed for states within this component;
  - the epoch recorded in the state differs from the component's
    current epoch, meaning that the sub-tree of child states had
    already been explored and there is no need to apply widening
    within this component.
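To make the bookkeeping a bit more concrete, here is a minimal standalone
sketch of the per-component data and of the condition under which
clean_live_states() would force the marks. All names here (scc_info,
live_states, scc_epoch, should_force_marks, ...) are made up for
illustration and do not match the actual fields/helpers in the branch
above.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-SCC bookkeeping mirroring the list above. */
struct scc_info {
        /* a states loop (states_equal(... RANGE_WITHIN)) was formed by
         * states originating in this component */
        bool states_loop_seen;
        /* "yet to be explored / currently explored" states originating
         * from this component */
        uint32_t live_states;
        /* incremented each time live_states drops back to zero */
        uint32_t epoch;
};

/* The slice of a verifier state this sketch needs: the component epoch
 * recorded when the state was created inside the component. */
struct state {
        int insn_idx;
        uint32_t scc_epoch;
};

/* clean_live_states() would mark all registers/stack slots of the cached
 * state as read and precise when this predicate holds, i.e. when the
 * sub-tree of child states is already fully explored and widening is no
 * longer needed within the component. */
static bool should_force_marks(const struct scc_info *scc, const struct state *st)
{
        return scc                          /* st->insn_idx is inside an SCC       */
            && scc->states_loop_seen        /* states loops were formed in the SCC */
            && st->scc_epoch != scc->epoch; /* child sub-tree already explored     */
}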
This almost works. It falls short for a single selftest:

int cond_break1(const void *ctx)
{
        unsigned long i;
        unsigned int sum = 0;

        for (i = zero; i < ARR_SZ && can_loop; i++)  // Loop A
                sum += i;
        for (i = zero; i < ARR_SZ; i++) {            // Loop B
                barrier_var(i);
                sum += i + arr[i];
                cond_break;
        }
        return sum;
}

The test is still verified, but the number of explored states goes from
10 to 8K. The problem here is that loop B converges first and
clean_live_states() marks `sum` as precise at the entry of loop B.
Then, when a branch originating from loop A reaches the entry of loop B,
it hits a checkpoint and propagates `sum` to the `may_goto` checkpoint
at the beginning of loop A. Hence, the widening logic for `sum` does not
kick in.

There is also a problem with one sched_ext program (lavd_dispatch),
where eager precision marks applied to a cached state confuse
mark_chain_precision, as it hits a bug assert when trying to mark r1
precise while backtracking a function call. This is something to debug
further.

---

I also tried a more complex approach, where:
- state loop backedges are accumulated for each strongly connected
  component;
- when a component is done exploring (using the same branch/epoch logic
  as above), propagate_{liveness,precision} is applied along the
  backedges while precision marks keep changing (a rough sketch of this
  loop is at the end of this message).

It is here:
https://github.com/eddyz87/bpf/tree/read-marks-stead-state-per-scc

The test cases at hand pass (absent_mark_in_the_middle_state{,2}), but
it hits memory protection errors on a complete test_progs run and
requires further debugging.

Overall, I don't like the level of complexity this approach involves.
It appears that it would be better to invest this complexity into a
compiler-style pass computing a conservative approximation of precise
registers / stack slots.
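For completeness, the promised rough sketch of the backedge fixed-point
propagation. Again, every name (backedge, scc_info, propagate_marks,
scc_propagate_backedges) is illustrative; propagate_marks() merely
stands in for the propagate_{liveness,precision} pair and reports
whether any mark changed.

#include <stdbool.h>

struct state;                       /* stand-in for a verifier state    */

struct backedge {
        struct state *from;         /* looping state                    */
        struct state *to;           /* the equal state it loops back to */
        struct backedge *next;
};

struct scc_info {
        struct backedge *backedges; /* accumulated state loop backedges */
};

/* Stand-in for propagate_{liveness,precision}(); returns true if any
 * read/precision mark on 'to' changed. */
extern bool propagate_marks(struct state *to, struct state *from);

/* Called once the component is done exploring (same branch/epoch logic
 * as in the simpler approach): re-walk the backedges until the marks
 * reach a fixed point. */
static void scc_propagate_backedges(struct scc_info *scc)
{
        bool changed;

        do {
                changed = false;
                for (struct backedge *e = scc->backedges; e; e = e->next)
                        changed |= propagate_marks(e->to, e->from);
        } while (changed);
}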