Re: [PATCH v2 7/8] refs: stop unsetting REF_HAVE_OLD for log-only updates

Patrick Steinhardt <ps@xxxxxx> · Tue, 29 Jul 2025 09:54:57 +0200

On Tue, Jul 29, 2025 at 03:14:55AM -0400, Jeff King wrote:
> On Mon, Jul 28, 2025 at 04:43:12PM +0200, Patrick Steinhardt wrote:
> > > > @@ -2508,8 +2507,9 @@ static enum ref_transaction_error check_old_oid(struct ref_update *update,
> > > >  						struct object_id *oid,
> > > >  						struct strbuf *err)
> > > >  {
> > > > -	if (!(update->flags & REF_HAVE_OLD) ||
> > > > -		   oideq(oid, &update->old_oid))
> > > > +	if (update->flags & REF_LOG_ONLY ||
> > > > +	    !(update->flags & REF_HAVE_OLD) ||
> > > > +	    oideq(oid, &update->old_oid))
> > > >  		return 0;
> > > >  
> > > >  	if (is_null_oid(&update->old_oid)) {
> > > 
> > > Which make sense to me. But the weird thing I noticed is that when we do
> > > something similar for split_head_update(), we don't strip REF_HAVE_OLD!
> > 
> > And we shouldn't do that, as in the next commit we actually build on
> > always having `REF_HAVE_OLD` set for reflog-only updates. So I'd argue
> > that the problem is actually the other way round: when splitting off the
> > HEAD update we must resolve the old object ID if `REF_HAVE_OLD` is not
> > set.
> 
> Yeah, I agree that after your patches, split_head_update() should
> definitely not be clearing that flag. What I more meant was: this patch
> is introducing a behavior change for those split HEAD updates, which
> used to do the extra old-oid check but now won't (whereas for other
> symref log-only updates, you are preserving the behavior).
> 
> I _think_ that's a reasonable thing, but I wanted to make sure.
> 
> However...
> 
> > > (For those not familiar with that
> 
> ...did you mean to write more? I know you've been running into weird
> email truncation issues lately.

Sigh. Yes. I really need to figure this out, but I have no clue
whatsoever where to look.

Anyway, here's the remainder of that mail:

On Fri, Jul 25, 2025 at 07:36:10AM -0400, Jeff King wrote:
> On Fri, Jul 25, 2025 at 08:58:29AM +0200, Patrick Steinhardt wrote:
> > @@ -2508,8 +2507,9 @@ static enum ref_transaction_error check_old_oid(struct ref_update *update,
> (For those not familiar with that function, it notices when we are
> updating refs/heads/foo that is pointed-to by HEAD, and then adds an
> extra HEAD reflog update to the transaction).
> 
> So as I understand it, right now we are doing an extra check_old_oid()
> on that log-only HEAD update, and after your patch we would stop doing
> so.
> 
> Which I _think_ is the right thing to do, but it made me wonder if the
> transaction were ever non-atomic. That is, could we split off a log-only
> update that succeeds, even though the old-oid check for the actual
> ref fails?

I think this can only happen the other way round: the log update never
gets persisted unless the parent ref is, as we'd otherwise abort. But
what can happen is that we end up with a broken reflog entry. See below.

> Historically, I'd guess the answer is mostly "no", because the point of
> ref transactions is to be all-or-nothing, and to do the locking and
> old-oid checking before writing out any updates. But I also think I saw
> some discussion of non-atomic transactions recently. I didn't really
> follow it, but is this a potential problem?

I'd say that the whole logic has always been flawed: we resolve the
target that "HEAD" points to without locking the reference. Consequently
we have a race in case "HEAD" got updated to point somewhere else, as
we'd still write a reflog entry to "HEAD".

What used to save us a bit is that at least the old object ID would be
correct in such a case because we used to verify it even for log-only
updates. But whether or not we write such a reflog message in the first
place is subject to a race. And when `REF_HAVE_OLD` wasn't set we might
even end up writing a different old object ID than what we write to the
target reference.

Theoretically speaking we'd have to lock "HEAD" immediately after we
have resolved it to ensure that it doesn't change anymore. But that lock
would be quite restrictive.

I guess the next-best thing that we can do is to lock "HEAD" as soon as
we find the ref that it's pointing to. If so, we can re-evaluate whether
"HEAD" still points to the same ref -- and if so, we split off the
update for the reflog. If it doesn't anymore then all bets are off, as
it may be the case that "HEAD" has now been changed to point to a ref
that has already been processed by us.

I guess the safest bet would be to just abort the whole transaction in
that case? After all it is a racy update, but it feels heavy-handed to
reject the whole transaction only because we fail to write a reflog
entry.

But even that doesn't solve this race completely: "HEAD" might have been
unborn at the start of a transaction or refer to a target ref that isn't
updated as part of the ref. So we wouldn't ever get to re-resolving the
ref. We could double check at the end of the transaction whether "HEAD"
has changed, but that isn't really working either as its target ref may
have flip-flopped.

In any case, I think we can improve the situation at least a bit:

  - We lock the parent update before calling `split_head_update()`. This
    ensures the old object ID is resolved already and cannot change
    anymore.

  - In `split_head_update()` we take in the parent lock as a parameter.
    If `REF_HAVE_OLD` is unset we take the old object ID from the parent
    lock and set `REF_HAVE_OLD` for the reflog entry. This ensures that
    we at least use the same old ID for both ref and reflog updates.

But the other race, that "HEAD" may have changed concurrently... I don't
think that one can be plugged without a bigger effort.

Patrick