On 2025-04-04 at 20:57:40, Jeff King wrote: > On Fri, Apr 04, 2025 at 12:58:38PM +0200, Patrick Steinhardt wrote: > > > But in practice the check itself is misbehaving when handling unicode > > characters. The particular issue triggered with a branch that got the > > "shaved ice" unicode character in its name, which is composed of the > > bytes "0xEE 0x90 0xBF". The bug triggers when we compare the refname > > "refs/heads/<shaved-ice>" to something like "refs/heads/z", and it > > specifically hits when comparing the first byte, "0xEE". > > > > The root cause is that the most-significant bit of 0xEE is set. The > > `refname` and `prefix` pointers that we use to compare bytes with one > > another are both pointers to signed characters. As such, when we > > dereference the 0xEE byte the result is a _negative_ value, and this > > value will of course compare smaller than "z". > > > > We can see that this issue is avoided in `cmp_packed_refname()`, where > > we explicitly cast each byte to its unsigned form. Fix the bug by doing > > the same in `packed_ref_iterator_advance()`. > > Ah, good catch. I think this signed-ness issue has come up before, long > ago, but I don't remember the context. In theory any stable ordering is > OK for sorting, but of course cmp_packed_refname() chose to use unsigned > in order to match strcmp(), and the standard defines it as interpreting > the bytes as unsigned. One of the enjoyable quirks of C. I'd agree it's a quirk, but I'm not sure I'd call it enjoyable. Anyway, this does seem like the right solution and I agree matching strcmp is the right decision here. I also think unsigned byte comparisons are more intuitive, honestly. > > diff --git a/refs/packed-backend.c b/refs/packed-backend.c > > index b4289a7d9ce..7e31904bd41 100644 > > --- a/refs/packed-backend.c > > +++ b/refs/packed-backend.c > > @@ -980,9 +980,9 @@ static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator) > > continue; > > > > while (prefix && *prefix) { > > - if (*refname < *prefix) > > + if ((unsigned char)*refname < (unsigned char)*prefix) > > BUG("packed-refs backend yielded reference preceding its prefix"); > > - else if (*refname > *prefix) > > + else if ((unsigned char)*refname > (unsigned char)*prefix) > > return ITER_DONE; > > prefix++; > > refname++; > > The patch itself looks good to me. Same here. > > +test_expect_success 'list packed refs with unicode characters' ' > > + test_when_finished "rm -rf repo" && > > + git init repo && > > + ( > > + cd repo && > > + test_commit --no-tag A && > > + git update-ref refs/heads/ HEAD && > > + git update-ref refs/heads/z HEAD && > > It's possible some filesystems might be unhappy with this character, but > I guess we can see if anybody screams. I very much doubt that's going to be a problem. Windows uses UTF-16 internally, so it will have no problems with Unicode; macOS only allows UTF-8 in its file systems and I know it gracefully supports emoji; and other Unix systems don't care one way or the other because it's just some bytes. Even ISO9660 with Rock Ridge or Joliet extensions will work here, as will FAT, UDF (used on DVDs and some hard disks), and 9P. Certainly somebody could try something very esoteric, but I expect other things will break as well. I'm okay with favouring testing things we know many people _are_ using (emojis and Unicode) over things very few people are using (very esoteric file systems[0]). I could imagine a hypothetical encoding issue in the Cygwin or MSYS layer being a problem, but if it passes the testsuite in CI, I'm almost certain it will be fine. [0] I think I'm familiar with a decent number of file systems and I cannot think of any available on a modern OS that would have this problem. I'm sure some must exist, though. -- brian m. carlson (they/them) Toronto, Ontario, CA
Attachment:
signature.asc
Description: PGP signature