Re: [PATCH v2 00/11] do not overuse strbuf_split*()

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 02 Aug 2025 10:09:59 -0700

Jeff King <peff@xxxxxxxx> writes:

> On Thu, Jul 31, 2025 at 03:54:22PM -0700, Junio C Hamano wrote:
>
>> strbuf is a very good data structure to work with string data
>> without having to worry about running past the end of the string.
>> 
>> But an array of strbuf is often a wrong data structure.  You rarely
>> have need to be able to edit multiple strings represented by such an
>> array simultaneously.  And strbuf_split*() that produces result in
>> such a shape is a misdesigned API function.
>
> Yeah, I agree it's a bad API, and I'd be happy if we could eventually
> get rid of it.

I'd be happy to mark it #leftoverbits for aspiring Git hackers to
look at.  Perhaps a good microproject material (there are still
several instances left---I talked Christian out of adding more in
the series he is iterating right now).

> Conversion to string_list mostly makes sense (though I especially like
> the case in sub-process.c where we can just parse with skip_prefix()).

Yes, for that one, strbuf_split() was really killing a butterfly
with a musket or cracking a nut with a sledgehammer; or whatever
translation of the idiom you'd like ;-).

> I looked over the patches and didn't see anything objectionable. But I
> tried to think of subtle incompatibilities we might run into:
>
>   1. I saw the different handling of "max" you had to deal with in one
>      patch. Yuck. It might be worth tweaking string_list_split() to
>      count "pieces" (which is how every other split function I've seen,
>      like the one in perl, works). But that can be done later (and is
>      tricky to do safely, since we wouldn't be changing the function
>      signature).

Yes, I think I whined about it in one of the commit log messages.
That part is one of the things string_list API gets wrong, which we
may want to fix.

>   2. Is the handling of repeated delimiters the same? E.g., if I split
>      on "/" and we see "foo//bar", do both split implementations yield
>      the same output. I could see either of ("foo", "", "bar") and
>      ("foo", "bar") being produced.

Same between?  strbuf_split*() seems to be happy to produce an empty
piece in the result (and there is no "omit empty ones" feature).
You can choose with STRING_LIST_SPLIT_NOEMPTY both results, but for
a straight conversion, the vanilla one without that flag would do
what we want, I think.

> And...
>
>> The most common use case of strbuf_split*() family of functions
>> seems to be to trim away the whitespaces around each piece of split
>> string.  With modern string_list_split*(), it is often no longer
>> necessary.
>
> Some of these sites do strbuf_rtrim() on the split pieces. But
> STRING_LIST_SPLIT_TRIM does both left and right. Is this OK? I think so
> because in both cases we are already splitting on space, so we wouldn't
> expect left-hand spaces (of course you could have a stray tab or
> something, but I suspect the new code matches the intent more closely).

I did wonder about these "rtrim-only" callsites strbuf_split(), but
I did not think they needed ltrim, since they were splitting at SP
and rejected when the element were empty (remember, strbuf_split()
happily creates an empty element in the result).

Their use of rtrim() is primarily when they split at SP they want to
remove that SP (remember, another misdesign of strbuf_split() is
that it leaves the delimiter itself at the end of each split piece).
Both 05/11 step for merge-tree and 06/11 step for notes are good
examples.

So in short, yes, the new code is being a bit more lenient (the
original parsers were more strict and insisted on non-trimming
behaviour).  We might consider tightening them by removing the
STRING_LIST_SPLIT_TRIM bit to be more faithful to the original, but
as you may have noticed, doing so would make it more strict at the
right end.  The original that split at SP and then rtrimmed allowed
trailing HT after the data to be removed, but I do not think it was
part of the designed behaviour of the original anyway.  So I am
inclined to say we should probably drop STRING_LIST_SPLIT_TRIM;
nobody would scream.

Thanks for carefully reading.