Hi brian
On 08/04/2025 02:22, brian m. carlson wrote:
On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote:
From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx>
[RFC] rebase -m: partial support for copying extra commit headers
This patch is largely a response to
https://lore.kernel.org/git/Z-5rpWKAVPmz32jC@xxxxxx/ . I'm in two minds
about whether we should consider merging such partial support but if it
helps forges preserve extra commit headers then it may well be worth it.
I'd like to see command-line options to control this and ideally a
configuration option. Right now, we know nothing about these extra
headers, including an expected format. If a future version of Git (say,
3.0) adds a new header and the user includes invalid data in this extra
header (which happens all the time with author and committer
information), then 2.50 will propagate it on rebase and it won't be
fixed until the user uses a version of Git that understands the header
and can fsck it correctly. That's not really great, since it means we
can unknowingly spread corruption.
We could certainly add some way to make this opt-in if there is a desire
for it and you make a good point about compatibility if we add a new
commit header. I'm not sure I'd describe preserving these headers when
rebasing as spreading corruption though as we're simply rewriting
existing commits. If the user chose to merge rather than rebase we'd
still have the same issues without creating any new commits.
I am pretty sure that at $DAYJOB we'll need to have a discussion about
whether we want to propagate these headers during rebase and I'm
personally leaning against it.
My understanding is that GitHub has been using "git replay" for rebases
and therefore copying extra commit headers since the middle of 2023
[1,2]. The message I linked to in my original mail suggests that the
"change-id" header is preserved when rebasing on GitHub.
Why, you ask? I've seen at least the following types of corruption:
* Missing timezones
* Timezones with less than four digits
* Valid timezones padded to more than four digits with zeros
* Timezones which don't exist and never have (e.g., +1700)
* Timezones which are so absurdly large that they push the date to a
year when nobody alive now will still be living
* Date stamps that are larger than 2^64
* Date stamps which are smaller than 2^64 but beyond the expected life
of the Sun
* Extra angle brackets in the email field
* Nothing in between the email brackets
* Nothing before the email brackets (no name at all)
* Names which are not UTF-8 but without an encoding header
* Names which are not valid in the specified encoding
* Emails which are not valid UTF-8[0]
* Emails which don't meet the (ludicrously generous to the point of
being nearly unparseable) RFC production
* Encodings which are not valid IANA charsets
* Messages with no body and no blank line (just the newline at the end
of the final header)
* gpgsig headers that include random non-ASCII bytes and control
characters[1]
Thanks for sharing that, it is an interesting list. On the subject of
encoding I do think our documentation could be clearer that the encoding
applies to all the headers as well as the commit message. As far as I
can see it only mentions the commit message, not the author or committer
identities but repo_logmsg_reencode() re-encodes the whole commit
buffer. Out of interest do you think we could be doing a better job with
fsck to pick up some of these problems earlier?
I think "git rebase" only cares that the author identity can be parsed
by split_ident() which is fairly lenient.
I see Patrick is CC'd here and I'm interested in his thoughts, as well
as, of course, those of anyone else as well.
Yes me too
Thanks for your thoughtful and intereting reply
Phillip
[1]
https://github.blog/changelog/2023-06-28-rebase-commits-now-created-using-the-merge-ort-strategy/
[2]
https://github.blog/engineering/infrastructure/scaling-merge-ort-across-github/