On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote: > From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx> > > [RFC] rebase -m: partial support for copying extra commit headers > > This patch is largely a response to > https://lore.kernel.org/git/Z-5rpWKAVPmz32jC@xxxxxx/ . I'm in two minds > about whether we should consider merging such partial support but if it > helps forges preserve extra commit headers then it may well be worth it. I'd like to see command-line options to control this and ideally a configuration option. Right now, we know nothing about these extra headers, including an expected format. If a future version of Git (say, 3.0) adds a new header and the user includes invalid data in this extra header (which happens all the time with author and committer information), then 2.50 will propagate it on rebase and it won't be fixed until the user uses a version of Git that understands the header and can fsck it correctly. That's not really great, since it means we can unknowingly spread corruption. I am pretty sure that at $DAYJOB we'll need to have a discussion about whether we want to propagate these headers during rebase and I'm personally leaning against it. Why, you ask? I've seen at least the following types of corruption: * Missing timezones * Timezones with less than four digits * Valid timezones padded to more than four digits with zeros * Timezones which don't exist and never have (e.g., +1700) * Timezones which are so absurdly large that they push the date to a year when nobody alive now will still be living * Date stamps that are larger than 2^64 * Date stamps which are smaller than 2^64 but beyond the expected life of the Sun * Extra angle brackets in the email field * Nothing in between the email brackets * Nothing before the email brackets (no name at all) * Names which are not UTF-8 but without an encoding header * Names which are not valid in the specified encoding * Emails which are not valid UTF-8[0] * Emails which don't meet the (ludicrously generous to the point of being nearly unparseable) RFC production * Encodings which are not valid IANA charsets * Messages with no body and no blank line (just the newline at the end of the final header) * gpgsig headers that include random non-ASCII bytes and control characters[1] Note that all of these must be parsed in some meaningful way because users don't want their forge to serve them a 500 despite them having sent wildly invalid data. I encountered these during part of our transition from Rugged to Git (reftable, SHA-256) and they definitely added a lot of interesting complications (plus the need for lots of tests). Considering that writing valid data should not be that hard[2] (and should definitely be a priority) but apparently is for many people, I'm very wary of us propagating headers we're not ready to fsck and I'd like to have an out for users and forges who would like to be a little more careful. With those constraints, I'm not totally opposed in principle to this feature. I see Patrick is CC'd here and I'm interested in his thoughts, as well as, of course, those of anyone else as well. [0] SMTPUTF8 (RFC 6531 et al.) specifies that mailbox names may now contain UTF-8. For instance, you can email 🔵 at this domain and it will be delivered to me. [1] It should be noted that all of our signing implementations produce only non-control printable ASCII characters plus newline. [2] Maybe I am being unfair or unduly harsh to implementers, but I do at least agree with the half of Postel's Law that says that one should be conservative with what one generates and I would hope others do as well. -- brian m. carlson (they/them) Toronto, Ontario, CA
Attachment:
signature.asc
Description: PGP signature