On Thu, 31 Jul 2025, Ezekiel Newren wrote:
On Fri, Jul 18, 2025 at 1:00 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
"Ezekiel Newren via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
+extern u64 xxh3_64(u8 const* ptr, usize size);
+
+
static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
xdlclassifier_t *cf, xdfile_t *xdf) {
unsigned long *ha;
@@ -175,14 +178,26 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
xdl_parse_lines(mf, narec, xdf);
+	if ((xpp->flags & XDF_WHITESPACE_FLAGS) == 0) {
+		for (usize i = 0; i < (usize) xdf->nrec; i++) {
+			xrecord_t *rec = xdf->recs[i];
+			rec->ha = xxh3_64(rec->ptr, rec->size);
+		}
+	} else {
+		for (usize i = 0; i < (usize) xdf->nrec; i++) {
+			xrecord_t *rec = xdf->recs[i];
+			char const* dump = (char const*) rec->ptr;
+			rec->ha = xdl_hash_record(&dump, (char const*) (rec->ptr + rec->size), xpp->flags);
+		}
+	}
As a technology demonstration and proof of concept patch, this is
very nice, but to be upstreamed for real, we'd want a variant of
xxhash that can work with the contents with whitespace squashed to
be usable with various whitespace ignoring modes of operation. When
that happens, and when the result turns out to be more performant,
we can lose the xdl_hash_record() and require only the xxhash, which
would be great.
And that variant of xxhash that understands whitespace squashing can
of course be written in Rust as a part of this effort when the
series loses its RFC status. At the same time, those who want to
use our xdiff code in third-party software (like libgit2 and vim)
may want to reimplement it in C in their copy.
Thanks.
What is the Git precedent for replacement code that is easier to read
and maintain, and also more secure, but slower? I think hashing with
whitespace handling in Rust might fall into that category.
As far as I can tell, the Rust code for dealing with whitespace is
going to be slower than the C code because xdiff used a hash algorithm
(DJB2a) that can operate 1 byte at a time and combined hashing with
determining the length. xxHash requires either that the length be known
beforehand and the memory be contiguous, or that the input be hashed in
chunks. Hashing 1 byte at a time with xxHash is VERY slow since it's
just copying to an internal buffer until a full block is ready.
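
To make the contrast above concrete, here is a rough sketch in C. It is
not xdiff's actual xdl_hash_record(); both function names are
hypothetical, and "whitespace" is simplified to mean spaces and tabs
only. The first function hashes a line DJB2a-style while it scans for
the end of the line, so it needs neither the length nor a scratch
buffer. The second shows the extra copy a one-shot hash would need
before it could run over whitespace-squashed content.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not xdiff's xdl_hash_record(): hash one line with a
 * DJB2a-style step (h = h * 33 ^ c) while scanning for its end, optionally
 * skipping spaces and tabs. No length or scratch buffer is needed up front.
 */
static uint64_t djb2a_line(const char **cursor, const char *end, int skip_ws)
{
	uint64_t h = 5381;
	const char *p = *cursor;

	for (; p < end && *p != '\n'; p++) {
		if (skip_ws && (*p == ' ' || *p == '\t'))
			continue;
		h = (h * 33) ^ (unsigned char)*p;
	}
	/* Step past the newline, if there is one, for the next call. */
	*cursor = (p < end) ? p + 1 : p;
	return h;
}

/*
 * Hypothetical helper for a one-shot hash: copy a line into a contiguous
 * buffer without spaces and tabs and return the squashed length. dst must
 * have room for len bytes. A one-shot hash would then run over dst.
 */
static size_t squash_ws(const uint8_t *src, size_t len, uint8_t *dst)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++)
		if (src[i] != ' ' && src[i] != '\t')
			dst[n++] = src[i];
	return n;
}
```

With the second approach, every line is read twice (once to squash, once
to hash), which is one plausible source of the slowdown described above.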
On a broader note, how do I show the mailing list the changes that
I've made to this branch/patch series? I'm not sure what the proper
procedure is or even how to do it. What commands would I run, or what
steps would I take in the web browser, to show my newest commits?
Since you've used GitGitGadget for the original submission of this patch
series, the easiest way is to force push your updated commits to your PR
branch (xdiff_rust_speedup) and comment "/submit" on the PR again.
Alternatively you can send a version 2 of your patch series using git
format-patch and git send-email, but that is a few more manual steps.
MyFirstContribution.adoc has a detailed section about this process,
including a part about a v2. [1]
[1] https://github.com/git/git/blob/master/Documentation/MyFirstContribution.adoc#sending-patches-with-git-send-email
Best regards
Matthias