Re: [PATCH v3] range-diff: add configurable memory limit for cost matrix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Paulo Casaretto via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Paulo Casaretto <pcasaretto@xxxxxxxxx>
>
> When comparing large commit ranges (e.g., 250,000+ commits), range-diff
> attempts to allocate an n×n cost matrix that can exhaust available
> memory. For example, with 256,784 commits (n = 513,568), the matrix
> would require approximately 256GB of memory (513,568² × 4 bytes),
> causing either immediate segmentation faults due to integer overflow or
> system hangs.
>
> Add a memory limit check in get_correspondences() before allocating the
> cost matrix. This check uses the total size in bytes (n² × sizeof(int))
> and compares it against a configurable maximum, preventing both
> excessive memory usage and integer overflow issues.
>
> The limit is configurable via a new --max-memory option that accepts
> human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
> systems and 2GB for 32 bit systems. This allows comparing ranges of
> approximately 32,000 (16,000) commits - generous for real-world use cases
> while preventing impractical operations.
>
> When the limit is exceeded, range-diff now displays a clear error
> message showing both the requested memory size and the maximum allowed,
> formatted in human-readable units for better user experience.
>
> Example usage:
>   git range-diff --max-memory=1G branch1...branch2
>   git range-diff --max-memory=500M base..topic1 base..topic2
>
> This approach was chosen over alternatives:
> - Pre-counting commits: Would require spawning additional git processes
>   and reading all commits twice
> - Limiting by commit count: Less precise than actual memory usage
> - Streaming approach: Would require significant refactoring of the
>   current algorithm
>
> This issue was previously discussed in:
> https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@xxxxxxxxx/
>
> Acked-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> Signed-off-by: Paulo Casaretto <pcasaretto@xxxxxxxxx>
> ---

Looks good, especially without the reordering existing entries in
the options list.  The authorship information above looks much
better, too.

> @@ -40,6 +57,10 @@ int cmd_range_diff(int argc,
>  				  PARSE_OPT_OPTARG),
>  		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
>  				  N_("style"), N_("passed to 'git log'"), 0),
> +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> +			     N_("size"),
> +			     N_("maximum memory for cost matrix (default 4G)"),
> +			     parse_max_memory),
>  		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
>  				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
>  		OPT_BOOL(0, "left-only", &left_only,

Among existing options (an excerpt from "git range-diff h")

    --[no-]creation-factor <n>
                          percentage by which creation is weighted

    This controls how correspondence between commits on old and new
    branches are computed.

    --no-dual-color       use simple diff colors
    --dual-color          opposite of --no-dual-color

    These control how the findings are shown, by painting the lines
    in distinct colors. 

    --[no-]notes[=<notes>]
                          passed to 'git log'
    --[no-]diff-merges <style>
                          passed to 'git log'
    --[no-]remerge-diff   passed to 'git log'

    These control what text are used to represent each commit and
    participate in comparison and display.

    --[no-]left-only      only emit output related to the first range
    --[no-]right-only     only emit output related to the second range

    These again control how the findings are shown, by omitting some
    commits from the output.

So there is no perfectly logical place to place the new option, but
between diff-merges and remerge-diff somewhat feels a bit odder
choice than other possible places.

Will queue as is.  If some users find the location in the "-h"
output too odd and disturbing, they can later send in a reordering
patch on top, but I would think the chosen location is good enough.

As #leftoverbits we might want to

 * Group range-diff specific options with OPT_GROUP()

 * Instead of having to match the full NxN matrix, perhaps reduce
   the matrix by keeping the most promising M (which is much smaller
   than N) for each N, or something?

but that (especially the latter) is totally outside the scope of
this patch.

Thanks.





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux