Re: [PATCH v2 3/4] bulk-checkin: require transaction for index_blob_bulk_checkin()

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 22 Aug 2025 13:33:45 -0700

Justin Tobler <jltobler@xxxxxxxxx> writes:

> On 25/08/22 09:49AM, Junio C Hamano wrote:
>> Justin Tobler <jltobler@xxxxxxxxx> writes:
>> 
>> >  /*
>> > - * This creates one packfile per large blob unless bulk-checkin
>> > - * machinery is "plugged".
>> > + * This writes the specified object to a packfile. Objects written here
>> > + * during the same transaction are written to the same packfile. The
>> > + * packfile is not flushed until the transaction is flushed. The caller
>> > + * is expected to ensure a valid transaction is setup for objects to be
>> > + * recorded to.
>> >   *
>> >   * This also bypasses the usual "convert-to-git" dance, and that is on
>> >   * purpose. We could write a streaming version of the converting
>> > diff --git a/object-file.c b/object-file.c
>> > index 1740aa2b2e3..bc15af42450 100644
>> > --- a/object-file.c
>> > +++ b/object-file.c
>> > @@ -1253,19 +1253,26 @@ int index_fd(struct index_state *istate, struct object_id *oid,
>> >  	 * Call xsize_t() only when needed to avoid potentially unnecessary
>> >  	 * die() for large files.
>> >  	 */
>> > -	if (type == OBJ_BLOB && path && would_convert_to_git_filter_fd(istate, path))
>> > +	if (type == OBJ_BLOB && path && would_convert_to_git_filter_fd(istate, path)) {
>> >  		ret = index_stream_convert_blob(istate, oid, fd, path, flags);
>> > -	else if (!S_ISREG(st->st_mode))
>> > +	} else if (!S_ISREG(st->st_mode)) {
>> >  		ret = index_pipe(istate, oid, fd, type, path, flags);
>> > -	else if ((st->st_size >= 0 && (size_t) st->st_size <= repo_settings_get_big_file_threshold(istate->repo)) ||
>> > -		 type != OBJ_BLOB ||
>> > -		 (path && would_convert_to_git(istate, path)))
>> > +	} else if ((st->st_size >= 0 &&
>> > +		    (size_t)st->st_size <= repo_settings_get_big_file_threshold(istate->repo)) ||
>> > +		   type != OBJ_BLOB ||
>> > +		   (path && would_convert_to_git(istate, path))) {
>> >  		ret = index_core(istate, oid, fd, xsize_t(st->st_size),
>> >  				 type, path, flags);
>> > -	else
>> > -		ret = index_blob_bulk_checkin(the_repository->objects->transaction,
>> > +	} else {
>> > +		struct odb_transaction *transaction;
>> > +
>> > +		transaction = begin_odb_transaction(the_repository->objects);
>> > +		ret = index_blob_bulk_checkin(transaction,
>> >  					      oid, fd, xsize_t(st->st_size),
>> >  					      path, flags);
>> > +		end_odb_transaction(transaction);
>> > +	}
>> 
>> Interesting.  If the caller does the odb transaction management
>> itself by calling begin/end before calling this function and fed two
>> or more large objects, the original code did the right thing with
>> .nesting set to 1.  In the new code, it still does the right thing
>> even during the call to index_blob_bulk_checkin we raise .nesting by
>> one, because the all non-zero .nesting values mean the same thing to
>> the machinery, thanks to lazy initialization of the .objdir.
>
> In the original code, if no transaction was setup prior to invoking
> index_blob_bulk_checkin() (i.e. nesting == 0), the object would be
> written to its own packfile and flushed immediately. If a transaction
> had been started further upstream (i.e nesting > 0), the object would be
> written to a packfile, but not flushed. This allowed for subseqent calls
> to index_blob_bulk_checkin() to write objects to the same packfile. Only
> when transaction ended would the packfile be flushed.
>
> With this change, index_blob_bulk_checkin() must be provided a
> transaction so it can write the object to the packfile. As there is now
> always a transaction involved, there is no longer any automatic packfile
> flushing. The caller is required to ensure a transaction handling as
> appropriate.
>
>> So, isn't the comment above the function now less accurate than
>> before?  The caller of this function does not have to do anything
>> and we do not expect the caller to "ensure a valid transaction" at
>> all, no?
>
> I'm not quite sure I follow. index_blob_bulk_checkin() now expects a
> transaction to be setup even if we intend to only write a single object.
> Thus the call site in index_fd() is adjusted to ensure there is a
> transaction via invoking begin_odb_transaction() and
> end_odb_transaction() before/after the function respectively.
>
> Just to clarify, are we talking about the comment above
> index_blob_bulk_checkin()?

Ahh, sorry, I misread the patch and missed the hunk/file boundary;
iow, thought the comment was about index_fd(), which was the source
of my confusion.

Thanks for correcting me.