Re: [PATCH v5 2/5] promisor-remote: allow a server to advertise more fields

Christian Couder <christian.couder@xxxxxxxxx> · Mon, 21 Jul 2025 16:09:10 +0200

On Thu, Jun 26, 2025 at 12:29 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > For now the "promisor-remote" protocol capability can only pass "name"
> > and "url" information from a server to a client in the form
> > "name=<remote_name>,url=<remote_url>".
> >
> > Let's make it possible to pass more information by introducing
>
> Motivation, like "in order to allow the client more efficient
> transfer", "in order to give more control on what can be fetched
> from the server", etc., needs to be given before we say "let's do
> X".  We want to do X not because we can do X; we want to do X as it
> currently seems to us the best way to achieve Y.  What is our Y in
> this case for the proposed feature?

Ok, I have changed it to the following in v6:

   To allow clients to make more informed decisions about which promisor
   remotes they accept, let's make it possible to pass more information
   by introducing a new "promisor.sendFields" configuration variable.

> > a new
> > "promisor.sendFields" configuration variable. This variable should
> > contain a comma or space separated list
>
> By making it easier for casual humans who manually write the
> configuration variable (presumably while testing) and allowing both
> comma and space as separator, this design decision is forcing one
> more rule to worry about for those who are writing the parser for
> the value.  There may be some existing configuration variables with
> such a "leninent" syntax, but I'd rather see us not make the mess
> even worse.

We indeed have a number of configuration variables accepting lists of
items separated by both comma and space. As we cannot fix those easily
for backward compatibility and they still make things a bit simpler
for users, my opinion is that we'd better bite the bullet and make
sure we have a simple and hard-to-misuse standard way to parse such
lists. Then we can allow such comma and space separated lists in many
places, and use this standard way to parse those lists wherever they
are allowed. Users would have a consistent experience when dealing
with different configuration variables.

It also seems to me that parsing such lists is not so difficult right
now. It basically only requires a call to
`string_list_split_in_place()`, followed by a call to
`string_list_remove_empty_items()`. Maybe we could refactor that into
a dedicated (possibly inline) function (maybe called
`string_list_split_config_list()`)
and make sure it is used in all the places where such lists are
accepted.

I would prefer adding that kind of refactoring in the form of a
preparatory patch (or maybe a separate patch altogether) rather than
for example allowing only spaces as separators. For now I have made no
changes in v6 regarding this.

> > of field names that will be
> > looked up in the configuration of the remote on the server to find the
> > values that will be passed to the client.
>
> I know the name "field" was discussed in earlier iterations, but
> these three lines together with "For example" in a later paragraph,
> it hints to me that this mechanism is to choose variable-value pairs
> for which among remotes.<name>.* variables to send after name=<name>
> and url=<url>; is my understanding correct?  If so, can we clarify
> the paragraphs around here even more so that I do not even have to
> ask this question?

Yeah, it seems to me that there was previously a sentence about what
fields are, but it looks like I removed it by mistake. Anyway, before
the paragraph about what the "promisor.sendFields" configuration
variable contains, I have added the following in v6:

   On the server side, information about a remote `foo` is stored in
   configuration variables named `remote.foo.<variable-name>`. To make
   it clearer and simpler, we use `field` and `field name` like this:

     * `field name` refers to the <variable-name> part of such a
       configuration variable, and

     * `field` refers to both the `field name` and the value of such a
       configuration variable.

> What do we call the third-level of a variable name in the
> configuration file?  The description on the "--regexp" option in
> "git config --help" hints one:
>
>     With `get`, interpret the name as a regular expression. Regular
>     expression matching is currently case-sensitive and done against
>     a canonicalized version of the key in which section and variable
>     names are lowercased, but subsection names are not.
>
> So a for remote.origin.partialCloneFilter, "remote" is the section
> name, "origin" is the subsection name, and "partialCloneFilter" is
> the variable name.

Yeah, I thought that "variable name" could be confusing, so I prefered
to use another word, "field", to talk about this.

> Armed with that knowledge, perhaps something like:
>
>     <<the motivation comes here, and then ...>> In order to give
>     additional information to the client, the server side can set a
>     new 'promisor.sendAdditionalVariables' configuration variable,
>     whose value is a comma separated list of variable names.
>
>     The values of the configuration variables specified by this
>     variable for the given remote [*] are sent as comma separated
>     list of variable=<value>, after name=<name>,url=<url> in the
>     promisor-remote capability.
>
>     [*] Concatenating each of these variable names listed as the
>     value of promisor.sendAdditionalVariables after remote.<name>.
>     results in the configuration variables exposed to the client.

I'd rather avoid talking about "variable name" other than to explain
what "field name" and "field" are. I think it would be too confusing,
especially if the name of the new config options are still
"promisor.sendFields" and "promisor.checkFields".

> to replace all of the above, and then it ...
>
> > Only a set of predefined fields are allowed. The only fields in this
> > set are "partialCloneFilter" and "token". The "partialCloneFilter"
> > field specifies the filter definition used by the promisor remote,
> > and the "token" field can provide an authentication credential for
> > accessing it.
>
> ... is a good thing to describe that we have these two supported.
>
> And this one ...
>
> > For example, if "promisor.sendFields" is set to "partialCloneFilter",
> > and the server has the "remote.<name>.partialCloneFilter" config
> > variable set to a value for a remote, then that value will be passed
> > in the form "partialCloneFilter=<value>" after the "name" and "url"
> > fields.
>
> ... has already been covered.  My main point is that we should
> clarify what a 'field' is and how it relates to the variables in the
> configuration file of which side (the server, not the client) a lot
> earlier than we say "these two are the only ones that are
> supported".  Before understanding what a field is, "only these two
> are valid" does not give readers much useful information.

Yeah, I hope that the definitions I have added are good enough.

> > diff --git a/Documentation/gitprotocol-v2.adoc b/Documentation/gitprotocol-v2.adoc
> > index 9a57005d77..0583fafa09 100644
> > --- a/Documentation/gitprotocol-v2.adoc
> > +++ b/Documentation/gitprotocol-v2.adoc
> > @@ -785,33 +785,59 @@ retrieving the header from a bundle at the indicated URI, and thus
> >  save themselves and the server(s) the request(s) needed to inspect the
> >  headers of that bundle or bundles.
> >
> > -promisor-remote=<pr-infos>
> > +promisor-remote=<pr-info>
> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >  The server may advertise some promisor remotes it is using or knows
> >  about to a client which may want to use them as its promisor remotes,
> > +instead of this repository. In this case <pr-info> should be of the
> >  form:
> >
> > +     pr-info = pr-fields | pr-info ";" pr-info
> >
> > +     pr-fields = field-name "=" field-value | pr-fields "," pr-fields
>
> Typically a name is at most a few words (concatenated with some
> punctuation), and url would probably be a few hundred bytes at most?
> can we afford to cram even more var=value on the same line?

We are adding only 2 relatively short optional "var=value" per remote,
so a few hundred bytes at most per remote. This means the whole
advertisement for one remote is at most around 500 bytes. As the max
pkt-line size is around 16^4, so 65536, this still allows for more
than 130 remotes to be advertised which should be enough for now.

> The syntax allows you to send more than one name/url pair on the
> same promisor-remote.  Length aside, does this pr-fields mechanism
> give the values of these fields for multiple remotes?
>
> Suppose you want to advertise promisor-remote "A" and "B", and your
> sendFields says "token".  What does your promisor-remote=<pr-info> look
> like in such a case?
>
>     name=A,url=https://git.git/A,token=foo;name=B,url=https://git.git/B,token=bar

Yes, it should look like this, and each "field-name" should appear at most once.

> Do we explicitly say that the set of var=val on promisor-remote are
> grouped into distinct sets with ';' and each set talks about only
> one remote?  I.e.
>
>     name=A,name=B,url=https://git.git/A,url=https://git.git/B,token=foo,token=bar
>
> may be syntactically allowed in the above BNF, but it needs _some_
> rule to match which URL and which token go with which name.

There is already the following:

"where all the `field-name` and `field-value` in a given `pr-fields`
are field names and values related to a single promisor remote."

This disallows having fields related to different promisor remotes
grouped together by ';'.

> Do we
> also need to say that within a single ';' separated group, a variable
> can only appear once?  IOW, this is invalid?
>
>     name=A,url=https://git.git/A,token=foo,token=bar

This should be invalid, but it wasn't explicitly said that it is
invalid. In v6 though I have added "A given `field-name` MUST NOT
appear more than once in given `pr-fields`."

> > +where all the `field-name` and `field-value` in a given `pr-fields`
> > +are field names and values related to a single promisor remote.
> >
> > +The server MUST advertise at least the "name" and "url" field names
> > +along with the associated field values, which are the name of a valid
> > +remote and its URL, in each `pr-fields`. The "name" and "url" fields
> > +MUST appear first in each pr-fields, in that order.
>
> With
>
>     pr-fields = pr-fields "," pr-fields
>
> the rule in the last sentence does not make sense.  You'd need a
> distinct BNF non-terminal to name the one you want the rule to apply
> to.  Perhaps something like
>
>         pr-info = pr-one-remote-info | pr-info ';' pr-one-remote-info
>         pr-one-remote-info = pr-fields
>         pr-fields = pr-field | pr-fields ',' pr-field
>         pr-field = variable '=' value

I think the following, which I have put in v6, should be enough but
maybe I am missing something:

    pr-info = pr-fields | pr-info ";" pr-fields
    pr-fields = pr-field | pr-fields "," pr-field
    pr-field = field-name "=" field-value

Also in the 'object-info' section of the same file, "attrs = attr |
attrs SP attrs" has the same problem and might want to be changed to
"attrs = attr | attrs SP attr".

> Then you can safely say 'name' and 'url', must be the first two that
> appear in each pr-one-remote-info.

It seems to me that this sentence which is already in the doc should be enough:

"The "name" and "url" fields MUST appear first in each pr-fields, in
that order."

> > +After these mandatory fields, the server MAY advertise the following
> > +optional fields in any order:
> > +
> > +- "partialCloneFilter": The filter specification used by the remote.
> > +Clients can use this to determine if the remote's filtering strategy
> > +is compatible with their needs (e.g., checking if both use "blob:none").
> > +It corresponds to the "remote.<name>.partialCloneFilter" config setting.
> > +
> > +- "token": An authentication token that clients can use when
> > +connecting to the remote. It corresponds to the "remote.<name>.token"
> > +config setting.
> > +
> > +No other fields are defined by the protocol at this time. Clients MUST
> > +ignore fields they don't recognize to allow for future protocol
> > +extensions.
>
> This forces us to never add support for a field that MUST be
> understood for the protocol to operate correctly.  Is it sensible?

The protocol for now allows clients to only reply with the names of
the promisor remote they accept. Clients can also set
"promisor.acceptFromServer" to "All". This means that servers should
not trust anyway that the client understood any specific field they
sent based on the client's reply.

If the server controls the promisor remotes it advertises, it can send
for example a token in the "token" field and then use other means than
this protocol to make sure that the client understood the token (and
maybe other fields) when the client actually accesses the promisor
remote.

If a client wants to make sure that the server passed a specific field
with a specific value, they can list this field's name in their
"promisor.checkFields" configuration variable which is introduced in
patch 4/5 and have their corresponding config option set to the
specific value.

> Just like the index file format defines optional extensions that can
> be ignored (implication of which is that you MUST die if you do not
> understand any non-optional ones), shouldn't we have some mechanism
> to tell "if you do not understand this, do not use the remote, or
> your repository will be broken in a horrible way"?
>
> I dunno.

If we want to make sure that clients don't access some remotes unless
they understand something, we'd better rely at least partially on
another mechanism because clients can already be configured to access
any remote. They don't need this protocol to be able to access a
remote.

In fact the information transmitted by the server is not used to
access the remote. It's only used to decide if the remote is accepted.
So the remote has to be already configured to access the remote.

Thanks.