On 2025-02-11 10:58:04, Darrick J. Wong wrote:
> On Tue, Feb 11, 2025 at 06:26:57PM +0100, Andrey Albershteyn wrote:
> > Add python script used to collect emails over all changes merged in
> > the next release.
> >
> > CC: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> > ---
> >  tools/git-contributors.py | 94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 94 insertions(+)
> >
> > diff --git a/tools/git-contributors.py b/tools/git-contributors.py
> > new file mode 100755
> > index 0000000000000000000000000000000000000000..83bbe8ce0ee1dcbd591c6d3016d553fac2a7d286
> > --- /dev/null
> > +++ b/tools/git-contributors.py
> > @@ -0,0 +1,94 @@
> > +#!/usr/bin/python3
> > +
> > +# List all contributors to a series of git commits.
> > +# Copyright(C) 2025 Oracle, All Rights Reserved.
> > +# Licensed under GPL 2.0 or later
> > +
> > +import re
> > +import subprocess
> > +import io
> > +import sys
> > +import argparse
> > +import email.utils
> > +
> > +DEBUG = False
> > +
> > +def backtick(args):
> > +    '''Generator function that yields lines of a program's stdout.'''
> > +    if DEBUG:
> > +        print(' '.join(args))
> > +    p = subprocess.Popen(args, stdout = subprocess.PIPE)
> > +    for line in io.TextIOWrapper(p.stdout, encoding="utf-8"):
> > +        yield line
> > +
> > +class find_developers(object):
> > +    def __init__(self):
> > +        tags = '%s|%s|%s|%s|%s|%s|%s|%s' % (
> > +            'signed-off-by',
> > +            'acked-by',
> > +            'cc',
> > +            'reviewed-by',
> > +            'reported-by',
> > +            'tested-by',
> > +            'suggested-by',
> > +            'reported-and-tested-by')
> > +        # some tag, a colon, a space, and everything after that
> > +        regex1 = r'^(%s):\s+(.+)$' % tags
> > +
> > +        self.r1 = re.compile(regex1, re.I)
> > +
> > +    def run(self, lines):
> > +        addr_list = []
> > +
> > +        for line in lines:
> > +            l = line.strip()
> > +
> > +            # emailutils can handle abominations like:
> > +            #
> > +            # Reviewed-by: Bogus J. Simpson <bogus@xxxxxxxxxxx>
> > +            # Reviewed-by: "Bogus J. Simpson" <bogus@xxxxxxxxxxx>
> > +            # Reviewed-by: bogus@xxxxxxxxxxx
> > +            # Cc: <stable@xxxxxxxxxxxxxxx> # v6.9
> > +            # Tested-by: Moo Cow <foo@xxxxxxx> # powerpc
> > +            m = self.r1.match(l)
> > +            if not m:
> > +                continue
> > +            (name, addr) = email.utils.parseaddr(m.expand(r'\g<2>'))
> > +
> > +            # This last split removes anything after a hash mark,
> > +            # because someone could have provided an improperly
> > +            # formatted email address:
> > +            #
> > +            # Cc: stable@xxxxxxxxxxxxxxx # v6.19+
> > +            #
> > +            # emailutils doesn't seem to catch this, and I can't
> > +            # fully tell from RFC2822 that this isn't allowed.  I
> > +            # think it is because dtext doesn't forbid spaces or
> > +            # hash marks.
> > +            addr_list.append(addr.split('#')[0])
>
> I think it's the case that the canonical stable cc tag format for kernel
> patches as provided by the stable kernel process rules document:
>
>   Cc: <stable@xxxxxxxxxxxxxxx> # vX.Y
>
> is not actually RFC 5322 compliant, so strings like that break
> Python's email.utils parsers.  parseaddr() completely chokes on this, and
> returns name=='' and addr=='', because the only things that can come after
> the address portion are whitespace, EOL, or a comma followed by more
> email addresses.  There's definitely not supposed to be an octothorpe
> followed by even more text.
>
> In the end I let myself be nerdsniped with even more string parsing bs,
> and this loop body is the result:
>
>             l = line.strip()
>
>             # First, does this line match any of the headers we
>             # know about?
>             m = self.r1.match(l)
>             if not m:
>                 continue
>
>             # The split removes everything after an octothorpe
>             # (hash mark), because someone could have provided an
>             # improperly formatted email address:
>             #
>             # Cc: stable@xxxxxxxxxxxxxxx # v6.19+
>             #
>             # This, according to my reading of RFC5322, is allowed
>             # because octothorpes can be part of atom text.
>             # However, it is interpreted as if there weren't any
>             # whitespace ("stable@xxxxxxxxxxxxxxx#v6.19+").  The
>             # grammar allows for this form, even though this is not
>             # a correct Internet domain name.
>             #
>             # Worse, if you follow the format specified in the
>             # kernel's SubmittingPatches file:
>             #
>             # Cc: <stable@xxxxxxxxxxxxxxx> # v6.9
>             #
>             # emailutils will not know how to parse this, and
>             # returns empty strings.  I think this is because the
>             # angle-addr specification allows only whitespace
>             # between the closing angle bracket and the CRLF.
>             #
>             # Hack around both problems by ignoring everything
>             # after an octothorpe, no matter where it occurs in the
>             # string.  If someone has one in their name or the
>             # email address, too bad.
>             a = m.expand(r'\g<2>').split('#')[0]
>
>             # emailutils can extract email addresses from headers
>             # that roughly follow the destination address field
>             # format:
>             #
>             # Reviewed-by: Bogus J. Simpson <bogus@xxxxxxxxxxx>
>             # Reviewed-by: "Bogus J. Simpson" <bogus@xxxxxxxxxxx>
>             # Reviewed-by: bogus@xxxxxxxxxxx
>             # Tested-by: Moo Cow <foo@xxxxxxx>
>             #
>             # Use it to extract the email address, because we don't
>             # care about the display name.
>             (name, addr) = email.utils.parseaddr(a)
>             addr_list.append(addr)
>
> <shrug> but maybe we should try that on a few branches first before
> committing to this string parsing mess ... ?  Not that this is any less
> stupid than the previous version I shared out. :(

Can we just drop anything with 'stable@'? These are patches from libxfs
syncs; do they have any value for the stable@ list?
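
A minimal, untested-on-real-branches sketch of that idea, combining the
hash-mark split quoted above with a stable@ filter. The names
collect_addrs and drop_stable are hypothetical and not part of the patch,
and skipping empty parse results is an extra assumption:

```python
import email.utils
import re

# Same tag list as in the quoted patch.
TAGS = (
    'signed-off-by', 'acked-by', 'cc', 'reviewed-by', 'reported-by',
    'tested-by', 'suggested-by', 'reported-and-tested-by',
)
TAG_RE = re.compile(r'^(%s):\s+(.+)$' % '|'.join(TAGS), re.I)

def collect_addrs(lines, drop_stable=True):
    '''Return the email addresses found in tag lines, in input order.

    Hypothetical helper for illustration only.
    '''
    addr_list = []
    for line in lines:
        m = TAG_RE.match(line.strip())
        if not m:
            continue
        # Ignore everything after a hash mark before parsing.
        a = m.group(2).split('#')[0]
        (name, addr) = email.utils.parseaddr(a)
        # Assumption: silently skip lines that yield no address.
        if not addr:
            continue
        # Assumed filter: skip any address on a stable@ list.
        if drop_stable and addr.lower().startswith('stable@'):
            continue
        addr_list.append(addr)
    return addr_list
```

Passing drop_stable=False keeps the stable@ entries, so the filter stays
optional for anyone who wants them in the list.
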
Either way, your change still makes sense if anyone uses a hash mark for
something else, so I will apply it.

-- 
- Andrey