Re: [PATCH v3 5/8] Add git-contributors script to notify about merges

Andrey Albershteyn <aalbersh@xxxxxxxxxx> · Wed, 12 Feb 2025 12:16:46 +0100

On 2025-02-11 10:58:04, Darrick J. Wong wrote:
> On Tue, Feb 11, 2025 at 06:26:57PM +0100, Andrey Albershteyn wrote:
> > Add python script used to collect emails over all changes merged in
> > the next release.
> > 
> > CC: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> > ---
> >  tools/git-contributors.py | 94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 94 insertions(+)
> > 
> > diff --git a/tools/git-contributors.py b/tools/git-contributors.py
> > new file mode 100755
> > index 0000000000000000000000000000000000000000..83bbe8ce0ee1dcbd591c6d3016d553fac2a7d286
> > --- /dev/null
> > +++ b/tools/git-contributors.py
> > @@ -0,0 +1,94 @@
> > +#!/usr/bin/python3
> > +
> > +# List all contributors to a series of git commits.
> > +# Copyright(C) 2025 Oracle, All Rights Reserved.
> > +# Licensed under GPL 2.0 or later
> > +
> > +import re
> > +import subprocess
> > +import io
> > +import sys
> > +import argparse
> > +import email.utils
> > +
> > +DEBUG = False
> > +
> > +def backtick(args):
> > +    '''Generator function that yields lines of a program's stdout.'''
> > +    if DEBUG:
> > +        print(' '.join(args))
> > +    p = subprocess.Popen(args, stdout = subprocess.PIPE)
> > +    for line in io.TextIOWrapper(p.stdout, encoding="utf-8"):
> > +        yield line
> > +
> > +class find_developers(object):
> > +    def __init__(self):
> > +        tags = '%s|%s|%s|%s|%s|%s|%s|%s' % (
> > +            'signed-off-by',
> > +            'acked-by',
> > +            'cc',
> > +            'reviewed-by',
> > +            'reported-by',
> > +            'tested-by',
> > +            'suggested-by',
> > +            'reported-and-tested-by')
> > +        # some tag, a colon, a space, and everything after that
> > +        regex1 = r'^(%s):\s+(.+)$' % tags
> > +
> > +        self.r1 = re.compile(regex1, re.I)
> > +
> > +    def run(self, lines):
> > +        addr_list = []
> > +
> > +        for line in lines:
> > +            l = line.strip()
> > +
> > +            # emailutils can handle abominations like:
> > +            #
> > +            # Reviewed-by: Bogus J. Simpson <bogus@xxxxxxxxxxx>
> > +            # Reviewed-by: "Bogus J. Simpson" <bogus@xxxxxxxxxxx>
> > +            # Reviewed-by: bogus@xxxxxxxxxxx
> > +            # Cc: <stable@xxxxxxxxxxxxxxx> # v6.9
> > +            # Tested-by: Moo Cow <foo@xxxxxxx> # powerpc
> > +            m = self.r1.match(l)
> > +            if not m:
> > +                continue
> > +            (name, addr) = email.utils.parseaddr(m.expand(r'\g<2>'))
> > +
> > +            # This last split removes anything after a hash mark,
> > +            # because someone could have provided an improperly
> > +            # formatted email address:
> > +            #
> > +            # Cc: stable@xxxxxxxxxxxxxxx # v6.19+
> > +            #
> > +            # emailutils doesn't seem to catch this, and I can't
> > +            # fully tell from RFC2822 that this isn't allowed.  I
> > +            # think it is because dtext doesn't forbid spaces or
> > +            # hash marks.
> > +            addr_list.append(addr.split('#')[0])
> 
> I think it's the case that the canonical stable cc tag format for kernel
> patches as provided by the stable kernel process rules document:
> 
> Cc: <stable@xxxxxxxxxxxxxxx> # vX.Y
> 
> is not actually rfc5322 compliant, so strings like that break
> Python's emailutils parsers.  parseaddr() completely chokes on this, and
> returns name=='' and addr=='', because the only things that can come after
> the address portion are whitespace, EOL, or a comma followed by more
> email addresses.  There's definitely not supposed to be an octothorpe
> followed by even more text.
> 
> In the end I let myself be nerdsniped with even more string parsing bs,
> and this loop body is the result:
> 
> 		l = line.strip()
> 
> 		# First, does this line match any of the headers we
> 		# know about?
> 		m = self.r1.match(l)
> 		if not m:
> 			continue
> 
> 		# The split removes everything after an octothorpe
> 		# (hash mark), because someone could have provided an
> 		# improperly formatted email address:
> 		#
> 		# Cc: stable@xxxxxxxxxxxxxxx # v6.19+
> 		#
> 		# This, according to my reading of RFC5322, is allowed
> 		# because octothorpes can be part of atom text.
> 		# However, it is interpreted as if there weren't any
> 		# whitespace ("stable@xxxxxxxxxxxxxxx#v6.19+").  The
> 		# grammar allows for this form, even though this is not
> 		# a correct Internet domain name.
> 		#
> 		# Worse, if you follow the format specified in the
> 		# kernel's SubmittingPatches file:
> 		#
> 		# Cc: <stable@xxxxxxxxxxxxxxx> # v6.9
> 		#
> 		# emailutils will not know how to parse this, and
> 		# returns empty strings.  I think this is because the
> 		# angle-addr specification allows only whitespace
> 		# between the closing angle bracket and the CRLF.
> 		#
> 		# Hack around both problems by ignoring everything
> 		# after an octothorpe, no matter where it occurs in the
> 		# string.  If someone has one in their name or the
> 		# email address, too bad.
> 		a = m.expand(r'\g<2>').split('#')[0]
> 
> 		# emailutils can extract email addresses from headers
> 		# that roughly follow the destination address field
> 		# format:
> 		#
> 		# Reviewed-by: Bogus J. Simpson <bogus@xxxxxxxxxxx>
> 		# Reviewed-by: "Bogus J. Simpson" <bogus@xxxxxxxxxxx>
> 		# Reviewed-by: bogus@xxxxxxxxxxx
> 		# Tested-by: Moo Cow <foo@xxxxxxx>
> 		#
> 		# Use it to extract the email address, because we don't
> 		# care about the display name.
> 		(name, addr) = email.utils.parseaddr(a)
> 		addr_list.append(addr)
> 
> <shrug> but maybe we should try that on a few branches first before
> committing to this string parsing mess ... ?  Not that this is any less
> stupid than the previous version I shared out. :(

Can we just drop anything with 'stable@'? These are patches from
libxfs syncs; do they have any value for the stable@ list?
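
If we went that way, the filter could look roughly like the sketch
below. drop_stable is a hypothetical helper, not something in the patch,
and matching on the 'stable@' prefix (rather than any substring) is my
assumption so that an address like unstable@ is not caught. The
addresses are placeholders.

```python
def drop_stable(addrs):
    """Return addrs without entries whose local part is 'stable'.

    Hypothetical helper: the prefix match is an assumption, the
    thread only says "anything with 'stable@'".
    """
    return [a for a in addrs if not a.lower().startswith('stable@')]


# Placeholder addresses, for illustration only.
kept = drop_stable(['stable@example.org', 'dev@example.com',
                    'unstable@example.com'])
```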

But the change still makes sense if anyone uses a hash mark for
something else, so I will apply your change.
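
For trying it outside the script, here is a minimal standalone sketch of
that loop body: match a known tag, cut everything after the first hash
mark, then hand the rest to email.utils.parseaddr(). The names TAG_RE
and extract_addr are made up for illustration, and the example.com /
example.org addresses are placeholders. It has not been run against
real branches.

```python
import email.utils
import re

# Same tag list as the patch; TAG_RE is a name made up for this sketch.
TAGS = ('signed-off-by', 'acked-by', 'cc', 'reviewed-by', 'reported-by',
        'tested-by', 'suggested-by', 'reported-and-tested-by')
TAG_RE = re.compile(r'^(%s):\s+(.+)$' % '|'.join(TAGS), re.I)


def extract_addr(line):
    """Return the email address from a trailer line, or None.

    Hypothetical helper mirroring the proposed loop body.
    """
    m = TAG_RE.match(line.strip())
    if not m:
        return None

    # Drop everything after the first hash mark before parsing, so
    # that a trailing "# v6.9" never reaches parseaddr().
    value = m.group(2).split('#')[0]

    # Only the address is wanted; the display name is discarded.
    _name, addr = email.utils.parseaddr(value)
    return addr or None


for sample in ('Cc: <stable@example.org> # v6.9',
               'Cc: stable@example.org # v6.19+',
               'Reviewed-by: "Bogus J. Simpson" <bogus@example.com>',
               'Subject: not a trailer'):
    print(sample, '->', extract_addr(sample))
```

As already noted, anything after a hash mark is lost, including a hash
mark inside a display name or address.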

-- 
- Andrey