Re: [PATCH v6 2/4] populate: add ability to populate a filesystem from a directory

Luca Di Maio <luca.dimaio1@xxxxxxxxx> · Thu, 24 Apr 2025 18:09:45 +0200

On Wed, Apr 23, 2025 at 01:23:58PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 23, 2025 at 06:03:17PM +0200, Luca Di Maio wrote:
> > +static void fail(char *msg, int i)
> > +{
> > +   fprintf(stderr, _("%s: %s [%d - %s]\n"), progname, msg, i, strerror(i));
> > +   exit(1);
> > +}
> > +
> > +static int newregfile(char *fname)
> > +{
> > +   int fd;
> > +   off_t size;
> > +
> > +   if ((fd = open(fname, O_RDONLY)) < 0 || (size = filesize(fd)) < 0) {
> > +           fprintf(stderr, _("%s: cannot open %s: %s\n"), progname, fname,
> > +                   strerror(errno));
> > +           exit(1);
> > +   }
> > +
> > +   return fd;
> > +}
>
> Why is this copy-pasting code from proto.c?  Put the new functions
> there, and then you don't need all this externing.
>

Right, this is because with a separate flag I thought it would have been
better to keep it in a separate file.
With the new behaviour you proposed in the previous mail (one -p flag,
check if file/directory) then I can unify back into proto.c, thus
removing all the exported functions changes.

> > +
> > +static void writetimestamps(struct xfs_inode *ip, struct stat statbuf)
> > +{
> > +   struct timespec64 ts;
> > +
> > +   /*
> > +    * Copy timestamps from source file to destination inode.
> > +    *  In order to not be influenced by our own access timestamp,
> > +    *  we set atime and ctime to mtime of the source file.
> > +    *  Usually reproducible archives will delete or not register
> > +    *  atime and ctime, for example:
> > +    *     https://www.gnu.org/software/tar/manual/html_section/Reproducibility.html
> > +    */
> > +   ts.tv_sec = statbuf.st_mtime;
> > +   ts.tv_nsec = statbuf.st_mtim.tv_nsec;
> > +   inode_set_atime_to_ts(VFS_I(ip), ts);
> > +   inode_set_ctime_to_ts(VFS_I(ip), ts);
> > +   inode_set_mtime_to_ts(VFS_I(ip), ts);
>
> This seems weird to me that you'd set [ac]time to mtime.  Why not open
> the source file O_ATIME and copy atime?  And why would copying ctime not
> result in a reproducible build?
>
> Not sure what you do about crtime.
>

The problem stems from the extraction of the artifact. Usually
reproducible archives will remove [ac]time and only keep mtime, but in
the moment that a file is extracted, any filesystem will assign [ac]time
to the moment of extraction.
This will add randomness not to the filesystem itself, because it will
be reproducible if acting on the same extracted archive, but it will not
be reproducible if acting on a new extraction of the same archive.

Another approach we can do is what mkfs.ext4's populate functionality is
doing: while it preserves mtime, [cr,a,c]time is set to whatever time the
mkfs command is running.

This would make it preserve the important timestamp (mtime) and move the
"problem" of the reproducible/changing timestamp to the environment,
while keeping the behaviour of mkfs.xfs sensible

What do you think?

> > +   /*
> > +    * copy over file content, attributes and
> > +    * timestamps
> > +    */
> > +   if (fd != 0) {
> > +           writefile(ip, fname, fd);
> > +           writeattrs(ip, fname, fd);
>
> Since we're adding features, should this read the fsxattr info from the
> source file, override it with the set fields in *fsxp, and set that on
> the file?  If you're going to slurp up a directory, you might as well
> get all the non-xattr file attributes.
>

Right, I thought creatproto() did that, but now I see that this is done
only for the root inode, I'll add this for others too, thanks.

> > +           libxfs_parent_finish(mp, ppargs);
> > +           tp = NULL;
>
> Shouldn't this copy xattrs and fsxattrs to directories and symlinks too?
>

Right, will add, thanks.

> > +/*
> > + * walk_dir will recursively list files and directories
> > + * and populate the mountpoint *mp with them using handle_direntry().
> > + */
> > +static void walk_dir(struct xfs_mount *mp, struct xfs_inode *pip,
> > +                       struct fsxattr *fsxp, char *cur_path)
> > +{
> > +   DIR *dir;
> > +   struct dirent *entry;
> > +
> > +   /*
> > +    * open input directory and iterate over all entries in it.
> > +    * when another directory is found, we will recursively call
> > +    * populatefromdir.
> > +    */
> > +   if ((dir = opendir(cur_path)) == NULL)
> > +           fail(_("cannot open input dir"), 1);
> > +   while ((entry = readdir(dir)) != NULL) {
> > +           handle_direntry(mp, pip, fsxp, cur_path, entry);
> > +   }
> > +   closedir(dir);
> > +}
>
> nftw() ?  Which has the nice feature of constraining the number of open
> dirs at any given time.
>
> --D
>

The problem with nftw() is that working with callback functions, we will
need to switch to static variables for state, for example to keep track
of each ip's pip, while with the recursive approach we can have some
state and basically walk_dir() behaves similar to parseproto(), making
changes to the rest of the file minimal.
This seems to involve a lot more changes than now where we're basically
just adding a limited number of functions to proto.c.

Thanks again for the review Darrick,
I'll wait for your feedback on the walk_dir() vs nftw() and the [ac]time
approach,
thanks

L.