Hi Patrick, On Thu, 20 Mar 2025, Patrick Steinhardt wrote: > In our CI systems we can observe that t0610 fails rather frequently. > This testcase races a bunch of git-update-ref(1) processes with one > another which are all trying to update a unique reference, where we > expect that all processes succeed and end up updating the reftable > stack. The error message in this case looks like the following: > > fatal: update_ref failed for ref 'refs/heads/branch-88': reftable: transaction prepare: I/O error > > Instrumenting the code with a couple of calls to `BUG()` in relevant > sites where we return `REFTABLE_IO_ERROR` quickly leads one to discover > that this error is caused when calling `flock_acquire()`, which is a > thin wrapper around our lockfile API. Curiously, the error code we get > in such cases is `EACCESS`, indicating that we are not allowed to access > the file. > > The root cause of this is an oddity of `CreateFileW()`, which is what > `_wopen()` uses internally. Quoting its documentation [1]: > > If you call CreateFile on a file that is pending deletion as a > result of a previous call to DeleteFile, the function fails. The > operating system delays file deletion until all handles to the file > are closed. GetLastError returns ERROR_ACCESS_DENIED. > > This behaviour is triggered quite often in the above testcase because > all the processes race with one another trying to acquire the lock for > the "tables.list" file. This is due to how locking works in the reftable > library when compacting a stack: > > 1. Lock the "tables.list" file and reads its contents. > > 2. Decide which tables to compact. > > 3. Lock each of the individual tables that we are about to compact. > > 4. Unlock the "tables.list" file. > > 5. Compact the individual tables into one large table. > > 6. Re-lock the "tables.list" file. > > 7. Write the new list of tables into it. > > 8. Commit the "tables.list" file. > > The important step is (4): we don't commit the file directly by renaming > it into place, but instead we delete the lockfile so that concurrent > processes can continue to append to the reftable stack while we compact > the tables. And because we use `DeleteFileW()` to do so, we may now race > with another process that wants to acquire that lockfile. So if we are > unlucky, we would now see `ERROR_ACCESS_DENIED` instead of the expected > `ERROR_FILE_EXISTS`, which the lockfile subsystem isn't prepared to > handle and thus it will bail out without retrying to acquire the lock. > > In theory, the issue is not limited to the reftable library and can be > triggered by every other user of the lockfile subsystem, as well. My gut > feeling tells me it's rather unlikely to surface elsewhere though. > > Fix the issue by translating the error to `EEXIST`. This makes the > lockfile subsystem handle the error correctly: in case a timeout is set > it will now retry acquiring the lockfile until the timeout has expired. > > With this, t0610 is now always passing on my machine whereas it was > previously failing in around 20-30% of all test runs. > > [1]: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew Couldn't we simply handle `EACCES` the same way as `EEXIST` in step 4? This suggestion is different from v1, which would have affected all callers of `mingw_open()`. The reason I ask is that `RtlGetLastNtStatus()` is undocumented, and should therefore not be used. I know that I will be tasked with removing that call should it be introduced into Git's source code, and naturally I'd like to avoid that. I know that e.g. PostgreSQL used this undocumented function at least at some stage, but SQLite avoided it by introducing a simple poll strategy. We could also do that, but if there is already code in the reftable library that skips doing things if a `.lock` file exists, then doing the same if the `.lock` file cannot be created, too, should be a safe argument to make. Ciao, Johannes > > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > --- > compat/mingw.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/compat/mingw.c b/compat/mingw.c > index f524c54d06d..50c80b1b750 100644 > --- a/compat/mingw.c > +++ b/compat/mingw.c > @@ -21,6 +21,9 @@ > #include "gettext.h" > #define SECURITY_WIN32 > #include <sspi.h> > +#include <winternl.h> > + > +#define STATUS_DELETE_PENDING ((NTSTATUS) 0xC0000056) > > #define HCAST(type, handle) ((type)(intptr_t)handle) > > @@ -621,6 +624,8 @@ int mingw_open (const char *filename, int oflags, ...) > wchar_t wfilename[MAX_PATH]; > open_fn_t open_fn; > > + DECLARE_PROC_ADDR(ntdll.dll, NTSTATUS, NTAPI, RtlGetLastNtStatus, void); > + > va_start(args, oflags); > mode = va_arg(args, int); > va_end(args); > @@ -644,6 +649,21 @@ int mingw_open (const char *filename, int oflags, ...) > > fd = open_fn(wfilename, oflags, mode); > > + /* > + * Internally, `_wopen()` uses the `CreateFile()` API with CREATE_NEW, > + * which may error out with ERROR_ACCESS_DENIED and an NtStatus of > + * STATUS_DELETE_PENDING when the file is scheduled for deletion via > + * `DeleteFileW()`. The file essentially exists, so we map errno to > + * EEXIST instead of EACCESS so that callers don't have to special-case > + * this. > + * > + * This fixes issues for example with the lockfile interface when one > + * process has a lock that it is about to commit or release while > + * another process wants to acquire it. > + */ > + if (fd < 0 && create && GetLastError() == ERROR_ACCESS_DENIED && > + INIT_PROC_ADDR(RtlGetLastNtStatus) && RtlGetLastNtStatus() == STATUS_DELETE_PENDING) > + errno = EEXIST; > if (fd < 0 && (oflags & O_ACCMODE) != O_RDONLY && errno == EACCES) { > DWORD attrs = GetFileAttributesW(wfilename); > if (attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_DIRECTORY)) > > -- > 2.49.0.472.ge94155a9ec.dirty > >