On Wed, May 14, 2025 at 03:04:17PM +0200, Christoph Hellwig wrote: > On Wed, May 14, 2025 at 10:00:28AM +0200, Carlos Maiolino wrote: > > I agree with you here, and we could slowly start marking those shared accesses > > as racy, but bots spitting false-positivies all the time doesn't help much, > > other than taking somebody's else time to look into the report. > > > > Taking as example one case in the previous report, where the report complained > > about concurrent bp->b_addr access during the buffer instantiation. > > I'd like to understand that one a bit more. It might be because the > validator doesn't understand a semaphore used as lock is a lock, but > I'll follow up there. Even so, it's not a race because at that point in time the object cannot be seen by anything other than the process that is initialising it. We initialise objects without holding the object locked all over the place - if this is something that the sanitiser cannot determine automatically, then we're also going to need annotations to indicate that the initialisation function(s) contain valid data races. > > > So, I think Dave has a point too. Like what happens with syzkaller > > and random people reporting random syzkaller warnings. > > > > While I appreciate the reports too, I think it would be fair for the reporters > > to spend some time to at least craft a RFC patch fixing the warning. > > Well, it was polite mails about their finding, which I find useful. > If we got a huge amount of spam that might be different. That's kinda of my point about it being the "thin edge of the wedge". Once we indicate that this is something we want reported so we can address, then the floodgates open. I'm wary of this, because at this point I suspect that there aren't a lot of people with sufficient time and knowledge to adequately address these issues. Asking the reporter to "send a patch to data_race() that instance" isn't a good solution to the problem because it doesn't address the wider issue indicated by the specific report. It just kicks the can down the road and introduces technical debt that we will eventually have to address. We should have learnt this lesson from lockdep - false positive whack-a-mole shut up individual reports but introduced technical debt that had to be addressed later because whack-a-mole didn't address the underlying issues. When the stack of cards fell, someone (i.e. me) had to work out how to do lockdep annotations properly (e.g. via the complex inode locking subclass stuff) to make the issues go away and require minimal long term maintenance. I don't want to see this pattern repeated. We need a sane policy for addressing the entire classes of issuesi underlying individual reports, not just apply a band-aid that silences the indivual report. So, let's look at what kcsan provides us with. We need functions like xfs_vn_getattr(), the allocation AG selection alogrithms and object initialisation functions to be marked as inherently racy, because they intentionally don't hold locks for any of the accesses they make. kcsan provides: /* comment on why __no_kcsan is needed */ __no_kcsan void xfs_foo( struct xfs_bar *bar) { ... the __no_kcsan attribute for function declarations to mark the entire function as inherently racy and so should be ignored. For variables like ip->i_delayed_blks, where we intentionally serialise modifications but do not serialise readers, we have: - uint64_t i_delayed_blks; /* count of delay alloc blks */ + uint64_t __data_racy i_delayed_blks; /* count of delay alloc blks */ This means all accesses to the variable are considered to be racy and kcsan will ignore it and not report it. We can do the same for lip->li_lsn and other variables. IOWs, we do not need to spew data_race() wrappers or random READ_ONCE/WRITE_ONCE macros all over the code to silence known false positives. If we mark the variables and functions we know have racy accesses, these one-line changes will avoid -all- false positives on those variables/from those functions. This, IMO, is a much better solution than "send a patch marking that access as data_race()". We fix the issue for all accesses and entire algorithms with simple changes to variable and/or function declarations. To reiterate my point: if we are goign to make XFS KCSAN friendly, we need to work out how to do it quickly and efficiently before we start changing code. Engaging in random whack-a-mole shootdown games will not lead to a viable long term solution, so let's not waste time on playing whack-a-mole. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx