Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)

Vlastimil Babka <vbabka@xxxxxxx> · Mon, 26 May 2025 17:31:39 +0200

On 5/26/25 17:06, Jens Axboe wrote:
> On 5/26/25 7:05 AM, Jens Axboe wrote:
>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>> 
>> Thanks for taking a look at this! I tried to reproduce this this morning
>> and failed miserably. I then injected a delay for the above case, and it
>> does indeed then trigger for me. So far, so good.
>> 
>> I agree with your analysis, we should only be doing the dropbehind for a
>> non-zero return from __folio_end_writeback(), and that includes the
>> test_and_clear to avoid dropping the drop-behind state. But we also need
>> to check/clear this state pre __folio_end_writeback(), which then puts
>> us in a spot where it needs to potentially be re-set. Which feels pretty
>> racy...
>> 
>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
>> suspect this would've taken a while to run into.
> 
> Took a closer look... I may be smoking something good here, but I don't
> see what the __folio_end_writeback() return value has to do with this
> at all. Regardless of what it returns, it should've cleared
> PG_writeback, and in fact the only thing it returns is whether or not we
> had anyone waiting on it. Which should have _zero_ bearing on whether or
> not we can clear/invalidate the range.

Yeah it's very much possible that I was wrong, folio_xor_flags_has_waiters()
looked a bit impenetrable to me, and it seemed like a simple explanation for
the splats. But as you had to add delays, this indeed smells like a race.
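
For reference, a rough userspace model of the semantics described in the
quoted text above: the writeback bit is cleared unconditionally, and the
return value only reports whether a waiter was present. This is a sketch
under assumptions, not the kernel code. All demo_* names and the bit
positions are made up for illustration, and C11 atomics stand in for the
kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions, chosen for the demo only. */
#define DEMO_WRITEBACK (1UL << 1)  /* writeback in flight */
#define DEMO_WAITERS   (1UL << 7)  /* someone is sleeping on the folio */

static_assert(DEMO_WRITEBACK != DEMO_WAITERS, "flags must be distinct bits");

/* Hypothetical stand-in for a folio: just a flags word. */
struct demo_folio {
	atomic_ulong flags;
};

/*
 * Flip the bits in @mask with a single atomic XOR and report whether
 * the waiters bit was set in the value seen before the flip.
 */
static bool demo_xor_flags_has_waiters(struct demo_folio *f, unsigned long mask)
{
	unsigned long old = atomic_fetch_xor(&f->flags, mask);

	return (old & DEMO_WAITERS) != 0;
}

/*
 * Caller guarantees the writeback bit is set, so the XOR clears it.
 * The return value says nothing about whether the clear happened.
 */
static bool demo_end_writeback(struct demo_folio *f)
{
	return demo_xor_flags_has_waiters(f, DEMO_WRITEBACK);
}

static bool demo_test_writeback(struct demo_folio *f)
{
	return (atomic_load(&f->flags) & DEMO_WRITEBACK) != 0;
}
```

In this model, demo_end_writeback() returns false for a folio with no
waiters and true for one with waiters, but demo_test_writeback() is false
afterwards in both cases.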

> To me, this smells more like a race of some sort, between dirty and
> invalidation. fsx does a lot of sub-page sized operations.
> 
> I'll poke a bit more...
>