>>>>> "Kent" == Kent Overstreet <kent.overstreet@xxxxxxxxx> writes: > On Tue, Jul 01, 2025 at 10:43:11AM -0400, John Stoffel wrote: >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@xxxxxxxxx> writes: >> >> I wasn't sure if I wanted to chime in here, or even if it would be >> worth it. But whatever. >> >> > On Thu, Jun 26, 2025 at 08:21:23PM -0700, Linus Torvalds wrote: >> >> On Thu, 26 Jun 2025 at 19:23, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote: >> >> > >> >> > per the maintainer thread discussion and precedent in xfs and btrfs >> >> > for repair code in RCs, journal_rewind is again included >> >> >> >> I have pulled this, but also as per that discussion, I think we'll be >> >> parting ways in the 6.17 merge window. >> >> >> >> You made it very clear that I can't even question any bug-fixes and I >> >> should just pull anything and everything. >> >> > Linus, I'm not trying to say you can't have any say in bcachefs. Not at >> > all. >> >> > I positively enjoy working with you - when you're not being a dick, >> > but you can be genuinely impossible sometimes. A lot of times... >> >> Kent, you can be a dick too. Prime example, the lines above. And >> how you've treated me and others who gave feedback on bcachefs in the >> past. I'm not a programmer, I'm in IT and follow this because it's >> interesting, and I've been doing data management all my career. So >> new filesystems are interesting. > Oh yes, I can be. I apologize if I've been a dick to you personally, I > try to be nice to my users and build good working relationships. But > kernel development is a high stakes, high pressure, stressful job, as I > often remind people. I don't ever take it personally, although sometimes > we do need to cool off before we drive each other completely mad :) I appreciate this, but honestly I'll withhold judgement until I see how it goes more long term. But I'm also NOT a kernel developer, I'm an IT professional who does storage and backups and managing data. So my perspective is very definitely one of your users, or users-to-be. But I've also got a CS degree and understand programming issues and such. > If there was something that was unresolved, and you'd like me to > look at it again, I'd be more than happy to. If you want to share > what you were hitting here, I'll tell you what I know - and if it > was from a year or more ago it's most likely been fixed. Nope, it was over a year ago and it's behind me. I was trying to build the tools on Debian distro when the bcachefs-tools were a real pain to build. It's better now. >> Slow down. > This is the most critical phase in the 10+ year process of shipping a > new filesystem. Sure, but that's not what I'm trying to say here. The kernel has, as you most certainly know, a standard process for quickly deploying new versions. Linus's entire problem is that you dropped in a big chunk of code into the late release process. And none of that is critical, because if you have people running 100tb of bcachefs right now, they certainly understand that they can lose data at any time. Or at least they should if they have any sort of understanding of reliable data. bcachefs isn't there yet. It's getting close, but Linux has an amazingly complicated VFS and supports all kinds of wierd edge cases. Which sucks from the filesystem perspective. But you know this. So when you run into a major bug in the code, or potential data loss when -rc2 or later is coming out, just revert. Pull that code out because it's obviously not ready. So you wait a few months, big deal! 
Waiting gives you and the code time to stabilize.  If someone is
losing data and you want to give them a patch to try to fix it, great,
but they can take a patch from you directly.  Post it to your mailing
list.  Put it on a git branch somewhere.  But revert it from the main
Linus tree, for now.  In two months, you'll be back with better code.

bcachefs is still listed as experimental, so don't feel like you have
to keep pushing the absolute latest code into the kernel.  Just slow
it down a little to make sure you push good code.

> We're seeing continually increasing usage (hopefully by users who are
> prepared to accept that risk, but not always!), but we're not yet ready
> for true widespread deployment.

If those users are not prepared to accept the risk of an experimental
filesystem, then screw them!  They're idiots and should be treated as
such.  I would expect to be fired from my job if I bet my company's
data on bcachefs currently.  Sure, play around and test it if you
like, but if it breaks, you get to keep both pieces.  Same with
bleeding edge kernel development!

I might run pretty bleeding edge kernels at home, but only for my own
data that I realize I might lose.  But I also do backups, keep the
data on XFS and ext4 filesystems, which are stable, and I'm not trying
to do crazy things with it.  Do I have some test bcachefs volumes?
Sure do.  And I treat them like lepers: if they break, I either toss
them away or file a report, but I certainly don't keep ANY data on
them that I don't want to lose.  I'm being blunt here.

> Shipping a project as large and complex as a filesystem must be done
> incrementally, in stages where we're deploying to gradually increasing
> numbers of users, fixing everything they find and assessing where we're
> at before opening it up to more users.

Yes!  But that process also has to include rollbacks, which git has
made so so so easy.  Just accept that _if_ 6.x-rc[12345] is buggy,
then it needs to be rolled back and submitted for 6.x+1-rc1, the next
cycle, after it's been baked.  Anyone running such a bleeding edge
kernel and finding problems isn't going to care about having to
hand-apply patches; they're already doing crazy things!  *grin*

> Working with users, supporting them, checking in on how it's doing,
> and getting them the fixes for what they find is how we iterate and
> improve. The job is not done until it's working well for everyone.

Yes, I agree 100% with all this.

> Right now, everyone is concerned because this is a hotly anticipated
> project, and everyone wants to see it done right.

So which is more important?  Ship super fast and break things?  Or be
willing to revert and ship just a bit slower?

> And in 6.16, we had two massive pull requests (30+ patches in a
> week, twice in a row); that also generates concern when people are
> wondering "is this thing stabilizing?".

Correct!

> 6.16 was largely a case of a few particularly interesting bug
> reports generating a bunch of fixes (and relatively simple and
> localized fixes, which is what we like to see) for repair corner
> cases, the biggest culprit (again) being snapshots.

Sure, fixes are great.  But why did you have to drop them into -rc2 in
a big bundle?  Why not just roll back what you had submitted and say
"it's not baked enough, it needs to wait a release"?

> If you look at the bug tracker, especially rate of incoming bugs and the
> severity of bug reports (and also other sources of bug reports, like
> reddit and IRC) - yes, we are stabilizing fast.

Sure, and I'm happy for this.
And so are a bunch of other people!

> There is still a lot of work to be done, but we're on the right track.

No argument there.

> "Slowing down" is not something you do without a concrete
> reason.

And this is where you and Linus are butting heads, in my opinion.  You
want to release big patches at any time.  Linus wants to stabilize
releases and development for the entire kernel.  You're concentrating
on your small area, which is vitally important to you, but not
everyone is as invested.  Others want the latest DRM drivers, or the
latest i2c code, or some other subsystem they care about.  Linus (and
the process) is about the entire kernel.

> Right now we need to be getting those fixes out to users so
> they can keep testing and finding the next bug. When someone has
> invested time and effort learning how the system works and how to
> report bugs, we don't want them getting frustrated and leaving - we
> want to work with them, so they can keep testing and finding new
> bugs.

So post patches on your own tree that they can use; nothing stops you!

> The signals that would tell me it's time to slow down are:
> - Regressions getting through (quantity, severity, time spent on fixing
>   them)
> - Bugs getting through that show that something fundamental is
>   missing (testing, hardening), or broken in our design.
> - Frequency of bug reports going up to where I can't keep up (it's been
>   in steady, gradual decline)
>
> We actually do not want this to be 100% perfect before it sees users.
> That would result in a filesystem that's brittle - a glass cannon. We
> might get it to the point where it works 99% of the time, but then when
> it breaks we'd be in a panic - and if you discover it then, when it's in
> the wild, it's too late.
>
> The processes for how we debug and recover from failures, in the wild,
> are a huge part (perhaps the majority) of what we're working on now. That
> stuff has to be baked into the design on a deep level, and like all
> other complex design it requires continual iteration.
>
> That is how we'll get the reliability and robustness we hope to achieve.