Re: Replicas

Hi Shawn,
setting the replication size to 2 is dangerous.
It's not a question of if you lose data, but when!
With regard to data integrity, this is an extremely bad decision.
If you want, I can find the mathematical proof for you.
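
In the meantime, here is a rough back-of-envelope sketch. It is only a toy model (independent failures, an assumed 2% annual drive failure rate, an assumed 8-hour rebuild window, CRUSH placement ignored), not a measurement of your cluster, but it shows why two copies scale badly:

# Back-of-envelope comparison of size=2 vs size=3 exposure.
# Assumptions: independent failures, constant annual failure rate (AFR),
# data at risk for the whole rebuild window, CRUSH placement ignored.
N_OSDS = 50            # OSD count from the original post
AFR = 0.02             # assumed 2% annual failure rate per drive
REBUILD_HOURS = 8.0    # assumed time to re-replicate one failed OSD
HOURS_PER_YEAR = 8766.0

# Probability that one given other drive fails during the rebuild window.
p_window = AFR * REBUILD_HOURS / HOURS_PER_YEAR

# Expected number of "first" OSD failures per year in the cluster.
first_failures_per_year = N_OSDS * AFR

# size=2: any one of the other N-1 drives failing during the rebuild can
# take out the sole surviving copy of some placement groups.
p_second = 1 - (1 - p_window) ** (N_OSDS - 1)

# size=3: a third overlapping failure is also required (rough approximation).
p_third = 1 - (1 - p_window) ** (N_OSDS - 2)

print(f"Expected loss events/year at size=2: {first_failures_per_year * p_second:.4f}")
print(f"Expected loss events/year at size=3: {first_failures_per_year * p_second * p_third:.8f}")

Under this toy model the size=3 figure comes out roughly three orders of magnitude smaller, and the size=2 risk grows quickly with cluster size and rebuild time.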

Regards, Joachim





joachim.kraftmayer@xxxxxxxxx

www.clyso.com

Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306

Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote on Sat., 16 Aug 2025, 03:20:

>
>
> > I recently set up a 5-node Proxmox/Ceph cluster with 10 OSDs per node
> for a total of 50 OSDs (approximately 135 TB).
> > All the nodes have 768 GB RAM and a single 72-core (144-thread) EPYC CPU.
> > I set the Ceph pool size to 3 and the minimum size to 2.
>
> > The performance seemed good and everything was happy in cephland!
> > The other day I spoke to a ceph consultant and he recommended I change
> the pool size from 3/2 to 2/1.
>
> Danger, Will Robinson!
>
> > He cited several valid points: more usable storage space, faster
> rebuilds, better performance…
> > I followed his advice and changed it.  The benchmark performance was
> about the same, but the recovery time when I took a node down was improved.
> > I really do like the idea of the extra storage space!
> > So now I am confused whether I should leave it or go back to 3 replicas.
>
> While all of the above are true, you run the risk of data loss or
> corruption.  Sometimes data is a scratchpad or easily recreated, but if
> your VMs are using RBD volumes for boot drives, that likely isn't the case.
>
> With 2/1 there are certain sequences of drive / node / network / daemon
> failures that can result in the loss of data, or in not knowing which
> copy, if either, is actually up to date.
>
> I have seen this happen with my own eyes, after I advised $company of the
> risk and they decided it wasn't a priority.  The result was that a customer
> lost data.
>
> Say you take a node down for maintenance, and while it's down, an OSD
> drive on another node fails. Most likely it's taken writes after the first
> node was taken down, so now the only current copy of data is lost.  Off
> this mortal coil.  Pushing up the daisies.  You get the idea.  See Dan's
> presentation from Ceph Day Seattle for context.
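
To make the sequence above concrete, here is a toy walk-through in Python. The hosts, object names, and version labels are invented purely for illustration; this mirrors the logic of the failure sequence, not Ceph's internals.

# Toy model of the 2/1 failure sequence: size=2 gives each object two
# replicas, and min_size=1 lets I/O continue with a single surviving copy.
replicas = {"obj1": {"host-A": "v1", "host-B": "v1"}}

# 1. host-A goes down for maintenance; its replica stops receiving writes.
down_hosts = {"host-A"}

# 2. A client write arrives while host-A is down: only host-B is updated.
replicas["obj1"]["host-B"] = "v2"

# 3. The OSD on host-B fails before host-A comes back.
failed_hosts = {"host-B"}

# 4. Survey what is left.
for obj, copies in replicas.items():
    live = {h: v for h, v in copies.items() if h not in down_hosts | failed_hosts}
    stale = {h: v for h, v in copies.items() if h in down_hosts}
    print(f"{obj}: live copies: {live or 'none'}; stale copy on host-A: {stale}")

# The only current copy (v2) died with host-B. host-A still has v1, so the
# write is gone and there is no third replica left to arbitrate.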
>
>
> > Anyone have any thoughts or compelling reasons I should leave it or
> change it back?
>
> Unless you have very specific needs, change it back.  You can have a
> separate size=2 pool for non-critical data if you like, with all manner of
> warnings to users.
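
For completeness, reverting comes down to a couple of "ceph osd pool" commands. A minimal sketch driving them from Python is below; the pool names and the PG count are placeholders for your setup, and you can of course run the same commands by hand instead.

import subprocess

def ceph(*args: str) -> None:
    """Run one ceph CLI command and raise if it fails."""
    subprocess.run(["ceph", *args], check=True)

# Put the critical RBD pool back to 3 copies with min_size 2.
ceph("osd", "pool", "set", "your-rbd-pool", "size", "3")
ceph("osd", "pool", "set", "your-rbd-pool", "min_size", "2")

# Optional: a clearly labelled scratch pool where two copies are an
# accepted risk (min_size left at the default).
ceph("osd", "pool", "create", "scratch-size2", "64")
ceph("osd", "pool", "set", "scratch-size2", "size", "2")
ceph("osd", "pool", "application", "enable", "scratch-size2", "rbd")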
>
>
>
>
> >
> > Shawn
> > Shawn Heil
> > Phone: (608) 836-7041 | Direct: (608) 410-2333
> > Email: shawn.heil@xxxxxxxxxxxxxxxx | Website: www.brucecompany.com
> > Find us on Facebook <https://www.facebook.com/pages/The-Bruce-Company/113279807065?ref=hl>
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



