Re: Network traffic with failure domain datacenter

> On May 8, 2025, at 3:14 PM, Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:
> 
> There is also the issue that with a 4+8 EC pool you ideally need at least 4+8 = 12 instances of whatever your failure domain is, in this case DCs.

Well, in terms of a formal stretch cluster, EC isn’t actually supported today.  If the two DCs are VERY close to each other with a small RTT, one could fake it with something like 4+8, but without the mon quorum and min_size advantages of stretch mode.

With two DCs, formal stretch mode needs replicated size=4 (R4) pools.
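
For reference, setting that up looks roughly like this — an untested sketch with placeholder names (site1/site2/site3 datacenter buckets, mons a–e, a rule called stretch_rule), so check the stretch mode docs for your release:

    # connectivity-based mon elections, and a location for every mon
    ceph mon set election_strategy connectivity
    ceph mon set_location a datacenter=site1
    ceph mon set_location b datacenter=site1
    ceph mon set_location c datacenter=site2
    ceph mon set_location d datacenter=site2
    ceph mon set_location e datacenter=site3   # tiebreaker, can live elsewhere

    # replicated rule placing two copies in each DC
    rule stretch_rule {
            id 1
            type replicated
            step take site1
            step chooseleaf firstn 2 type host
            step emit
            step take site2
            step chooseleaf firstn 2 type host
            step emit
    }

    # turn it on, with mon.e as the tiebreaker
    ceph mon enable_stretch_mode e stretch_rule datacenter

Replicated pools then run at size=4 with two copies per site, which is where the R4 comes from.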


> This is more than most people have.
> 
> Is this k=4, m=8? What is the benefit of this compared to an ordinary replicated pool with 3 copies?
> 
> Even if you set the failure domain to, say, rack, there is no guarantee that a PG won't end up with more than 8 shards in a single DC without some crushmap trickery.
> 
> If this is k=8, m=4, then only 4 failures can be handled, and there is no way to split 12 shards across two DCs so that each DC holds 4 or fewer.
> 
> You really need 3 DCs and a fast, highly available network in between.

With stretch mode the typical layout is two DCs plus a tiebreaker mon elsewhere; the tiebreaker can tolerate higher latency and can even be a cloud VM.
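
And for completeness, the "crushmap trickery" for running 4+8 across two DCs is essentially a rule that picks both datacenter buckets and then 6 hosts under each, so every PG holds exactly 6 shards per site and losing a whole site still leaves k=4 readable.  Rough, untested sketch — it assumes two datacenter-type buckets under the default root, and the rule id is arbitrary:

    rule ec_4_8_two_dc {
            id 2
            type erasure
            step take default
            step choose indep 2 type datacenter
            step chooseleaf indep 6 type host
            step emit
    }

    # sanity-check the shard placement before injecting it
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt      # add the rule to crush.txt
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --rule 2 --num-rep 12 --show-mappings
    ceph osd setcrushmap -i crush.new

Keep in mind min_size (by default k+1, i.e. 5 here), and that this gets you none of the mon/quorum handling that stretch mode provides for replicated pools.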

> 
> /Peter
> 
> 
> 
> On 2025-05-08 at 17:45, Anthony D'Atri wrote:
>> To be pedantic … backfill usually means copying data in toto, so like normal write replication it necessarily has to traverse the WAN.
>> 
>> Recovery of just a lost shard/replica could in theory stay local with the LRC plugin, but as noted that doesn’t seem like a good choice.  With the default EC plugin, there *may* be some read locality preference, but it’s not something I would bank on.
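
(For the curious, an LRC profile that keeps single-chunk recovery inside a DC would look something like the sketch below — hypothetical profile name, and the k/m/l values are just the usual documentation example, not a recommendation.  Note that with numbers like these a whole-site loss is not survivable, which is part of why it isn’t a great fit here.)

    # 4 data + 2 global parity chunks, plus one local parity per group of 3,
    # with each group kept inside one datacenter
    ceph osd erasure-code-profile set lrc_dc plugin=lrc \
        k=4 m=2 l=3 \
        crush-locality=datacenter crush-failure-domain=host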
>> 
>> Stretch clusters are great when you need zero RPO, really need a single cluster, and can manage client endpoint use accordingly.  But there are tradeoffs: in many cases two clusters with async replication are a better solution; it depends on needs and what you’re solving for.
>> 
>>> On May 7, 2025, at 5:06 AM, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
>>> 
>>> On Wed, 7 May 2025 at 10:59, Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>>>> We are looking at a cluster split between two DCs with the DCs as
>>>> failure domains.
>>>> 
>>>> Am I right in assuming that any recovery or backfill taking place should
>>>> largely happen inside each DC and not between them? Or can no such
>>>> assumptions be made?
>>>> Pools would be EC 4+8, if that matters.
>>> Unless I am mistaken, the first/primary of each PG is the one "doing"
>>> the backfills, so if the primaries are evenly distributed between the
>>> sites, the source of all backfills would be in the remote DC in 50% of
>>> the cases.
>>> I do not think backfill is going to work out how to use only "local"
>>> shards to rebuild a missing/degraded PG shard without going over the
>>> DC-DC link, even if that is theoretically possible.
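
FWIW, if anyone wants to see how the primaries actually land, something along these lines shows the up/acting sets and the primary for every PG in a pool (pool name is a placeholder):

    ceph pg ls-by-pool mypool   # UP/ACTING sets plus UP_PRIMARY / ACTING_PRIMARY per PG
    ceph osd tree               # map those OSD ids back to hosts and datacenters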
>>> 
>>> -- 
>>> May the most significant bit of your life be positive.
>> It’s good to be 8-bit-clean; if you aren’t, Kermit can compensate.
>> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



