Re: Question about shard placement in erasure code pools

Hi Soeren,

the EC profile only defines a couple of attributes like k and m; the actual placement of the chunks is defined by the crush rule for this pool. So you'll have to deal with crush rules anyway at some point. ;-) Especially to ensure that you have at most one chunk per host. I recommend testing those rules with crushtool before applying them.
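Just as a sketch (rule name and id are placeholders here, and you'd adapt root/device class to your own map), a rule that picks 3 datacenters and then 2 distinct hosts within each would look roughly like this:

rule ec_dc_nvme {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class nvme
    step choose indep 3 type datacenter
    step chooseleaf indep 2 type host
    step emit
}

To dry-run it against your map without touching the cluster:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt        # add/adjust the rule in the text file
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 2 --num-rep 6 --show-mappings --show-bad-mappings

If --show-bad-mappings prints nothing and every mapping contains 6 OSDs, the rule resolves cleanly; you can then spot-check a few mappings against 'ceph osd tree' to confirm no host appears twice.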

It's good that you plan ahead in case you'll have more datacenters etc.; some people forget about that. You can't change the EC profile of an existing pool, you can only change the crush rule for that pool. So in this case you'll always have 6 chunks to distribute (unless you create a new pool and move the data).
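(Assigning a different rule to an existing pool is then just a one-liner, with placeholder names:

ceph osd pool set <poolname> crush_rule <new_rule_name>

and the cluster will remap/backfill the data according to the new rule.)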

In your current setup, you'll have inactive PGs when one DC is down (k=4 means min_size is 5). In such cases you can reduce min_size to 4 temporarily to continue operation, but it should be set back to 5 as soon as the down DC is back. When you get more racks, you can change the rule to have one chunk per rack, which means you'll need at least 6 racks (I recommend having at least one more failure domain to be able to recover, otherwise the PGs will stay degraded until the down rack is back).
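For example, with a placeholder pool name:

ceph osd pool set <poolname> min_size 4    # only while a DC is down
ceph osd pool set <poolname> min_size 5    # restore once the DC is back and recovery has finished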

Regards,
Eugen


Quoting Soeren Malchow <soeren.malchow@xxxxxxxxxxxx>:

Dear all,

maybe someone can explain/answer my question.

We are working on setting up a new Ceph cluster (underneath a Proxmox cluster); the setup is as follows:

3 datacenters
3 hosts per datacenter (all in one rack)
8 physical disks per host

This is the profile configuration:

ceph osd erasure-code-profile set ec_profile plugin=isa technique=reed_sol_van k=4 m=2 crush-root=default crush-failure-domain=datacenter crush-device-class=nvme crush-osds-per-failure-domain=2 crush-num-failure-domains=3

We have a hierarchy (created with a custom_location_hook) that contains:


  * datacenter 01
     * rack 01
        * host 01
        * host 02
        * host 03
  * datacenter 02
     * rack 01
        * host 01
        * host 02
        * host 03
  * datacenter 03
     * rack 01
        * host 01
        * host 02
        * host 03

So this means (to my understanding) that we are placing 2 shards of data in each datacenter.

We want to make sure that we have at most one shard of data per host, and in the future, if possible, also distribute across racks once we have more than one rack per DC.

Is the bucket hierarchy automatically taken into account when choosing the OSDs? Or if not, where do I start, with CRUSH rules? (I am hesitant to manually modify CRUSH rules.)

Thanks in advance

Soeren

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

