joachim.kraftmayer@xxxxxxxxx
www.clyso.com
Hohenzollernstr. 27, 80801 Munich
Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677

On Sat, 13 Sept 2025 at 08:45, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> I can't say that we have upgraded a lot of clusters from N to P, but
> those that we did upgrade showed none of the symptoms you describe.
> However, we always did the FileStore to BlueStore conversion before
> the actual upgrade. In SUSE Enterprise Storage (which we also
> supported at that time) this was pointed out as a requirement. I just
> checked the Ceph docs; I can't find such a statement (yet).
>
> > All our pools are of the following type: replicated size 3 min_size
> > 1 crush_rule 0 (or 1).
>
> I would recommend increasing min_size to 2; otherwise you let Ceph
> lose two of the three replicas before it pauses IO, which can make
> recovery difficult. Reducing min_size to 1 should only be a temporary
> measure to avoid stalling client IO during recovery.
>
> Regards,
> Eugen
>
> Quoting Olivier Delcourt <olivier.delcourt@xxxxxxxxxxxx>:
>
> > Hi,
> >
> > After reading your posts for years, I feel compelled to ask for your
> > help/advice. First, I need to explain the context of our CEPH
> > cluster, the problems we have encountered, and finally, my
> > questions. Thank you for taking the time to read this.
> > Cheers,
> > Olivier
> >
> >
> > *** Background ***
> >
> > Our CEPH cluster was created in 2015 with version 0.94.x (Hammer)
> > and has been upgraded over time to 10.2.x (Jewel), then 12.x
> > (Luminous) and then 14.x (Nautilus). The MONitors and CEPHstores
> > have always run on Debian Linux, with versions updated according to
> > the requirements of the underlying hardware and/or CEPH releases.
> >
> > In terms of hardware, we have three monitors (cephmon) and 30
> > storage servers (cephstore) spread across three datacenters. These
> > servers are connected to the network via an aggregate (LACP) of two
> > 10 Gbps fibre connections carrying two VLANs, one for the CEPH
> > frontend network and one for the CEPH backend network. That way we
> > have always kept the option of splitting the frontend and backend
> > into dedicated aggregates should the bandwidth become insufficient.
> >
> > Each of the storage servers comes with HDDs whose size varies
> > depending on the server generation, as well as SSDs whose size is
> > more consistent but still varies (depending on price).
> > The idea has always been to add HDD and SSD storage to the CEPH
> > cluster when we add storage servers to expand it or replace old
> > ones. At the OSD level, the basic rule has always been followed: one
> > device = one OSD, with metadata (FileStore) on dedicated SSD
> > partitions (up to 6 SSDs for 32 OSDs) and, for the past few years,
> > on a partitioned NVMe RAID1 (MD).
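
As a side note on the layout described above: when such an OSD is
recreated under BlueStore, the SSD/NVMe metadata partition becomes a
separate block.db device. A minimal ceph-volume sketch, with purely
illustrative device names:

  # one data device per OSD, RocksDB/WAL on an NVMe partition
  # (device names are examples only)
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2
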
> >
> > In total, we have:
> >
> > 100 hdd 10.90999 TB
> >  48 hdd 11.00000 TB
> >  48 hdd 14.54999 TB
> >  24 hdd 15.00000 TB
> >   9 hdd  5.45999 TB
> > 108 hdd  9.09999 TB
> >
> >  84 ssd 0.89400 TB
> > 198 ssd 0.89424 TB
> >  18 ssd 0.93599 TB
> >  32 ssd 1.45999 TB
> >  16 ssd 1.50000 TB
> >  48 ssd 1.75000 TB
> >  24 ssd 1.79999 TB
> >
> > --- RAW STORAGE ---
> > CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> > hdd    3.6 PiB  1.7 PiB  1.9 PiB  1.9 PiB   53.45
> > ssd    480 TiB  321 TiB  158 TiB  158 TiB   33.04
> > TOTAL  4.0 PiB  2.0 PiB  2.1 PiB  2.1 PiB   51.08
> >
> > Regarding the CRUSH map: since classes (ssd/hdd) did not exist when
> > the CEPH cluster was launched and we wanted to be able to create
> > pools on either disk or flash storage, we created two trees:
> >
> > ID    CLASS  WEIGHT      TYPE NAME                  STATUS  REWEIGHT  PRI-AFF
> >   -2         3660.19751  root main_storage
> >  -11         1222.29883      datacenter DC1
> >  -68          163.79984          host cephstore16
> > -280          109.09988          host cephstore28
> >  -20          109.09988          host cephstore34
> > -289          109.09988          host cephstore31
> >  -31          116.39990          host cephstore40
> > -205          116.39990          host cephstore37
> >  -81          109.09988          host cephstore22
> >  -71          163.79984          host cephstore19
> >  -84          109.09988          host cephstore25
> > -179          116.39990          host cephstore43
> >  -12         1222.29883      datacenter DC2
> >  -69          163.79984          host cephstore17
> >  -82          109.09988          host cephstore23
> > -295          109.09988          host cephstore32
> >  -72          163.79984          host cephstore20
> > -283          109.09988          host cephstore29
> >  -87          109.09988          host cephstore35
> >  -85          109.09988          host cephstore26
> > -222          116.39990          host cephstore44
> >  -36          116.39990          host cephstore41
> > -242          116.39990          host cephstore38
> >  -25         1215.59998      datacenter DC3
> >  -70          163.80000          host cephstore18
> >  -74          163.80000          host cephstore21
> >  -83           99.00000          host cephstore24
> >  -86          110.00000          host cephstore27
> > -286          110.00000          host cephstore30
> > -298           99.00000          host cephstore33
> > -102          110.00000          host cephstore36
> > -304          120.00000          host cephstore39
> > -136          120.00000          host cephstore42
> > -307          120.00000          host cephstore45
> >
> >   -1          516.06305  root high-speed_storage
> >  -21          171.91544      datacenter xDC1
> >  -62           16.84781          host xcephstore16
> > -259           14.00000          host xcephstore28
> >   -3           14.00000          host xcephstore34
> > -268           14.00000          host xcephstore31
> > -310           14.30786          host xcephstore40
> > -105           14.30786          host xcephstore37
> >  -46           30.68784          host xcephstore10
> >  -75           11.67993          host xcephstore22
> >  -61           16.09634          host xcephstore19
> >  -78           11.67993          host xcephstore25
> > -322           14.30786          host xcephstore43
> >  -15          171.16397      datacenter xDC2
> >  -63           16.09634          host xcephstore17
> >  -76           11.67993          host xcephstore23
> > -274           14.00000          host xcephstore32
> >  -65           16.09634          host xcephstore20
> > -262           14.00000          host xcephstore29
> >  -13           14.00000          host xcephstore35
> >  -79           11.67993          host xcephstore26
> >  -51           30.68784          host xcephstore11
> > -325           14.30786          host xcephstore44
> > -313           14.30786          host xcephstore41
> > -175           14.30786          host xcephstore38
> >  -28          172.98364      datacenter xDC3
> >  -56           30.68784          host xcephstore12
> >  -64           16.09200          host xcephstore18
> >  -67           16.09200          host xcephstore21
> >  -77           12.00000          host xcephstore24
> >  -80           12.00000          host xcephstore27
> > -265           14.39999          host xcephstore30
> > -277           14.39990          host xcephstore33
> >  -17           14.39999          host xcephstore36
> > -204           14.30399          host xcephstore39
> > -319           14.30399          host xcephstore42
> > -328           14.30396          host xcephstore45
> >
> > Our allocation rules are:
> >
> > # rules
> > rule main_storage_ruleset {
> >     id 0
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take main_storage
> >     step chooseleaf firstn 0 type datacenter
> >     step emit
> > }
> > rule high-speed_storage_ruleset {
> >     id 1
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take high-speed_storage
> >     step chooseleaf firstn 0 type datacenter
> >     step emit
> > }
> >
> > All our pools are of the following type: replicated size 3 min_size
> > 1 crush_rule 0 (or 1).
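
Raising min_size to 2, as Eugen recommends above, is a single online
command per pool; the pool name below is only a placeholder:

  # repeat for each replicated pool
  ceph osd pool set <pool_name> min_size 2
  # confirm the change
  ceph osd pool ls detail | grep min_size
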
> >
> > This CEPH cluster is currently only used for RBD. The volumes are
> > used by our ~1,200 KVM VMs.
> >
> >
> > *** Problems ***
> >
> > Everything was working fine until last August, when we scheduled an
> > upgrade from CEPH 14.x (Nautilus) to 16.x (Pacific) (together with
> > an upgrade from Debian 10 to Debian 11, which was not a problem).
> >
> > * First problem:
> > We were forced to switch from FileStore to BlueStore in an
> > emergency, unscheduled manner, because after upgrading the CEPH
> > packages on the first storage server the FileStore OSDs would no
> > longer start. We did not have this problem on our small test
> > cluster, which obviously did not have the "same upgrade life" as the
> > production cluster. We therefore took the opportunity, DC by DC
> > (since this is our "failure domain"), not only to upgrade CEPH but
> > also to recreate the OSDs in BlueStore.
> >
> > * Second problem:
> > Since our failure domain is a DC, we had to upgrade a DC and then
> > wait for it to recover (~500 TB net). SSD storage recovery takes a
> > few hours, while HDD storage recovery takes approximately three
> > days.
> > Here we saw our SSD-type OSDs fill up at a rate of ~2% every 3 hours
> > (the phenomenon is also observed on HDD-type OSDs, but as we have
> > much more capacity there, it is less critical).
> > Manual (re)weight changes only provided a temporary solution and,
> > despite all our attempts (OSD restarts, etc.), we reached the
> > critical full_ratio threshold, which is 0.97 for us.
> > I'll leave you to imagine the effect on the virtual machines and the
> > services provided to our users.
> > We also saw very strong growth in the size of the MONitor databases
> > (~3 GB -> 100 GB); compaction did not really help.
> > Once our VMs were shut down (crashed), the cluster completed its
> > recovery (HDD-type OSDs) and, curiously, the SSD-type OSDs began to
> > "empty".
> >
> > The day after that, we began updating the storage servers in our
> > second DC, and the phenomenon started again. This time we did not
> > wait until we reached full_ratio to shut down our virtualisation
> > environment, and the "SSD" OSDs began to "empty" after the following
> > commands: ceph osd unset noscrub && ceph osd unset nodeep-scrub.
> >
> > In fact, we used to block scrubs and deep scrubs during massive
> > upgrades and recoveries to save I/O. This never caused any problems
> > with FileStore.
> > It should be added that since we started using the CEPH cluster
> > (2015), scrubs have only been enabled at night so as not to impact
> > production I/O, via the following options: osd_recovery_delay_start
> > = 5, osd_scrub_begin_hour = 19, osd_scrub_end_hour = 7,
> > osd_scrub_sleep = 0.1 (the latter may be removed since classes are
> > now available).
> >
> > After this second total recovery of the CEPH cluster and the restart
> > of the virtualisation environment, we still have the third DC (10
> > cephstore) to upgrade from CEPH 14 to 16, and our "SSD" OSDs are
> > filling up again until the automatic activation of
> > scrubs/deep-scrubs at 7 p.m.
> > Since then, progress has stopped: the utilisation of the various
> > OSDs is stable and more or less evenly distributed (via the active
> > upmap balancer).
> >
> >
> > *** Questions / Assumptions / Opinions ***
> >
> > Have you ever encountered a similar phenomenon? We agree that having
> > different versions of OSDs coexisting is not a good solution and is
> > not desirable in the medium term, but we are dependent on the
> > recovery time (and, in addition, on the issue I am presenting to you
> > here).
> >
> > Our current hypothesis, given that stability returned and that we
> > never had this problem with FileStore OSDs, is that some kind of
> > "housekeeping" of BlueStore OSDs happens via scrubs. Does that make
> > sense? Any clues? Ideas?
> >
> > I also read on the Internet (somewhere...) that in any case, when
> > the cluster is not "healthy", scrubs are suspended by default.
> > Indeed, in our case:
> >
> > root@cephstore16:~# ceph daemon osd.11636 config show | grep "osd_scrub_during_recovery"
> >     "osd_scrub_during_recovery": "false",
> >
> > Could this explain why no cleaning is performed during the three
> > days of recovery, and if BlueStore does not perform its maintenance,
> > it fills up?
> >
> > (It would be possible to temporarily change this behaviour via: ceph
> > tell 'osd.*' injectargs --osd-scrub-during-recovery=1 (to be
> > tested).)
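
On Nautilus and later, that switch can also be flipped (and reverted)
through the central config store rather than injectargs, so it
survives OSD restarts. A small sketch, assuming you turn it back off
once recovery has finished:

  # check the current value on one OSD via its admin socket
  ceph daemon osd.11636 config get osd_scrub_during_recovery
  # allow scrubbing while recovery is running, cluster-wide and persistent
  ceph config set osd osd_scrub_during_recovery true
  # revert to the default once the cluster is healthy again
  ceph config rm osd osd_scrub_during_recovery
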
> >
> > Do you have any suggestions for things to check? Although we have
> > experience with FileStore, we have not yet had time to gain
> > experience with BlueStore.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx