joachim.kraftmayer@xxxxxxxxx
www.clyso.com
Hohenzollernstr. 27, 80801 Munich
Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677

On Sat, 13 Sept 2025 at 08:45, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> I can't say that we have upgraded a lot of clusters from N to P, but
> those that we did upgrade showed none of the symptoms you describe.
> However, we always did the FileStore to BlueStore conversion before
> the actual upgrade. In SUSE Enterprise Storage (which we also
> supported at that time) this was pointed out as a requirement. I just
> checked the Ceph docs; I can't find such a statement (yet).
>
> > All our pools are of the following type: replicated size 3 min_size
> > 1 crush_rule 0 (or 1).
>
> I would recommend increasing min_size to 2; otherwise you let Ceph
> lose two of the three replicas before it pauses IO, which can make
> recovery difficult. Reducing min_size to 1 should only be a temporary
> measure to avoid stalling client IO during recovery.
>
> Regards,
> Eugen
>
> Quoting Olivier Delcourt <olivier.delcourt@xxxxxxxxxxxx>:
>
> > Hi,
> >
> > After reading your posts for years, I feel compelled to ask for your
> > help/advice. First, I need to explain the context of our CEPH
> > cluster, the problems we have encountered, and finally, my
> > questions. Thank you for taking the time to read this.
> > Cheers,
> > Olivier
> >
> >
> > *** Background ***
> >
> > Our CEPH cluster was created in 2015 with version 0.94.x (Hammer)
> > and has been upgraded over time to 10.2.x (Jewel), then 12.x
> > (Luminous) and then 14.x (Nautilus). The MONitors and CEPHstores
> > have always run on Debian Linux, with versions updated according to
> > the requirements of the underlying hardware and/or CEPH releases.
> >
> > In terms of hardware, we have three monitors (cephmon) and 30
> > storage servers (cephstore) spread across three datacenters. These
> > servers are connected to the network via an aggregate (LACP) of two
> > 10 Gbps fibre connections carrying two VLANs, one for the CEPH
> > frontend network and one for the CEPH backend network. That way we
> > have always kept the option of splitting the frontend and backend
> > into dedicated aggregates should the bandwidth become insufficient.
> >
> > Each of the storage servers comes with HDDs whose size varies
> > depending on the server generation, as well as SSDs whose size is
> > more consistent but still varies (depending on price).
> > The idea has always been to add HDD and SSD storage to the CEPH
> > cluster when we add storage servers to expand it or replace old
> > ones. At the OSD level, the basic rule has always been followed: one
> > device = one OSD, with metadata (FileStore) on dedicated SSD
> > partitions (up to 6 SSDs for 32 OSDs) and, for the past few years,
> > on a partitioned NVMe RAID1 (MD).
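
As a side note on the layout described above: when such an OSD is
recreated under BlueStore, the SSD/NVMe metadata partition becomes a
separate block.db device. A minimal ceph-volume sketch, with purely
illustrative device names:

  # one data device per OSD, RocksDB/WAL on an NVMe partition
  # (device names are examples only)
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2
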
> >
> > In total, we have:
> >
> > 100 hdd 10.90999 TB
> >  48 hdd 11.00000 TB
> >  48 hdd 14.54999 TB
> >  24 hdd 15.00000 TB
> >   9 hdd  5.45999 TB
> > 108 hdd  9.09999 TB
> >
> >  84 ssd 0.89400 TB
> > 198 ssd 0.89424 TB
> >  18 ssd 0.93599 TB
> >  32 ssd 1.45999 TB
> >  16 ssd 1.50000 TB
> >  48 ssd 1.75000 TB
> >  24 ssd 1.79999 TB
> >
> > --- RAW STORAGE ---
> > CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> > hdd    3.6 PiB  1.7 PiB  1.9 PiB  1.9 PiB   53.45
> > ssd    480 TiB  321 TiB  158 TiB  158 TiB   33.04
> > TOTAL  4.0 PiB  2.0 PiB  2.1 PiB  2.1 PiB   51.08
> >
> > Regarding the CRUSH map: since classes (ssd/hdd) did not exist when
> > the CEPH cluster was launched and we wanted to be able to create
> > pools on either disk or flash storage, we created two trees:
> >
> > ID    CLASS  WEIGHT      TYPE NAME                  STATUS  REWEIGHT  PRI-AFF
> >   -2         3660.19751  root main_storage
> >  -11         1222.29883      datacenter DC1
> >  -68          163.79984          host cephstore16
> > -280          109.09988          host cephstore28
> >  -20          109.09988          host cephstore34
> > -289          109.09988          host cephstore31
> >  -31          116.39990          host cephstore40
> > -205          116.39990          host cephstore37
> >  -81          109.09988          host cephstore22
> >  -71          163.79984          host cephstore19
> >  -84          109.09988          host cephstore25
> > -179          116.39990          host cephstore43
> >  -12         1222.29883      datacenter DC2
> >  -69          163.79984          host cephstore17
> >  -82          109.09988          host cephstore23
> > -295          109.09988          host cephstore32
> >  -72          163.79984          host cephstore20
> > -283          109.09988          host cephstore29
> >  -87          109.09988          host cephstore35
> >  -85          109.09988          host cephstore26
> > -222          116.39990          host cephstore44
> >  -36          116.39990          host cephstore41
> > -242          116.39990          host cephstore38
> >  -25         1215.59998      datacenter DC3
> >  -70          163.80000          host cephstore18
> >  -74          163.80000          host cephstore21
> >  -83           99.00000          host cephstore24
> >  -86          110.00000          host cephstore27
> > -286          110.00000          host cephstore30
> > -298           99.00000          host cephstore33
> > -102          110.00000          host cephstore36
> > -304          120.00000          host cephstore39
> > -136          120.00000          host cephstore42
> > -307          120.00000          host cephstore45
> >
> >   -1          516.06305  root high-speed_storage
> >  -21          171.91544      datacenter xDC1
> >  -62           16.84781          host xcephstore16
> > -259           14.00000          host xcephstore28
> >   -3           14.00000          host xcephstore34
> > -268           14.00000          host xcephstore31
> > -310           14.30786          host xcephstore40
> > -105           14.30786          host xcephstore37
> >  -46           30.68784          host xcephstore10
> >  -75           11.67993          host xcephstore22
> >  -61           16.09634          host xcephstore19
> >  -78           11.67993          host xcephstore25
> > -322           14.30786          host xcephstore43
> >  -15          171.16397      datacenter xDC2
> >  -63           16.09634          host xcephstore17
> >  -76           11.67993          host xcephstore23
> > -274           14.00000          host xcephstore32
> >  -65           16.09634          host xcephstore20
> > -262           14.00000          host xcephstore29
> >  -13           14.00000          host xcephstore35
> >  -79           11.67993          host xcephstore26
> >  -51           30.68784          host xcephstore11
> > -325           14.30786          host xcephstore44
> > -313           14.30786          host xcephstore41
> > -175           14.30786          host xcephstore38
> >  -28          172.98364      datacenter xDC3
> >  -56           30.68784          host xcephstore12
> >  -64           16.09200          host xcephstore18
> >  -67           16.09200          host xcephstore21
> >  -77           12.00000          host xcephstore24
> >  -80           12.00000          host xcephstore27
> > -265           14.39999          host xcephstore30
> > -277           14.39990          host xcephstore33
> >  -17           14.39999          host xcephstore36
> > -204           14.30399          host xcephstore39
> > -319           14.30399          host xcephstore42
> > -328           14.30396          host xcephstore45
> >
> > Our allocation rules are:
> >
> > # rules
> > rule main_storage_ruleset {
> >     id 0
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take main_storage
> >     step chooseleaf firstn 0 type datacenter
> >     step emit
> > }
> > rule high-speed_storage_ruleset {
> >     id 1
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take high-speed_storage
> >     step chooseleaf firstn 0 type datacenter
> >     step emit
> > }
> >
> > All our pools are of the following type: replicated size 3 min_size
> > 1 crush_rule 0 (or 1).
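
Raising min_size to 2, as Eugen recommends above, is a single online
command per pool; the pool name below is only a placeholder:

  # repeat for each replicated pool
  ceph osd pool set <pool_name> min_size 2
  # confirm the change
  ceph osd pool ls detail | grep min_size
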
> >
> > This CEPH cluster is currently only used for RBD. The volumes are
> > used by our ~1,200 KVM VMs.
> >
> >
> > *** Problems ***
> >
> > Everything was working fine until last August, when we scheduled an
> > upgrade from CEPH 14.x (Nautilus) to 16.x (Pacific) (together with
> > an upgrade from Debian 10 to Debian 11, which was not a problem).
> >
> > * First problem:
> > We were forced to switch from FileStore to BlueStore in an
> > emergency, unscheduled manner, because after upgrading the CEPH
> > packages on the first storage server the FileStore OSDs would no
> > longer start. We did not have this problem on our small test
> > cluster, which obviously did not have the "same upgrade life" as the
> > production cluster. We therefore took the opportunity, DC by DC
> > (since this is our "failure domain"), not only to upgrade CEPH but
> > also to recreate the OSDs in BlueStore.
> >
> > * Second problem:
> > Since our failure domain is a DC, we had to upgrade a DC and then
> > wait for it to recover (~500 TB net). SSD storage recovery takes a
> > few hours, while HDD storage recovery takes approximately three
> > days.
> > Here we saw our SSD-type OSDs fill up at a rate of ~2% every 3 hours
> > (the phenomenon is also observed on HDD-type OSDs, but as we have
> > much more capacity there, it is less critical).
> > Manual (re)weight changes only provided a temporary solution and,
> > despite all our attempts (OSD restarts, etc.), we reached the
> > critical full_ratio threshold, which is 0.97 for us.
> > I'll leave you to imagine the effect on the virtual machines and the
> > services provided to our users.
> > We also saw very strong growth in the size of the MONitor databases
> > (~3 GB -> 100 GB); compaction did not really help.
> > Once our VMs were shut down (crashed), the cluster completed its
> > recovery (HDD-type OSDs) and, curiously, the SSD-type OSDs began to
> > "empty".
> >
> > The day after that, we began updating the storage servers in our
> > second DC, and the phenomenon started again. This time we did not
> > wait until we reached full_ratio to shut down our virtualisation
> > environment, and the "SSD" OSDs began to "empty" after the following
> > commands: ceph osd unset noscrub && ceph osd unset nodeep-scrub.
> >
> > In fact, we used to block scrubs and deep scrubs during massive
> > upgrades and recoveries to save I/O. This never caused any problems
> > with FileStore.
> > It should be added that since we started using the CEPH cluster
> > (2015), scrubs have only been enabled at night so as not to impact
> > production I/O, via the following options: osd_recovery_delay_start
> > = 5, osd_scrub_begin_hour = 19, osd_scrub_end_hour = 7,
> > osd_scrub_sleep = 0.1 (the latter may be removed since classes are
> > now available).
> >
> > After this second total recovery of the CEPH cluster and the restart
> > of the virtualisation environment, we still have the third DC (10
> > cephstore) to upgrade from CEPH 14 to 16, and our "SSD" OSDs are
> > filling up again until the automatic activation of
> > scrubs/deep-scrubs at 7 p.m.
> > Since then, progress has stopped: the utilisation of the various
> > OSDs is stable and more or less evenly distributed (via the active
> > upmap balancer).
> >
> >
> > *** Questions / Assumptions / Opinions ***
> >
> > Have you ever encountered a similar phenomenon? We agree that having
> > different versions of OSDs coexisting is not a good solution and is
> > not desirable in the medium term, but we are dependent on the
> > recovery time (and, in addition, on the issue I am presenting to you
> > here).
> >
> > Our current hypothesis, given that stability returned and that we
> > never had this problem with FileStore OSDs, is that some kind of
> > "housekeeping" of BlueStore OSDs happens via scrubs. Does that make
> > sense? Any clues? Ideas?
> >
> > I also read on the Internet (somewhere...) that in any case, when
> > the cluster is not "healthy", scrubs are suspended by default.
> > Indeed, in our case:
> >
> > root@cephstore16:~# ceph daemon osd.11636 config show | grep "osd_scrub_during_recovery"
> >     "osd_scrub_during_recovery": "false",
> >
> > Could this explain why no cleaning is performed during the three
> > days of recovery, and if BlueStore does not perform its maintenance,
> > it fills up?
> >
> > (It would be possible to temporarily change this behaviour via: ceph
> > tell 'osd.*' injectargs --osd-scrub-during-recovery=1 (to be
> > tested).)
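
On Nautilus and later, that switch can also be flipped (and reverted)
through the central config store rather than injectargs, so it
survives OSD restarts. A small sketch, assuming you turn it back off
once recovery has finished:

  # check the current value on one OSD via its admin socket
  ceph daemon osd.11636 config get osd_scrub_during_recovery
  # allow scrubbing while recovery is running, cluster-wide and persistent
  ceph config set osd osd_scrub_during_recovery true
  # revert to the default once the cluster is healthy again
  ceph config rm osd osd_scrub_during_recovery
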
> >
> > Do you have any suggestions for things to check? Although we have
> > experience with FileStore, we have not yet had time to gain
> > experience with BlueStore.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx