Hi,

Thanks Anthony - as always a very useful and comprehensive response.

Yes, there were only 139 OSDs before and, indeed, the raw capacity increased. I also noticed that the "MAX AVAIL" column from "ceph df" is getting higher (177 TB), so it seems the capacity is being added.

The drives used in the ssd_class are 7 TB Micron 5400s.

The failed daemon is the manager - I will follow your advice and add one more.

The BlueStore warning lifetime was set to 600 - I will change it to 300.

The many MDS daemons are due to the fact that I have over 150 CephFS clients and thought that deploying a daemon on each host for each filesystem would provide better performance - was I wrong?

The high number of remapped PGs is due to my improperly using the osd.all-available-devices unmanaged setting, so adding the new drives triggered automatically detecting and adding them. I am not sure what I did wrong, though - see below the output from "ceph orch ls" from before adding the drives. Shouldn't setting it like that prevent automatic discovery?

NAME                         PORTS                      RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                 ?:9093,9094                    1/1  7m ago     5w   count:1
ceph-exporter                                               7/7  10m ago    5w   *
crash                                                       7/7  10m ago    5w   *
grafana                      ?:3001                         1/1  7m ago     4w   count:1
ingress.rgw.default.default  192.168.122.239:1969,8080      4/4  10m ago    4w   ceph-host-4;ceph-host-5
loki                         ?:3100                         1/1  7m ago     4w   ceph-host-2;count:1
mds.hdd_ec_archive                                          5/5  10m ago    5w   ceph-host-1;ceph-host-5;ceph-host-4;ceph-host-6;ceph-host-7;count:5
mds.project_ec_ssd                                          7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mds.projects_rep_nvme                                       7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mds.projects_rep_ssd                                        7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mgr                                                         2/2  7m ago     5w   count:2
mon                                                         5/5  8m ago     5w   count:5
node-exporter                ?:9100                         7/7  10m ago    5w   *
osd.all-available-devices                                     0  -          4w   <unmanaged>
osd.hdd_osds                                                 72  10m ago    5w   *
osd.nvme_osds                                                25  10m ago    5w   *
osd.ssd_osds                                                 84  10m ago    3w   ceph-host-1
prometheus                   ?:9095                         1/1  7m ago     5w   count:1
promtail                     ?:9080                         7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
rgw.default.default          ?:80                           0/2  10m ago    4w   count-per-host:1;label:rgw

The resources mentioned are very useful - running upmap-remapped.py did bring the number of PGs to be remapped close to zero.

I just wanted to clarify the steps needed because, doing it like below (also sketched as concrete commands further down), I eventually end up again with lots of remapped PGs and an "unhappy" balancer:

1. # ceph osd set norebalance
   # ceph balancer off
2. upmap-remapped.py | sh
3. change target_max_misplaced_ratio to a number higher than the default 0.005 (since we want to rebalance faster and client performance is not a huge issue)
4. enable the balancer
5. wait

Doing it like this will eventually increase the number of misplaced PGs until it is higher than the ratio, at which point, I guess, the balancer stops:

  "optimize_result": "Too many objects (0.115742 > 0.090000) are misplaced; try again later"

Should I repeat the process whenever the number of misplaced objects is higher than the ratio, or what is the proper way of doing it?

Steven

On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
>
> The ultimate answer to Ceph, the cluster, and everything!
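(For reference, here are steps 1-5 from my message above expressed as commands - just a sketch; 0.09 matches the threshold the balancer reports in the optimize_result, and upmap-remapped.py is assumed to be in the working directory:)

# ceph osd set norebalance
# ceph balancer off
# upmap-remapped.py | sh                               (re-run until remapped PGs are near zero)
# ceph config set mgr target_max_misplaced_ratio 0.09
# ceph osd unset norebalance                           (misplaced PGs will not backfill while this flag is set)
# ceph balancer on
# ceph balancer status                                 (then wait, watching this and "ceph -s")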
> > My expectation was that the pools configured to use "hdd_class" will
> > have their capacity increased ( e.g. default.rgw.buckets.data which
> > uses an EC 4+2 pool for data )
>
> First, did the raw capacity increase when you added these drives?
>
> > --- RAW STORAGE ---
> > CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
> > hdd_class  1.4 PiB  814 TiB  579 TiB  579 TiB   41.54
>
> Was the number of OSDs previously 139?
>
> > It seems it is not happening ...yet ?!
> > Is it because the peering is still going ?
>
> Ceph nomenclature can be mystifying at first. And sometimes at thirteenth.
>
> Peering is daemons checking in with each other to ensure they're in agreement.
>
> I think you mean backfill / balancing.
>
> The available space reported by "ceph df" for a *pool* is a function of:
>
> * Raw space available in the associated CRUSH rule's device class (or, if the rule isn't ideal, all device classes)
> * The cluster's three full ratios   # ceph osd dump | grep ratio
> * The fullness of the single most-full OSD in the device class
>
> BTW I learned only yesterday that you can restrict `ceph osd df` by specifying a device class, so try running
>
> `ceph osd df hdd_class | tail -10`
>
> Notably, this will show you the min/max variance among OSDs of just that device class, and the standard deviation.
> When you have multiple OSD sizes, these figures are much less useful when calculated across the whole cluster by "ceph osd df"
>
> # ceph osd df hdd
> ...
> 318  hdd  18.53969  1.00000  19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127  up
> 319  hdd  18.53969  1.00000  19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129  up
> 320  hdd  18.53969  1.00000  19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129  up
>  30  hdd  18.53969  1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26  up
>                       TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>
> You can even run this for a specific OSD so you don't have to get creative with an egrep regex or exercise your pattern-matching skills, though the summary values naturally aren't useful.
>
> # ceph osd df osd.30
> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
> 30  hdd    18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26  up
>                         TOTAL  19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>
> Here there's a wide variation among the hdd OSDs because osd.30 had been down for a while and was recently restarted due to a host reboot, so it's slowly filling with data.
>
> > ssd_class 6.98630
>
> That seems like an unusual size, what are these? Are they SAN LUNs?
>
> > Below are outputs from
> > ceph -s
> > ceph df
> > ceph osd df tree
>
> Thanks for providing the needful up front.
>
> >   cluster:
> >     id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
> >     health: HEALTH_WARN
> >             1 OSD(s) experiencing slow operations in BlueStore
>
> This warning state by default persists for a long time after it clears; I'm not sure why, but I like to set this lower:
>
> # ceph config dump | grep blue
> global   advanced   bluestore_slow_ops_warn_lifetime   300
>
> >             1 failed cephadm daemon(s)
> >             39 daemons have recently crashed
>
> That's a bit worrisome, what happened?
>
> `ceph crash ls`
>
> >             569 pgs not deep-scrubbed in time
> >             2609 pgs not scrubbed in time
>
> Scrubs don't happen during recovery; when it completes these should catch up.
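(For the record, the commands I assume I need for the items above - one for the BlueStore warning lifetime, and the crash-triage ones for the 39 recent crashes; corrections welcome:)

# ceph config set global bluestore_slow_ops_warn_lifetime 300
# ceph crash ls
# ceph crash info <crash-id>     (for each ID worth inspecting from the ls output)
# ceph crash archive-all         (clears the "recently crashed" warning once reviewed)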
> >   services:
> >     mon: 5 daemons, quorum ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
> >     mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi
>
> I'm paranoid and would suggest deploying at least one more mgr.
>
> >     mds: 19/19 daemons up, 7 standby
>
> Yikes why so many?
>
> >     osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>
> What happened 14 hours ago? It seems unusual for these durations to vary so much.
>
> >          2770 remapped pgs
>
> That's an indication of balancing or backfill in progress.
>
> >     flags noautoscale
> >
> >   data:
> >     volumes: 4/4 healthy
> >     pools:   16 pools, 7137 pgs
> >     objects: 256.82M objects, 484 TiB
> >     usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
> >     pgs:     575889786/1468742421 objects misplaced (39.210%)
>
> 39% is a lot of misplaced objects, this would be consistent with you having successfully added those OSDs.
> Here is where the factor of the most-full OSD comes in.
>
> Technically backfill is a subset of recovery, but in practice people usually think in terms:
>
> Recovery: PGs healing from OSDs having failed or been down
> Backfill: Rebalancing of data due to topology changes, including adjusted CRUSH rules, expansion, etc.
>
> >              4247 active+clean
> >              2763 active+remapped+backfill_wait
> >                77 active+clean+scrubbing
> >                43 active+clean+scrubbing+deep
> >                 7 active+remapped+backfilling
>
> Configuration options throttle how much backfill goes on in parallel to keep the cluster from DoSing itself. Here I suspect that you're running a recent release with the notorious mclock op scheduling shortcomings, which is a tangent.
>
> I suggest checking out these two resources re upmap-remapped.py :
>
> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering Ceph Operations with Upmap.pdf
>
> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>
> This tool, in conjunction with the balancer module, will do the backfill more elegantly with various benefits.
>
> > --- RAW STORAGE ---
> > CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
> > hdd_class  1.4 PiB  814 TiB  579 TiB  579 TiB   41.54
>
> I hope the formatting below comes through, makes it a lot easier to read a table.
>
> > --- POOLS ---
> > POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
> > .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
> > default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
> > default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
> > default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
> > default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
> > default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
> > default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
> > metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
> > ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
> > metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
> > hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
> > metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
> > metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
> > ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
> > nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB
>
> Do you have multiple EC RBD pools and/or multiple CephFSes?
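(There are four CephFS volumes - per "volumes: 4/4 healthy" above. A quick way to list them and to spot which pools are erasure-coded, assuming the usual "pool ls detail" output format:)

# ceph fs ls
# ceph osd pool ls detail | grep erasure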
> > ID   CLASS      WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> >  -1             2272.93311         -  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
> >  -7              254.54175         -  255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -          host ceph-host-1
>
> ...
>
> > 137  hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449  up      osd.137
> > 152  hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7  up      osd.152
> >                                 TOTAL  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64
> > MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>
> There ya go. osd.152 must be one of the new OSDs. Note that only 7 PGs are currently resident and that it holds just 4% of the average amount of data on the entire set of OSDs.
> Run the focused `osd df` above and that number will change slightly.
>
> Here is your least full hdd_class OSD:
>
> 151  hdd_class  18.19040  1.00000  18 TiB   38 GiB   37 GiB   6 KiB  1.1 GiB  18 TiB  0.20  0.01  1  up  osd.151
>
> And the most full:
>
> 180  hdd_class  18.19040  1.00000  18 TiB  198 GiB  197 GiB  10 KiB  1.7 GiB  18 TiB  1.07  0.03  5  up  osd.180
>
> I suspect that the most-full is at 107% of average due to the bolus of backfill and/or the balancer not being active. Using upmap-remapped as described above can help avoid this kind of overload.
>
> In a nutshell, the available space will gradually increase as data is backfilled, especially if you have the balancer enabled.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx