Re: squid 19.2.2 - added disk does not reflect in available capacity on pools

Hi,
Thanks Anthony - as always, a very useful and comprehensive response.

Yes, there were only 139 OSDs before, and raw capacity has indeed increased.
I also noticed that the "MAX AVAIL" column from "ceph df" is getting
higher (177 TB), so it seems the capacity is being added.

The drives used in the ssd_class are 7 TB Micron 5400s.

The failed daemon is the manager - I will follow your advice and add one more.
The BlueStore slow-ops warning lifetime (bluestore_slow_ops_warn_lifetime) was set to 600 - I will change it to 300.
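I assume something like this is the way to do it (a sketch, based on the global
setting shown in your config dump):

# ceph config set global bluestore_slow_ops_warn_lifetime 300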

The large number of MDS daemons is due to the fact that I have over 150 CephFS
clients, and I thought that deploying a daemon on each host for each filesystem
would provide better performance - was I wrong?

The high number of remapped PGs is due to my improper use of the
osd.all-available-devices unmanaged setting, so adding the new drives triggered
them being automatically detected and added as OSDs.

I am not sure what I did wrong, though - see below the output from "ceph orch
ls" taken before adding the drives.
Shouldn't setting it like that prevent automatic discovery?
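For reference, I believe unmanaged was originally applied with something along
these lines (a sketch from memory; the exact invocation may have differed):

# ceph orch apply osd --all-available-devices --unmanaged=true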

NAME                         PORTS                      RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                 ?:9093,9094                    1/1  7m ago     5w   count:1
ceph-exporter                                               7/7  10m ago    5w   *
crash                                                       7/7  10m ago    5w   *
grafana                      ?:3001                         1/1  7m ago     4w   count:1
ingress.rgw.default.default  192.168.122.239:1969,8080      4/4  10m ago    4w   ceph-host-4;ceph-host-5
loki                         ?:3100                         1/1  7m ago     4w   ceph-host-2;count:1
mds.hdd_ec_archive                                          5/5  10m ago    5w   ceph-host-1;ceph-host-5;ceph-host-4;ceph-host-6;ceph-host-7;count:5
mds.project_ec_ssd                                          7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mds.projects_rep_nvme                                       7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mds.projects_rep_ssd                                        7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
mgr                                                         2/2  7m ago     5w   count:2
mon                                                         5/5  8m ago     5w   count:5
node-exporter                ?:9100                         7/7  10m ago    5w   *
osd.all-available-devices                                     0  -          4w   <unmanaged>
osd.hdd_osds                                                 72  10m ago    5w   *
osd.nvme_osds                                                25  10m ago    5w   *
osd.ssd_osds                                                 84  10m ago    3w   ceph-host-1
prometheus                   ?:9095                         1/1  7m ago     5w   count:1
promtail                     ?:9080                         7/7  10m ago    4w   ceph-host-1;ceph-host-2;ceph-host-3;ceph-host-4;ceph-host-5;ceph-host-6;ceph-host-7;count:7
rgw.default.default          ?:80                           0/2  10m ago    4w   count-per-host:1;label:rgw

(truncated paste: 0.090000 is the misplaced ratio target, and the balancer
status shows "active": true)

The resources you mentioned are very useful.
Running upmap-remapped.py did bring the number of remapped PGs close to zero.

I just wanted to clarify the steps needed because, doing it as below, I
eventually end up again with lots of misplaced PGs and an "unhappy" balancer:

1. # ceph osd set norebalance
   # ceph balancer off

2. # upmap-remapped.py | sh

3. Raise target_max_misplaced_ratio above the default of 0.05
   (since we want to rebalance faster and client performance is not a huge issue)

4. Enable the balancer

5. Wait
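Put together, what I am running looks roughly like this (a sketch; it assumes
the balancer is in upmap mode, upmap-remapped.py is on the PATH, and 0.09 is
the ratio I chose):

# ceph osd set norebalance
# ceph balancer off
# upmap-remapped.py | sh
# ceph config set mgr target_max_misplaced_ratio 0.09
# ceph balancer on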

Doing it like this will eventually increase the number of misplaced PGs until
it goes above the ratio, at which point, I guess, the balancer stops:
( "optimize_result": "Too many objects (0.115742 > 0.090000) are
misplaced; try again later" )

Should I repeat the process whenever the number of misplaced objects goes above
the ratio, or what is the proper way of doing it?

Steven


On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

>
>
> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:
>
> Hi,
>
> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
>
>
> The ultimate answer to Ceph, the cluster, and everything!
>
> My expectation was that the pools configured to use "hdd_class" would
> have their capacity increased (e.g. default.rgw.buckets.data, which
> uses an EC 4+2 pool for data).
>
>
> First, did the raw capacity increase when you added these drives?
>
> --- RAW STORAGE ---
> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>
>
> Was the number of OSDs previously 139?
>
> It seems it is not happening ...yet ?!
> Is it because the peering is still going ?
>
>
> Ceph nomenclature can be mystifying at first.  And sometimes at thirteenth.
>
> Peering is daemons checking in with each other to ensure they’re in
> agreement.
>
> I think you mean backfill / balancing.
>
> The available space reported by “ceph df” for a *pool* is a function of:
>
> * Raw space available in the associated CRUSH rule’s device class (or if
> the rule isn’t ideal, all device classes)
> * The cluster’s three full ratios # ceph osd dump | grep ratio
> * The fullness of the single most-full OSD in the device class
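>
> For an EC 4+2 pool like default.rgw.buckets.data, for example, MAX AVAIL
> works out to roughly that projected raw headroom multiplied by k/(k+m) = 4/6,
> while a 3-replica pool divides it by 3.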
>
> BTW I learned only yesterday that you can restrict `ceph osd df` by
> specifying a device class, so try running
>
> `ceph osd df hdd_class | tail -10`
>
> Notably, this will show you the min/max variance among OSDs of just that
> device class, and the standard deviation.
> When you have multiple OSD sizes, these figures are much less useful when
> calculated across the whole cluster by “ceph osd df”
>
> # ceph osd df hdd
> ...
> 318    hdd  18.53969   1.00000   19 TiB   15 TiB   15 TiB   15 KiB   67 GiB  3.6 TiB  80.79  1.04  127      up
> 319    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB  936 KiB   60 GiB  3.7 TiB  79.87  1.03  129      up
> 320    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB   33 KiB   72 GiB  3.7 TiB  79.99  1.03  129      up
>  30    hdd  18.53969   1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB   11 GiB   15 TiB  17.55  0.23   26      up
>                          TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB   17 TiB  1.2 PiB  77.81
> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>
> You can even run this for a specific OSD so you don’t have to get creative
> with an egrep regex or exercise your pattern-matching skills, though the
> summary values naturally aren’t useful.
>
> # ceph osd df osd.30
> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
> 30    hdd  18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26      up
>                         TOTAL  19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>
> Here there’s a wide variation among the hdd OSDs because osd.30 had been
> down for a while and was recently restarted due to a host reboot, so it’s
> slowly filling with data.
>
>
> ssd_class     6.98630
>
>
> That seems like an unusual size, what are these? Are they SAN LUNs?
>
> Below are outputs from
> ceph -s
> ceph df
> ceph osd df tree
>
>
> Thanks for providing the needful up front.
>
>  cluster:
>    id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>    health: HEALTH_WARN
>            1 OSD(s) experiencing slow operations in BlueStore
>
>
> This warning state by default persists for a long time after it clears,
> I’m not sure why but I like to set this lower:
>
> # ceph config dump | grep blue
> global    advanced    bluestore_slow_ops_warn_lifetime    300
>
>
>
>            1 failed cephadm daemon(s)
>
>            39 daemons have recently crashed
>
>
> That’s a bit worrisome, what happened?
>
> `ceph crash ls`
>
>
>            569 pgs not deep-scrubbed in time
>            2609 pgs not scrubbed in time
>
>
> Scrubs don’t happen during recovery, when complete these should catch up.
>
>  services:
>    mon: 5 daemons, quorum
> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>    mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi
>
>
> I’m paranoid and would suggest deploying at least one more mgr.
>
>    mds: 19/19 daemons up, 7 standby
>
>
> Yikes why so many?
>
>    osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>
>
> What happened 14 hours ago?  It seems unusual for these durations to vary
> so much.
>
> 2770 remapped pgs
>
>
> That’s an indication of balancing or backfill in progress.
>
>         flags noautoscale
>
>  data:
>    volumes: 4/4 healthy
>    pools:   16 pools, 7137 pgs
>    objects: 256.82M objects, 484 TiB
>    usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>    pgs:     575889786/1468742421 objects misplaced (39.210%)
>
>
> 39% is a lot of misplaced objects, this would be consistent with you
> having successfully added those OSDs.
> Here is where the factor of the most-full OSD comes in.
>
> Technically backfill is a subset of recovery, but in practice people
> usually think of them in these terms:
>
> Recovery: PGs healing from OSDs having failed or been down
> Backfill: Rebalancing of data due to topology changes, including adjusted
> CRUSH rules, expansion, etc.
>
>
>             4247 active+clean
>             2763 active+remapped+backfill_wait
>             77   active+clean+scrubbing
>             43   active+clean+scrubbing+deep
>             7    active+remapped+backfilling
>
>
> Configuration options throttle how much backfill goes on in parallel to
> keep the cluster from DoSing itself.  Here I suspect that you’re running a
> recent release with the notorious mclock op scheduling shortcomings, which
> is a tangent.
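>
> For example, if the default mclock scheduler is in play, one knob that is
> often suggested (purely as an illustration, not necessarily something you
> should change) is:
>
> # ceph config set osd osd_mclock_profile high_recovery_ops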
>
>
> I suggest checking out these two resources re upmap-remapped.py :
>
> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering Ceph
> Operations with Upmap.pdf
>
> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>
>
>
> This tool, in conjunction with the balancer module, will do the backfill
> more elegantly with various benefits.
>
>
> --- RAW STORAGE ---
> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>
>
> I hope the formatting below comes through, makes it a lot easier to read a
> table.
>
>
> --- POOLS ---
> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
> ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB
>
>
> Do you have multiple EC RBD pools and/or multiple CephFSes?
>
>
> ID   CLASS       WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -1              2272.93311         -   2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
> -7               254.54175         -   255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -              host ceph-host-1
>
>
> ...
>
>
> 137   hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449      up          osd.137
> 152   hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7      up          osd.152
>                                 TOTAL  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64
> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>
>
> There ya go.  osd.152 must be one of the new OSDs.  Note that only 7 PGs
> are currently resident and that it holds just 4% of the average amount of
> data on the entire set of OSDs.
> Run the focused `osd df` above and that number will change slightly.
>
> Here is your least full hdd_class OSD:
>
> 151   hdd_class    18.19040   1.00000   18 TiB   38 GiB   37 GiB    6 KiB  1.1 GiB   18 TiB   0.20  0.01    1      up          osd.151
>
> And the most full:
>
> 180   hdd_class    18.19040   1.00000   18 TiB  198 GiB  197 GiB   10 KiB  1.7 GiB   18 TiB   1.07  0.03    5      up          osd.180
>
>
> I suspect that the most-full is at 107% of average due to the bolus of
> backfill and/or the balancer not being active.  Using upmap-remapped as
> described above can help avoid this kind of overload.
>
> In a nutshell, the available space will gradually increase as data is
> backfilled, especially if you have the balancer enabled.
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



