Le 2025-08-18 17:00, Gilles Mocellin a écrit :
Le 2025-08-18 16:21, Anthony D'Atri a écrit :
Yes I have zapped all drives before each try...
Did you subsequently check for success with `ceph device ls` and
`lsblk`?
I've found that sometimes the orch zap doesn't succeed fully and one
must manually stop and remove LVMs before the drive can be truly
zapped.
I think yes, but I will retry.
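For reference, the manual cleanup Anthony describes usually looks something like this (a sketch only; /dev/sdX and the VG name are placeholders, check the lsblk / ceph device ls output on each host first):

```shell
# Placeholder device /dev/sdX -- verify with `lsblk` and `ceph device ls` first.
# ceph-volume's own zap also tears down the LVs it created:
ceph-volume lvm zap --destroy /dev/sdX

# If LVs/VGs are still left behind, remove them manually:
lvs                        # list leftover ceph-* LVs
vgremove -f <ceph-vg-name> # placeholder: VG name taken from the lvs output
pvremove /dev/sdX
wipefs -a /dev/sdX
```
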
Another thing I've never done before is using encryption.
Perhaps it adds delays with my config, leading to timeouts...
Does anyone know which exact ceph-volume command is launched by such a
spec file, in case I want to run it manually?
service_type: osd
service_id: throughput_optimized
service_name: osd.throughput_optimized
placement:
  host_pattern: '*'
spec:
  unmanaged: false
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  encrypted: true
  filter_logic: AND
  objectstore: bluestore
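As far as I know, cephadm translates such a spec into a ceph-volume lvm batch call on each host; roughly something like the following (the device paths are examples and I may be missing flags, so treat it as a sketch):

```shell
# Rough manual equivalent of the spec above (device paths are examples):
#   data_devices rotational:1 -> the spinning disks
#   db_devices   rotational:0 -> the SSD/NVMe DB devices
#   encrypted: true           -> --dmcrypt
ceph-volume lvm batch --bluestore --dmcrypt \
    /dev/sda /dev/sdb /dev/sdc \
    --db-devices /dev/nvme0n1 \
    --report   # dry run: show the plan without creating anything
```

Dropping --report would actually create the OSDs; the exact command cephadm ran can also be found in the cephadm logs (ceph log last cephadm).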
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
Hi,
New try, from scratch, with every device clean (no physical volume).
I've found this issue and these recommendations:
https://access.redhat.com/solutions/6545511
https://www.ibm.com/docs/en/storage-ceph/8.0.0?topic=80-bug-fixes
So I set a higher timeout for cephadm commands:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
The default was 900.
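To double-check the value actually in effect (option names can be verified with ceph config help), something like:

```shell
# Verify the new timeout is visible to the mgr:
ceph config get mgr mgr/cephadm/default_cephadm_command_timeout

# A mgr failover may be needed for the orchestrator to pick it up
# (an assumption on my side -- it might also apply immediately):
ceph mgr fail
```
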
I can see the orchestrator launching ceph-volume commands with a timeout
of 1795 (it was 895 before; I don't know why it's 5 s less than the
configured value...).
But I made that change after creating the OSD spec, so some daemons
had already failed/timed out:
root@fidcl-lyo1-sto-sds-lab-01:~# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.throughput_optimized; 7 failed cephadm daemon(s); noout flag(s) set
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.throughput_optimized
    osd.throughput_optimized: Command timed out on host cephadm deploy (osd daemon) (default 1800 second timeout)
[WRN] CEPHADM_FAILED_DAEMON: 7 failed cephadm daemon(s)
    daemon osd.115 on fidcl-lyo1-sto-sds-lab-01 is in unknown state
    daemon osd.24 on fidcl-lyo1-sto-sds-lab-02 is in unknown state
    daemon osd.116 on fidcl-lyo1-sto-sds-lab-03 is in unknown state
    daemon osd.118 on fidcl-lyo1-sto-sds-lab-04 is in unknown state
    daemon osd.23 on fidcl-lyo1-sto-sds-lab-05 is in unknown state
    daemon osd.14 on fidcl-lyo1-sto-sds-lab-06 is in unknown state
    daemon osd.117 on fidcl-lyo1-sto-sds-lab-07 is in unknown state
As in the Red Hat article, I ran the following on every host:
systemctl daemon-reload
systemctl reset-failed
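In case it helps anyone, one way to run that across all hosts at once (assuming SSH access from the admin node and jq installed; both are assumptions about the setup):

```shell
# Iterate over the hosts known to the orchestrator (requires jq and SSH access):
for h in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
    ssh "$h" 'systemctl daemon-reload && systemctl reset-failed'
done
```
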
But nothing changed until the cephadm OSD spec deployment finished.
Then everything looks normal in ceph health, but I only have 82 OSDs up
out of 119:
root@fidcl-lyo1-sto-sds-lab-01:~# ceph -s
  cluster:
    id:     46030d0e-7d08-11f0-a50b-246e96bd90a4
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 5 daemons, quorum fidcl-lyo1-sto-sds-lab-01,fidcl-lyo1-sto-sds-lab-02,fidcl-lyo1-sto-sds-lab-03,fidcl-lyo1-sto-sds-lab-05,fidcl-lyo1-sto-sds-lab-04 (age 78m)
    mgr: fidcl-lyo1-sto-sds-lab-01.ymlinv(active, since 89m), standbys: fidcl-lyo1-sto-sds-lab-02.otnpcx, fidcl-lyo1-sto-sds-lab-03.zasagv
    osd: 119 osds: 82 up (since 38m), 119 in (since 62m)
         flags noout

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   1.8 TiB used, 90 TiB / 91 TiB avail
    pgs:     1 active+clean
The OSDs are only visible in the ceph osd ls output and in the dashboard.
The daemons are not started, but the PVs/VGs/LVs are created.
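Since the LVs already exist, it might be worth asking cephadm to adopt and start them instead of recreating everything; a sketch (the hostname is just one of those from the output above):

```shell
# Inspect what ceph-volume sees on a host:
cephadm ceph-volume lvm list

# Ask cephadm to activate the existing OSDs it finds on that host:
ceph cephadm osd activate fidcl-lyo1-sto-sds-lab-01
```
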
I will drop dmcrypt; I think that with LV tags it's really easier to
find the link between an OSD and its LVs...