I'm fighting with a Ceph upgrade from 18.2.7 to 19.2.3. Once again the ceph-volume activate step is taking too long: the systemd service times out, the orch daemon is marked failed, and the upgrade halts (the OSD does eventually come up, but the daemon stays reported as dead).

I can reproduce the slow startup with "cephadm ceph-volume raw list". I don't use raw devices, but the ceph-volume activation method hardcodes checking raw first:
https://github.com/ceph/ceph/blob/4d5ad8c1ef04f38d14402f0d89f2df2b7d254c2c/src/ceph-volume/ceph_volume/activate/main.py#L46
That takes 6s on 18.2.7, but 4m32s on 19.2.3!

I have 42 spinning drives per host (with multipath). All of the time is spent in the new method self.exclude_lvm_osd_devices(). With the duplication from multipath and device-mapper names, the list of items to scan ends up at 308 entries in my setup.

With good old print debugging, I found that while the thread pool speeds things up a bit, it simply takes too long to construct all those Device() objects. Creating even a single Device() calls disk.get_devices() at least once, and because that list does not include all devices (it filters out things like "/dev/mapper/mpathxx"), the code regenerates the (same) device list whenever the path isn't found:

    if not sys_info.devices.get(self.path):
        sys_info.devices = disk.get_devices()

That forces the list to be regenerated more than 400 times: the initial 32 in parallel, followed by about 400 more lookups that will never match the device name.

In the end it's again O(n^2) computational time to list these raw devices with ceph-volume. With 32 threads in the pool, that now means running under heavy load for about 5 minutes before this trivial task completes, every time the daemon needs to start.
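
To make the pattern concrete, here's a standalone toy sketch of that lookup (this is not ceph-volume code; rebuild_device_list(), the sleep, and the hit/miss split are made up purely to illustrate the shape of the problem):

    import time

    def rebuild_device_list(n=308):
        """Stand-in for disk.get_devices(): each full rescan is expensive, and
        the result deliberately excludes mapper/multipath names, like the real
        list does."""
        time.sleep(0.01)  # simulated sysfs/udev scanning cost
        return {f"/dev/sd{i}": {"path": f"/dev/sd{i}"} for i in range(n)}

    def lookup_rebuild_on_miss(paths):
        """The pattern above: regenerate the list whenever a path isn't found.
        Filtered-out paths (/dev/mapper/mpathXX) miss every time, so each one
        pays for a full rescan -> O(n^2)."""
        devices = {}
        for p in paths:
            if not devices.get(p):
                devices = rebuild_device_list()

    def lookup_build_once(paths):
        """Build the list once up front; filtered-out paths simply stay misses
        instead of triggering another rescan -> O(n)."""
        devices = rebuild_device_list()
        for p in paths:
            devices.get(p)

    if __name__ == "__main__":
        # ~300 scan targets, half of them mapper names that never appear in the
        # generated list, roughly mirroring the 308 items in my setup
        paths = [f"/dev/mapper/mpath{i}" if i % 2 else f"/dev/sd{i}"
                 for i in range(308)]
        for fn in (lookup_rebuild_on_miss, lookup_build_once):
            t0 = time.time()
            fn(paths)
            print(f"{fn.__name__}: {time.time() - t0:.2f}s")

Generating the device list once (or memoizing disk.get_devices()) would keep this linear in the number of scan targets, instead of paying a full rescan for every path that was filtered out of the list.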