I have now started iterating over all OSDs in the tree, and some of the OSDs
are completely unresponsive:
[18:27:18] black1.place6:~# for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do echo $osd; ceph tell $osd injectargs '--osd-max-backfills 1'; done
osd.20
osd.56
osd.62
osd.63
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 1266, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1182, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1459, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1329, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1361, in send_command
    cluster.osd_command, osdid, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1311, in run_in_thread
    t.join(timeout=timeout)
  File "/usr/lib/python3.7/threading.py", line 1036, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
osd.64
osd.65
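
To keep a single unresponsive OSD from blocking the whole loop, I am now
thinking of wrapping each tell in a timeout - a rough sketch, assuming GNU
coreutils timeout is available on the mon host:

for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do
    echo "$osd"
    # give each OSD 15s to answer, then move on to the next one
    timeout 15 ceph tell "$osd" injectargs '--osd-max-backfills 1' \
        || echo "$osd did not respond within 15s"
done
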
What's the best way to figure out why osd.63 does not react to the tell
command?
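
So far the only idea I have is to query the admin socket directly on the host
that runs osd.63, to see whether the daemon reacts there at all - a sketch,
assuming the default admin socket path:

# on the host running osd.63
ceph daemon osd.63 status
ceph daemon osd.63 dump_ops_in_flight
# or via the socket path explicitly
ceph --admin-daemon /var/run/ceph/ceph-osd.63.asok config get osd_max_backfills
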
Best regards,
Nico
Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:
> Hello Stefan,
>
> Stefan Kooman <stefan@xxxxxx> writes:
>
>> Hi,
>>
>>> However, as soon as we issue either of the above tell commands, it just
>>> hangs. Furthermore, when ceph tell hangs, PGs also become stuck in
>>> "Activating" and "Peering" states.
>>>
>>> It seems to be related, as soon as we stop ceph tell (ctrl-c it), a few
>>> minutes later the pgs are peered/active.
>>>
>>> We can also reproduce this problem with very busy OSDs that have been
>>> moved to another host - they do not react to the ceph tell commands either.
>>
>> Does this also happen when you issue an OSD-specific "tell", i.e. ceph
>> tell osd.13 injectargs '--osd-max-backfills 4'
>>
>> Does this also happen when you loop over it one by one?
>
> It does hang for some of them, but if I "ping" / select specific OSDs,
> this does not happen.
>
>>> Did anyone see this before and/or do you have a hint on how to debug
>>> ceph tell as it is not a daemon on its own?
>>
>> IIRC I have seen this, but not in combination with PGs peering /
>> activating. Has the config change become effective on all OSDs? Verify
>> with ceph daemon osd.13 config get osd_max_backfills (for all OSDs)
>
> Just checked - most OSDs did not apply the new setting; setting it
> explicitly on them works, however.
>
> Best regards,
>
> Nico
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx