Hi,

I had 3 incomplete PGs that I marked as complete (mark-complete) because they were empty (I think I lost the data from them). One of them was recovery_unfound, and I ran mark_unfound_lost revert on it. But I now have between 5 and 25 deep-scrubbing PGs, and I believe this is not normal? (It has been like this for 5 days.)

Vivien

________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, August 1, 2025 15:58:22
To: GLE, Vivien
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Pgs troubleshooting

Don't worry, I just wanted to point out that careful reading is crucial. :-)
So you got the OSDs back up, but were you also able to recover the PG?

Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:

> I lost all perspective and didn't read this message carefully.
> Sorry for that.
>
> Thanks for your help, I'm very grateful.
>
> Vivien
>
> ________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Friday, August 1, 2025 15:27:56
> To: GLE, Vivien
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: Pgs troubleshooting
>
> That's why I mentioned this two days ago:
>
> cephadm shell -- ceph-objectstore-tool --op list …
>
> That's how you can execute commands directly with cephadm shell; this
> is useful for batch operations like a for loop or similar. Of course,
> first entering the shell and then executing commands works just as well.
>
> Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:
>
>> I was using ceph-objectstore-tool the wrong way, by running it on the host
>> instead of inside the container via cephadm shell --name osd.x.
>>
>> ________________________________
>> From: GLE, Vivien <Vivien.GLE@xxxxxxxx>
>> Sent: Friday, August 1, 2025 09:02:59
>> To: Eugen Block
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Pgs troubleshooting
>>
>> Hi,
>>
>> What is the right way of using the objectstore tool?
>>
>> My OSDs are up! I purged ceph-* on my host following this thread:
>> https://www.reddit.com/r/ceph/comments/1me3kvd/containerized_ceph_base_os_experience/
>>
>> "Make sure that the base OS does not have any ceph packages
>> installed, with Ubuntu in the past had issues with ceph-common being
>> installed on the host OS and it trying to take ownership of the
>> containerized ceph deployment. If you run into any issues check the
>> base OS for ceph-* packages and uninstall."
>>
>> I believe the only good way to use ceph commands is inside cephadm.
>>
>> Thanks for your help!
>>
>> ________________________________
>> From: Eugen Block <eblock@xxxxxx>
>> Sent: Thursday, July 31, 2025 19:42:21
>> To: GLE, Vivien
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Re: Pgs troubleshooting
>>
>> To use the objectstore tool within the container you don't have to
>> specify the cluster's FSID because it's mapped into the container. By
>> using the objectstore tool you might have changed the ownership of the
>> directory; change it back to the previous state. Other OSDs will show
>> you which uid/user and/or gid/group that is.
>>
>> Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:
>>
>>> I'm sorry for the confusion!
>>>
>>> I pasted the wrong output.
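For reference, the direct invocation Eugen describes above can be wrapped in a loop for batch listing. This is only a sketch: the PG IDs and osd.1 are illustrative, and the OSD has to be stopped before ceph-objectstore-tool touches its store.

# Hypothetical batch listing of objects in a few PGs of a stopped OSD,
# run from the host via cephadm (the PG IDs below are examples only).
for pg in 11.4 2.1; do
    cephadm shell --name osd.1 -- \
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
        --op list --pgid "$pg" --no-mon-config
done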
>>> ceph-objectstore-tool --data-path /var/lib/ceph/Id/osd.1 --op list --pgid 11.4 --no-mon-config
>>>
>>> OSD.1 log:
>>>
>>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 set uid:gid to 167:167 (ceph:ceph)
>>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable), process ceph-osd, pid 7
>>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 pidfile_write: ignore empty --pid-file
>>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 1 bdev(0x57bd64210e00 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
>>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 bdev(0x57bd64210e00 /var/lib/ceph/osd/ceph-1/block) open open got: (13) Permission denied
>>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or directory
>>>
>>> ----------------------
>>>
>>> I retried on OSD.2 with PG 2.1, to see if disabling OSD.2 instead of just stopping it before the objectstore-tool operation would change something, but the same error occurred.
>>>
>>> ________________________________
>>> From: Eugen Block <eblock@xxxxxx>
>>> Sent: Thursday, July 31, 2025 13:27:51
>>> To: GLE, Vivien
>>> Cc: ceph-users@xxxxxxx
>>> Subject: Re: Re: Pgs troubleshooting
>>>
>>> Why did you look at OSD.2? According to the query output you provided
>>> I would have looked at OSD.1 (acting set). And you pasted the output
>>> of PG 11.4, but now you're trying to list PG 2.1, which is quite confusing.
>>>
>>> Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:
>>>
>>>> I don't get why it is searching in this path, because there is nothing
>>>> there. This is the command I used to check BlueStore:
>>>>
>>>> ceph-objectstore-tool --data-path /var/lib/ceph/"ID"/osd.2 --op list --pgid 2.1 --no-mon-config
>>>>
>>>> ________________________________
>>>> From: GLE, Vivien
>>>> Sent: Thursday, July 31, 2025 09:38:25
>>>> To: Eugen Block
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: RE: Re: Pgs troubleshooting
>>>>
>>>> Hi,
>>>>
>>>>> Or could reducing min_size to 1 help here (Thanks, Anthony)? I'm not
>>>>> entirely sure and am on vacation. 😅 It could be worth a try. But don't
>>>>> forget to reset min_size back to 2 afterwards.
>>>>
>>>> I did, but nothing really changed. How long should I wait to
>>>> see if it does something?
>>>>
>>>>> No, you use the ceph-objectstore-tool to export the PG from the intact
>>>>> OSD (you need to stop it though, and set the noout flag), and make sure
>>>>> you have enough disk space.
>>>>
>>>> I stopped my OSD and set noout to check whether my PG is stored in
>>>> BlueStore (it is not), but when I tried to restart my OSD, the OSD
>>>> superblock was gone:
>>>>
>>>> 2025-07-31T08:33:14.696+0000 7f0c7c889680 1 bdev(0x60945520ae00 /var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
>>>> 2025-07-31T08:33:14.697+0000 7f0c7c889680 -1 bdev(0x60945520ae00 /var/lib/ceph/osd/ceph-2/block) open open got: (13) Permission denied
>>>> 2025-07-31T08:33:14.697+0000 7f0c7c889680 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-2: (2) No such file or directory
>>>>
>>>> Did I miss something?
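The "Permission denied" followed by the missing-superblock error in the logs above matches what Eugen points out further up the thread: running the objectstore tool outside the container can leave the OSD directory with the wrong owner. A minimal sketch of the check and fix, assuming the cephadm layout used in this thread; the 167:167 (ceph:ceph) owner comes from the OSD log above, and the OSD IDs are simply the ones from this thread.

# On the OSD host: compare with a healthy OSD's directory, then restore ownership.
fsid=$(ceph fsid)                            # cluster FSID used in the cephadm data path
ls -ln /var/lib/ceph/$fsid/osd.1             # a healthy directory is owned by 167:167 (ceph:ceph)
chown -R 167:167 /var/lib/ceph/$fsid/osd.2   # put the affected OSD's directory back
ceph orch daemon restart osd.2               # then start the daemon again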
>>>> Thanks,
>>>> Vivien
>>>>
>>>> ________________________________
>>>> From: Eugen Block <eblock@xxxxxx>
>>>> Sent: Wednesday, July 30, 2025 16:56:50
>>>> To: GLE, Vivien
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: Re: Pgs troubleshooting
>>>>
>>>> Or could reducing min_size to 1 help here (Thanks, Anthony)? I'm not
>>>> entirely sure and am on vacation. 😅 It could be worth a try. But don't
>>>> forget to reset min_size back to 2 afterwards.
>>>>
>>>> Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:
>>>>
>>>>> Hi,
>>>>>
>>>>>> did the two replaced OSDs fail at the same time (before they were
>>>>>> completely drained)? This would most likely mean that both those
>>>>>> failed OSDs contained the other two replicas of this PG
>>>>>
>>>>> Unfortunately yes.
>>>>>
>>>>>> This would most likely mean that both those
>>>>>> failed OSDs contained the other two replicas of this PG. A pg query
>>>>>> should show which OSDs are missing.
>>>>>
>>>>> If I understand correctly, I need to move my PG onto OSD 1?
>>>>>
>>>>> ceph -w
>>>>>
>>>>> osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost
>>>>>
>>>>> ceph pg query 11.4
>>>>>
>>>>>     "up": [
>>>>>         1,
>>>>>         4,
>>>>>         5
>>>>>     ],
>>>>>     "acting": [
>>>>>         1,
>>>>>         4,
>>>>>         5
>>>>>     ],
>>>>>     "avail_no_missing": [],
>>>>>     "object_location_counts": [
>>>>>         {
>>>>>             "shards": "3,4,5",
>>>>>             "objects": 2
>>>>>         }
>>>>>     ],
>>>>>     "blocked_by": [],
>>>>>     "up_primary": 1,
>>>>>     "acting_primary": 1,
>>>>>     "purged_snaps": []
>>>>> },
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vivien
>>>>>
>>>>> ________________________________
>>>>> From: Eugen Block <eblock@xxxxxx>
>>>>> Sent: Tuesday, July 29, 2025 16:48:41
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject: Re: Pgs troubleshooting
>>>>>
>>>>> Hi,
>>>>>
>>>>> did the two replaced OSDs fail at the same time (before they were
>>>>> completely drained)? This would most likely mean that both those
>>>>> failed OSDs contained the other two replicas of this PG. A pg query
>>>>> should show which OSDs are missing.
>>>>> You could try with the objectstore-tool to export the PG from the
>>>>> remaining OSD and import it on different OSDs. Or you can mark the data
>>>>> as lost if you don't care about the data and want a healthy state quickly.
>>>>>
>>>>> Regards,
>>>>> Eugen
>>>>>
>>>>> Quoting "GLE, Vivien" <Vivien.GLE@xxxxxxxx>:
>>>>>
>>>>>> Thanks for your help!
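The export/import route Eugen outlines above could look roughly like the sketch below. Assumptions to be loud about: osd.1 is used as the source because the pg query above shows it as acting primary for PG 11.4, osd.6 as the destination is purely hypothetical, the orchestrator stop/start steps and the --mount path are mine rather than from the thread, and each OSD must stay down while the tool runs against it.

# Export PG 11.4 from the stopped source OSD, then import it on a stopped destination OSD.
# /tmp/pg-export on the host is mapped into the container so the export file is reachable.
mkdir -p /tmp/pg-export
ceph osd set noout
ceph orch daemon stop osd.1
cephadm shell --name osd.1 --mount /tmp/pg-export:/mnt/export -- \
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --op export --pgid 11.4 --file /mnt/export/pg11.4.export --no-mon-config
ceph orch daemon start osd.1
ceph orch daemon stop osd.6
cephadm shell --name osd.6 --mount /tmp/pg-export:/mnt/export -- \
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
    --op import --file /mnt/export/pg11.4.export --no-mon-config
ceph orch daemon start osd.6
ceph osd unset noout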
>>>>>> This is my new pg stat, with no more peering PGs (after rebooting some OSDs):
>>>>>>
>>>>>> ceph pg stat ->
>>>>>>
>>>>>> 498 pgs: 1 active+recovery_unfound+degraded, 3 recovery_unfound+undersized+degraded+remapped+peered, 14 active+clean+scrubbing+deep, 480 active+clean;
>>>>>> 36 GiB data, 169 GiB used, 6.2 TiB / 6.4 TiB avail; 8.8 KiB/s rd, 0 B/s wr, 12 op/s; 715/41838 objects degraded (1.709%); 5/13946 objects unfound (0.036%)
>>>>>>
>>>>>> ceph pg ls recovery_unfound -> shows that the PGs are replica 3; I tried to repair them but nothing happened.
>>>>>>
>>>>>> ceph -w ->
>>>>>>
>>>>>> osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost
>>>>>>
>>>>>> ________________________________
>>>>>> From: Frédéric Nass <frederic.nass@xxxxxxxxx>
>>>>>> Sent: Tuesday, July 29, 2025 14:03:37
>>>>>> To: GLE, Vivien
>>>>>> Cc: ceph-users@xxxxxxx
>>>>>> Subject: Re: Pgs troubleshooting
>>>>>>
>>>>>> Hi Vivien,
>>>>>>
>>>>>> Unless you ran the 'ceph pg stat' command while peering was occurring, the
>>>>>> 37 peering PGs might indicate a temporary peering issue with one or
>>>>>> more OSDs. If that's the case, then restarting the associated OSDs could
>>>>>> help with the peering. You could list those PGs and the
>>>>>> associated OSDs with 'ceph pg ls peering' and trigger peering by
>>>>>> either restarting one common OSD or by using 'ceph pg repeer <pg_id>'.
>>>>>>
>>>>>> Regarding the unfound object and its associated backfill_unfound PG,
>>>>>> you could identify this PG with 'ceph pg ls backfill_unfound' and
>>>>>> investigate it with 'ceph pg <pg_id> query'. Depending on the
>>>>>> output, you could try running 'ceph pg repair <pg_id>'. Could you
>>>>>> confirm that this PG is not part of a size=2 pool?
>>>>>>
>>>>>> Best regards,
>>>>>> Frédéric.
>>>>>>
>>>>>> --
>>>>>> Frédéric Nass
>>>>>> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
>>>>>> Try our Ceph Analyzer -- https://analyzer.clyso.com/
>>>>>> https://clyso.com | frederic.nass@xxxxxxxxx
>>>>>>
>>>>>> On Tue, Jul 29, 2025 at 14:19, GLE, Vivien <Vivien.GLE@xxxxxxxx> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> After replacing 2 OSDs (data corruption), these are the stats of my test Ceph cluster:
>>>>>>
>>>>>> ceph pg stat
>>>>>>
>>>>>> 498 pgs: 37 peering, 1 active+remapped+backfilling, 1 active+clean+remapped, 1 active+recovery_wait+undersized+remapped, 1 backfill_unfound+undersized+degraded+remapped+peered, 1 remapped+peering, 12 active+clean+scrubbing+deep, 1 active+undersized, 442 active+clean, 1 active+recovering+undersized+remapped
>>>>>> 34 GiB data, 175 GiB used, 6.2 TiB / 6.4 TiB avail; 1.7 KiB/s rd, 1 op/s; 31/39768 objects degraded (0.078%); 6/39768 objects misplaced (0.015%); 1/13256 objects unfound (0.008%)
>>>>>>
>>>>>> ceph osd stat
>>>>>>
>>>>>> 7 osds: 7 up (since 20h), 7 in (since 20h); epoch: e427538; 4 remapped pgs
>>>>>>
>>>>>> Does anyone have an idea of where to start to get back to a healthy cluster?
>>>>>>
>>>>>> Thanks!
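Spelled out, the triage sequence Frédéric suggests above looks roughly like this. PG 11.4 is only used as an example (it is the PG from the rest of the thread); substitute whatever the ls commands actually report.

ceph pg ls peering               # list the stuck PGs and the OSDs they map to
ceph pg repeer 11.4              # or restart one OSD common to several stuck PGs
ceph pg ls backfill_unfound      # identify the PG holding the unfound object
ceph pg 11.4 query               # inspect it (recovery state, might_have_unfound)
ceph osd pool ls detail          # confirm the affected pool is not size=2
ceph pg repair 11.4              # only if the query output points that way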
>>>>>>
>>>>>> Vivien

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
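Coming back to the open question at the very top of the thread, whether having 5-25 PGs in deep scrub for days on end is something to worry about, a quick way to see what is actually scrubbing; the grep pattern matches the state strings shown in the pg stat output above.

ceph pg ls | grep 'scrubbing+deep'   # which PGs are deep scrubbing right now, and their full state
ceph -s                              # overall health plus the scrub/recovery summary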