Re: ceph-ansible LARGE OMAP in RGW pool

Hi Frédéric,

Thank you for checking in.

There was some OSD replacement activity on the same Ceph cluster, hence I
haven't tried this yet.

I will try these steps next week. However, I was able to overwrite the
cursor.png file in the master zone and then re-initiated the sync.

Previously, 4 shards were recovering; after overwriting the file and
reinitializing the sync, only 3 shards remain in a recovering state.
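
For reference, the overwrite and follow-up check were roughly along these
lines (the s3cmd target path and shard id here are illustrative, taken from
the earlier error logs):

$ s3cmd put cursor.png s3://mod-backup/wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png
$ radosgw-admin sync status
$ radosgw-admin sync error list --shard-id=31 | grep -c cursor.png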

I will update you about this next week once our activity is completed.

Regards,
Danish



On Thu, Apr 3, 2025 at 1:47 PM Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
wrote:

> Hi Danish,
>
> I was wondering if you've sorted it out? Let us know.
>
> Regards,
> Frédéric.
>
> ----- On 26 Mar 25, at 12:44, Frédéric Nass frederic.nass@xxxxxxxxxxxxxxxx
> wrote:
>
> > Hi Danish,
> >
> > The "unable to find head object data pool for..." could be an incorrect
> warning
> > since it pops out for 'most of the objects'. [1]
> >
> > Regarding the only object named 'cursor.png' that fails to sync, one
> thing you
> > could try (since you can't delete it with an s3 client) is to rewrite it
> with
> > an s3 client (copy) and then retry the delete.
> > If it fails with an s3 client, you could try with 'radosgw-admin object
> put'
> > and/or 'radosgw-admin object rm'.
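> >
> > For instance, something along these lines (just a sketch; adapt the bucket
> > name and object key):
> >
> > $ radosgw-admin object stat --bucket=<bucket> --object='wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png'
> > $ radosgw-admin object rm --bucket=<bucket> --object='wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png'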
> >
> > If that still fails, then here's what you can do (to at least deal with
> > the bucket synchronization issue) to remove this object from the index:
> >
> > 1/ Set bucket_name, index_pool_name, and bucket_id (jq command is
> required)
> >
> > $ bucket_name="bucket-test"
> > $ index_pool_name=".rgw.sme.index"
> > $ bucket_id=$(radosgw-admin bucket stats --bucket=${bucket_name} | jq -r
> .id)
> >
> > 2/ Retrieve all index shards along with their omap keys
> >
> > $ mkdir "$bucket_id"
> > $ for i in $(rados -p $index_pool_name ls | grep "$bucket_id"); do echo
> $i ;
> > rados -p $index_pool_name listomapkeys $i > "${bucket_id}/${i}" ; done
> >
> > 3/ Identify in which shard the 'cursor.png' object is listed (be sure to
> > identify the right object; you may have several WP sites using the same
> > image...)
> >
> > $ grep 'cursor.png' ${bucket_id}/.dir.${bucket_id}* | sed -e
> > "s/^${bucket_id}\///g" > remove_from_index.txt
> >
> > 4/ Make sure the remove_from_index.txt file only has one line,
> > corresponding to the object you want to remove from the index:
> >
> > $ cat remove_from_index.txt
> >
> .dir.0f448533-3c6c-4cb8-bde9-c9763ac17751.738183.1.6:48/wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png
> >
> > 5/ Remove the object from the index shard
> >
> > $ while IFS=':' read -r object key ; do echo "Removing Key ${key}" ; rados
> > -p ${index_pool_name} rmomapkey "${object}" "${key}" ; done <
> > remove_from_index.txt
> >
> > Restart both RGWs and check the sync state again.
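> >
> > For example (the systemd unit name below is deployment-specific; adapt the
> > instance name to your ceph-ansible setup):
> >
> > $ systemctl restart ceph-radosgw@rgw.<instance>    # on each RGW node
> > $ radosgw-admin sync status
> > $ radosgw-admin bucket sync status --bucket=${bucket_name}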
> >
> > Next, you might want to check for inconsistencies between the index and
> the
> > actual data. You could use the rgw-orphan-list script for this [2]. And
> of
> > course, upgrade your cluster.
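> >
> > A rough sketch of that check (as I recall, the script prompts for the RGW
> > data pool and writes the orphan list to files in the working directory):
> >
> > $ rgw-orphan-list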
> >
> > Hope this helps,
> >
> > Regards,
> > Frédéric.
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=2126787
> > [2] https://www.ibm.com/docs/en/storage-ceph/8.0?topic=management-finding-orphan-leaky-objects
> >
> > ----- On 26 Mar 25, at 6:12, Danish Khan <danish52.jmi@xxxxxxxxx> wrote:
> >
> >
> >
> > Dear Frédéric,
> >
> > Unfortunately, I am still using the Octopus version and these commands are
> > reported as unrecognized.
> >
> > Versioning is also not enabled on the bucket.
> > I tried running:
> > radosgw-admin bucket check --bucket=<bucket> --fix
> >
> > which ran for a few minutes and gave a lot of output, containing the lines
> > below for most of the objects:
> > WARNING: unable to find head object data pool for
> >
> "<bucket>:wp-content/uploads/sites/74/2025/03/mutation-house-no-629.pdf",
> not
> > updating version pool/epoch
> >
> > Is this issue fixable in Octopus, or should I plan to upgrade the Ceph
> > cluster to Quincy?
> >
> > Regards,
> > Danish
> >
> >
> > On Wed, Mar 26, 2025 at 2:41 AM Frédéric Nass
> > <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> >
> >
> > Hi Danish,
> >
> > Can you specify the version of Ceph used and whether versioning is
> enabled on
> > this bucket?
> >
> > There are 2 ways to clean up orphan entries in a bucket index that I'm
> > aware of:
> >
> > - One (the preferable way) is to rely on the radosgw-admin command to
> > check and hopefully fix the issue, cleaning the orphan entries out of the
> > index or even rebuilding the index entirely if necessary.
> >
> > New radosgw-admin commands have been added recently [1] to clean up
> > leftover OLH index entries and unlinked instance objects within versioned
> > buckets.
> >
> > If this bucket is versioned, I would advise you to try running the new
> > check/fix commands mentioned in this release note [2]:
> >
> > radosgw-admin bucket check unlinked [--fix]
> > radosgw-admin bucket check olh [--fix]
> >
> > - Another one (as a second chance) is to act at the rados layer,
> > identifying in which shard the orphan index entry is listed (listomapkeys)
> > and removing it from the specified shard (rmomapkey). I could elaborate on
> > that later if needed.
> >
> > Regards,
> > Frédéric.
> >
> > [1] https://tracker.ceph.com/issues/62075
> > [2] https://ceph.io/en/news/blog/2023/v18-2-1-reef-released/
> >
> >
> >
> >
> > From: Danish Khan <danish52.jmi@xxxxxxxxx>
> > Sent: Tuesday, 25 March 2025 17:16
> > To: Frédéric Nass
> > Cc: ceph-users
> > Subject: Re: ceph-ansible LARGE OMAP in RGW pool
> >
> > Hi Frédéric,
> >
> > Thank you for replying.
> >
> > I followed the steps mentioned in https://tracker.ceph.com/issues/62845
> > and was able to trim all the errors.
> >
> > Everything seemed to be working fine until the same error appeared again.
> >
> > I still suspect the main culprit of this issue is one missing object, and
> > all the errors are about this object only.
> >
> > I am able to list this object using the s3cmd tool, but I am unable to
> > perform any action on it; I can't even delete it, overwrite it, or get it.
> >
> > I tried stopping the RGWs one by one, and even tried after stopping all of
> > them, but recovery still does not complete.
> >
> > And the LARGE OMAP is now only increasing.
> >
> > Is there a way I can delete it from the index, or directly from the pool
> > on the Ceph side, so that it doesn't try to recover it?
> >
> > Regards,
> > Danish
> >
> >
> >
> > On Tue, Mar 25, 2025 at 11:29 AM Frédéric Nass
> > <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> >
> >
> > Hi Danish,
> >
> > While reviewing the backports for upcoming v18.2.5, I came across this
> [1].
> > Could be your issue.
> >
> > Can you try the suggested workaround (--marker=9) and report back?
> >
> > Regards,
> > Frédéric.
> >
> > [1] https://tracker.ceph.com/issues/62845
> >
> >
> > From: Danish Khan <danish52.jmi@xxxxxxxxx>
> > Sent: Friday, 14 March 2025 23:11
> > To: Frédéric Nass
> > Cc: ceph-users
> > Subject: Re: ceph-ansible LARGE OMAP in RGW pool
> >
> > Dear Frédéric,
> >
> > 1/ Identify the shards with the most sync error log entries:
> >
> > I have identified that the shard causing the issue is shard 31, but almost
> > all the errors show only one object of a bucket. The object exists in the
> > master zone, but I'm not sure why the replication site is unable to sync
> > it.
> >
> > 2/ For each shard, list every sync error log entry along with their ids:
> >
> > radosgw-admin sync error list --shard-id=X
> >
> > The output of this command mostly shows the same shard and the same object
> > (shard 31 and object
> > /plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png)
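> >
> > To get a rough count of how often that single object shows up, I checked
> > with something like:
> >
> > $ radosgw-admin sync error list --shard-id=31 --max-entries=100000 | grep -c cursor.png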
> >
> > 3/ Remove them **except the last one** with:
> >
> > radosgw-admin sync error trim --shard-id=X
> > --marker=1_1682101321.201434_8669.1
> >
> > Trimming did remove a few entries from the error log, but there are still
> > many error log entries for the same object which I am unable to trim.
> >
> > Now the trim command executes successfully but doesn't do anything.
> >
> > I am still getting errors in the radosgw log about the object that is not
> > syncing:
> >
> > 2025-03-15T03:05:48.060+0530 7fee2affd700 0
> >
> RGW-SYNC:data:sync:shard[80]:entry[mbackup:70134e66-872072ee2d32.2205852207.1:48]:bucket_sync_sources[target=:[]):source_bucket=:[]):source_zone=872072ee2d32]:bucket[mbackup:70134e66-872072ee2d32.2205852207.1:48<-mod-backup:70134e66-872072ee2d32.2205852207.1:48]:full_sync[mod-backup:70134e66-872072ee2d32.2205852207.1:48]:entry[wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png]:
> > ERROR: failed to sync object:
> >
> mbackup:70134e66-872072ee2d32.2205852207.1:48/wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png
> >
> > I have been getting this error for approximately two months, and if I
> > remember correctly, we have been getting the LARGE OMAP warning since then.
> >
> > I will try to delete this object from the Master zone on Monday and will
> see if
> > this fixes the issue.
> >
> > Do you have any other suggestions I should consider?
> >
> > Regards,
> > Danish
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 13, 2025 at 6:07 PM Frédéric Nass
> > <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Danish,
> >
> > Can you access this KB article [1]? A free developer account should
> allow you
> > to.
> >
> > It pretty much describes what you're facing and suggests trimming the sync
> > error log of recovering shards; actually, every log entry **except the
> > last one**.
> >
> > 1/ Identify the shards with the most sync error log entries:
> >
> > radosgw-admin sync error list --max-entries=1000000 | grep shard_id |
> sort -n |
> > uniq -c | sort -h
> >
> > 2/ For each shard, list every sync error log entry along with their ids:
> >
> > radosgw-admin sync error list --shard-id=X
> >
> > 3/ Remove them **except the last one** with:
> >
> > radosgw-admin sync error trim --shard-id=X
> --marker=1_1682101321.201434_8669.1
> >
> > the --marker above being the log entry id.
> >
> > Are the replication threads running on the same RGWs that S3 clients are
> using?
> >
> > If so, using dedicated RGWs for the sync job might help you avoid
> > non-recovering shards in the future, as described in Matthew's post [2].
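> >
> > For instance, a minimal sketch of that split in ceph.conf (the instance
> > name is a placeholder; the dedicated sync RGWs keep the default
> > rgw_run_sync_thread = true):
> >
> > [client.rgw.client-facing-instance]
> >     rgw_run_sync_thread = false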
> >
> > Regards,
> > Frédéric.
> >
> > [1] https://access.redhat.com/solutions/7023912
> > [2] https://www.spinics.net/lists/ceph-users/msg83988.html
> >
> > ----- On 12 Mar 25, at 11:15, Danish Khan <danish52.jmi@xxxxxxxxx> wrote:
> >
> >> Dear All,
> >>
> >> My Ceph cluster has been giving a Large OMAP warning for approximately
> >> 2-3 months. I tried a few things like:
> >> *Deep scrub of PGs*
> >> *Compact OSDs*
> >> *Trim log*
> >> But these didn't work out (rough commands are sketched below).
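> >>
> >> Roughly, the scrub and compaction attempts were along these lines (PG and
> >> OSD IDs are placeholders for the ones reporting large omap objects):
> >>
> >> ceph pg deep-scrub <pgid>
> >> ceph tell osd.<id> compact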
> >>
> >> I guess the main issue is that 4 shards on the replication site have been
> >> stuck recovering for 2-3 months.
> >>
> >> Any suggestions are highly appreciated.
> >>
> >> Sync status:
> >> root@drhost1:~# radosgw-admin sync status
> >> realm e259e0a92 (object-storage)
> >> zonegroup 7a8606d2 (staas)
> >> zone c8022ad1 (repstaas)
> >> metadata sync syncing
> >> full sync: 0/64 shards
> >> incremental sync: 64/64 shards
> >> metadata is caught up with master
> >> data sync source: 2072ee2d32 (masterstaas)
> >> syncing
> >> full sync: 0/128 shards
> >> incremental sync: 128/128 shards
> >> data is behind on 3 shards
> >> behind shards: [7,90,100]
> >> oldest incremental change not applied:
> >> 2025-03-12T13:14:10.268469+0530 [7]
> >> 4 shards are recovering
> >> recovering shards: [31,41,55,80]
> >>
> >>
> >> Master site:
> >> 1. *root@master1:~# for obj in $(rados ls -p masterstaas.rgw.log); do
> echo
> >> "$(rados listomapkeys -p masterstaas.rgw.log $obj | wc -l) $obj";done |
> >> sort -nr | head -10*
> >> 1225387 data_log.91
> >> 1225065 data_log.86
> >> 1224662 data_log.87
> >> 1224448 data_log.92
> >> 1224018 data_log.89
> >> 1222156 data_log.93
> >> 1201489 data_log.83
> >> 1174125 data_log.90
> >> 363498 data_log.84
> >> 258709 data_log.6
> >>
> >>
> >> 2. *root@master1:~# for obj in data_log.91 data_log.86 data_log.87
> >> data_log.92 data_log.89 data_log.93 data_log.83 data_log.90; do rados
> stat
> >> -p masterstaas.rgw.log $obj; done*
> >> masterstaas.rgw.log/data_log.91 mtime 2025-02-24T15:09:25.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.86 mtime 2025-02-24T15:01:25.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.87 mtime 2025-02-24T15:02:25.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.92 mtime 2025-02-24T15:11:01.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.89 mtime 2025-02-24T14:54:55.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.93 mtime 2025-02-24T14:53:25.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.83 mtime 2025-02-24T14:16:21.000000+0530,
> size
> >> 0
> >> masterstaas.rgw.log/data_log.90 mtime 2025-02-24T15:05:25.000000+0530,
> size
> >> 0
> >>
> >> *3. ceph cluster log :*
> >> 2025-02-22T04:18:27.324886+0530 osd.173 (osd.173) 19 : cluster [WRN]
> Large
> >> omap object found. Object: 124:b2ddf551:::data_log.93:head PG:
> 124.8aafbb4d
> >> (124.d) Key count: 1218170 Size (bytes): 297085860
> >> 2025-02-22T04:18:28.735886+0530 osd.65 (osd.65) 308 : cluster [WRN]
> Large
> >> omap object found. Object: 124:f2081d70:::data_log.92:head PG:
> 124.eb8104f
> >> (124.f) Key count: 1220420 Size (bytes): 295240028
> >> 2025-02-22T04:18:30.668884+0530 mon.master1 (mon.0) 7974038 : cluster
> [WRN]
> >> Health check update: 3 large omap objects (LARGE_OMAP_OBJECTS)
> >> 2025-02-22T04:18:31.127585+0530 osd.18 (osd.18) 224 : cluster [WRN]
> Large
> >> omap object found. Object: 124:d1061236:::data_log.86:head PG:
> 124.6c48608b
> >> (124.b) Key count: 1221047 Size (bytes): 295398274
> >> 2025-02-22T04:18:33.189561+0530 osd.37 (osd.37) 32665 : cluster [WRN]
> Large
> >> omap object found. Object: 124:9a2e04b7:::data_log.87:head PG:
> 124.ed207459
> >> (124.19) Key count: 1220584 Size (bytes): 295290366
> >> 2025-02-22T04:18:35.007117+0530 osd.77 (osd.77) 135 : cluster [WRN]
> Large
> >> omap object found. Object: 124:6b9e929a:::data_log.89:head PG:
> 124.594979d6
> >> (124.16) Key count: 1219913 Size (bytes): 295127816
> >> 2025-02-22T04:18:36.189141+0530 mon.master1 (mon.0) 7974039 : cluster
> [WRN]
> >> Health check update: 5 large omap objects (LARGE_OMAP_OBJECTS)
> >> 2025-02-22T04:18:36.340247+0530 osd.112 (osd.112) 259 : cluster [WRN]
> Large
> >> omap object found. Object: 124:0958bece:::data_log.83:head PG:
> 124.737d1a90
> >> (124.10) Key count: 1200406 Size (bytes): 290280292
> >> 2025-02-22T04:18:38.523766+0530 osd.73 (osd.73) 1064 : cluster [WRN]
> Large
> >> omap object found. Object: 124:fddd971f:::data_log.91:head PG:
> 124.f8e9bbbf
> >> (124.3f) Key count: 1221183 Size (bytes): 295425320
> >> 2025-02-22T04:18:42.619926+0530 osd.92 (osd.92) 285 : cluster [WRN]
> Large
> >> omap object found. Object: 124:7dc404fa:::data_log.90:head PG:
> 124.5f2023be
> >> (124.3e) Key count: 1169895 Size (bytes): 283025576
> >> 2025-02-22T04:18:44.242655+0530 mon.master1 (mon.0) 7974043 : cluster
> [WRN]
> >> Health check update: 8 large omap objects (LARGE_OMAP_OBJECTS)
> >>
> >> Replica site:
> >> 1. *for obj in $(rados ls -p repstaas.rgw.log); do echo "$(rados
> >> listomapkeys -p repstaas.rgw.log $obj | wc -l) $obj";done | sort -nr |
> head
> >> -10*
> >>
> >> 432850 data_log.91
> >> 432384 data_log.87
> >> 432323 data_log.93
> >> 431783 data_log.86
> >> 431510 data_log.92
> >> 427959 data_log.89
> >> 414522 data_log.90
> >> 407571 data_log.83
> >> 151015 data_log.84
> >> 109790 data_log.4
> >>
> >>
> >> 2. *ceph cluster log:*
> >> grep -ir "Large omap object found" /var/log/ceph/
> >> /var/log/ceph/ceph-mon.drhost1.log:2025-03-12T14:49:59.997+0530
> >> 7fc4ad544700 0 log_channel(cluster) log [WRN] : Search the cluster log
> >> for 'Large omap object found' for more details.
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:02.078108+0530 osd.10 (osd.10)
> 21 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:b2ddf551:::data_log.93:head PG: 6.8aafbb4d (6.d) Key count: 432323
> Size
> >> (bytes): 105505884
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:02.389288+0530 osd.48 (osd.48)
> 37 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:d1061236:::data_log.86:head PG: 6.6c48608b (6.b) Key count: 431782
> Size
> >> (bytes): 104564674
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:07.166954+0530 osd.24 (osd.24)
> 24 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:0958bece:::data_log.83:head PG: 6.737d1a90 (6.10) Key count: 407571
> Size
> >> (bytes): 98635522
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:09.100110+0530 osd.63 (osd.63)
> 5 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:9a2e04b7:::data_log.87:head PG: 6.ed207459 (6.19) Key count: 432384
> Size
> >> (bytes): 104712350
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:08.703760+0530 osd.59 (osd.59)
> 11 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:6b9e929a:::data_log.89:head PG: 6.594979d6 (6.16) Key count: 427959
> Size
> >> (bytes): 103773777
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:11.126132+0530 osd.40 (osd.40)
> 24 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:f2081d70:::data_log.92:head PG: 6.eb8104f (6.f) Key count: 431508 Size
> >> (bytes): 104520406
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:13.799473+0530 osd.43 (osd.43)
> 61 :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:fddd971f:::data_log.91:head PG: 6.f8e9bbbf (6.1f) Key count: 432850
> Size
> >> (bytes): 104418869
> >> /var/log/ceph/ceph.log:2025-03-12T14:49:14.398480+0530 osd.3 (osd.3) 55
> :
> >> cluster [WRN] Large omap object found. Object:
> >> 6:7dc404fa:::data_log.90:head PG: 6.5f2023be (6.1e) Key count: 414521
> Size
> >> (bytes): 100396561
> >> /var/log/ceph/ceph.log:2025-03-12T14:50:00.000484+0530 mon.drhost1
> (mon.0)
> >> 207423 : cluster [WRN] Search the cluster log for 'Large omap object
> >> found' for more details.
> >>
> >> Regards,
> >> Danish
> >> _______________________________________________
> >> ceph-users mailing list -- [ mailto:ceph-users@xxxxxxx |
> ceph-users@xxxxxxx ]
> >> To unsubscribe send an email to [ mailto:ceph-users-leave@xxxxxxx |
> >> ceph-users-leave@xxxxxxx ]
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



