Hi again,
Thanks very much for the replies last week. We’ve been continuing
to investigate this problem, and I thought I’d share an update on
where we are.
To recap: looking at our backup from 2025-06-26 via pageinspect,
we have btree index rows which point either to non-existent heap
TIDs, or to heap TIDs whose data does not correspond to the index
row. In fact it looks like we have entire index pages which point
only to non-existent heap TIDs.
(I previously said that these index rows were marked as ‘dead’ in
the backup. We now suspect this is an artifact of the restore
process: we believe they are live in the backup, but were marked
as dead during the restore.)
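(For reference, the kind of pageinspect query we've been using to
look at this is roughly the sketch below; the block number, table
name and TID are placeholders rather than real values from our
database.)

    -- Inspect the item pointers on one leaf page of the affected index
    -- (123456 is a placeholder block number; the dead/htid columns are
    -- only present on recent Postgres versions).
    SELECT itemoffset, ctid, dead, htid, data
    FROM bt_page_items('state_groups_state_type_idx', 123456);

    -- Then look up the heap TID that an index item claims to point at
    -- (table name assumed; '(654321,7)' is a placeholder TID).
    SELECT *
    FROM state_groups_state
    WHERE ctid = '(654321,7)';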
Empirically, and surprisingly to us, when a SELECT reads via an
index entry that points to a non-existent heap TID, that entry is
quietly ignored rather than raising any kind of error.
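(One way to see this empirically is something like the sketch
below, which forces a sequential scan and compares the result
against the default index plan; the table/column names and the
value are illustrative only.)

    -- Force a sequential scan for comparison.
    BEGIN;
    SET LOCAL enable_indexscan = off;
    SET LOCAL enable_indexonlyscan = off;
    SET LOCAL enable_bitmapscan = off;
    SELECT count(*) FROM state_groups_state WHERE state_group = 353864583;
    ROLLBACK;

    -- Versus letting the planner use the (suspect) index as usual.
    SELECT count(*) FROM state_groups_state WHERE state_group = 353864583;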
We therefore suspect that this index corruption has been present
for some time (possibly years); more recently those non-existent
heap TIDs have been recycled, and that is when we have noticed the
effects of the problem.
As far as we can tell, the corruption only affects one index on
one table, and only a specific region of that index/table.
Specifically, it only appears to affect rows which would have been
inserted between 2018 and January 2021. At least 1B rows appear to
be affected (the table as a whole has 29B rows).
One thing that surprised us is that ‘amcheck’ didn’t find any sign
of the corruption. We’re not completely sure whether this is
because we are holding it wrong, or because this kind of
corruption is simply out of scope for amcheck. Any advice on
this, or suggestions for
other tooling we could use to check the consistency of our other
indexes, would be much appreciated.
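(For completeness, beyond the pg_amcheck run quoted below, the
SQL-level checks we know of look roughly like this; as we
understand it, the heapallindexed option verifies that every heap
tuple has a matching index entry rather than the reverse, which
may be why it says nothing about extra/dangling index entries.)

    -- A sketch of the SQL-level amcheck calls (bt_index_parent_check
    -- takes stronger locks, so probably only suitable against a copy;
    -- rootdescend is only available on recent versions).
    CREATE EXTENSION IF NOT EXISTS amcheck;

    SELECT bt_index_check('state_groups_state_type_idx',
                          heapallindexed => true);

    SELECT bt_index_parent_check('state_groups_state_type_idx',
                                 heapallindexed => true,
                                 rootdescend => true);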
We’re still very interested in trying to understand the root cause
of the corruption, mostly to confirm that it’s not an ongoing
problem. Thanks Tom for the suggestion of
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4934d3875.
We agree with your assessment that this is unlikely. For one
thing, it looks like that bug could only conceivably cause this
corruption if it affected an UPDATE query, and we’re reasonably
sure we never do any UPDATE queries on that table. (The table is
mostly append-only. We do sometimes run cleanup/compression jobs
which amount to large numbers of interleaved DELETEs and INSERTs,
but no UPDATEs.)
Back in 2021, we were running Postgres 10.11. We’ve taken a pass
through the release notes since then to see if we can find any
likely-looking bugs. We found the one that causes BRIN index
corruption (this is not a BRIN index), and the one that causes
CREATE INDEX CONCURRENTLY to end up with too *few* entries (our
index has the opposite problem), but nothing that looks like a
particularly likely candidate. Any other suggestions would be
welcome here.
At the moment, a historical hardware-level problem seems like it
might be the most likely culprit, though we are a bit mystified
about how any hardware failure could have caused such widespread
damage to a single index, whilst apparently leaving the rest of
the database intact.
Any thoughts or suggestions are very much appreciated.
Thanks,
Erik
On Fri, 4 Jul 2025, 15:38 Ron Johnson, <ronljohnsonjr@xxxxxxxxx> wrote:
On Fri, Jul 4, 2025 at 9:49 AM Erik Johnston <erikj@xxxxxxxxxx> wrote:
Hi, a quick update:
- We have discovered that the corruption was present from before libicu update.
- We ran `pg_amcheck --index state_groups_state_type_idx --heapallindexed matrix`, which returned nothing
- We believe that means (and it matches what we see when sampling) that the index has gained extra entries, i.e. for a given state group it returns all the relevant rows in the table, plus extra rows.
We are also seeing old state groups starting to point at rows that have only just been inserted. For example, querying for 353864583 on the primary returns that row plus four rows that were inserted today, whereas on the backup from last week an index-only scan for 353864583 returns only one row. This makes it feel like the corruption is ongoing? Nothing should have modified that state group in the interim (they are generally immutable).
Naively, this feels like, when inserting a new row, we sometimes add the row to the index twice: once pointing from the correct state group to the new row, and once from an old state group to the new row?
Are checksums enabled in the instance?
Alas not.
We've also now found that the index on the backup does in fact point to those ctids after all, but they are marked as dead. So at some point between then and when we inserted the new row at that ctid today, those entries were un-marked as dead.