On 19/05/25 20:49, Achilleas Mantzios wrote:
On 19/5/25 17:38, Moreno Andreo wrote:
On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving away from our old binary data approach: moving the data
from bytea fields in a table to external storage (making the database
smaller and related operations faster and smarter).
In short, we have a job that runs in the background, copies the data
from the table to an external file and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operation, in a table that's less than
one tenth of its original size.
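For illustration, a minimal sketch of the per-row offload step (table and
column names are placeholders, and the file is assumed to be already
uploaded to the bucket); a plain VACUUM between batches lets the freed
space be reused without rewriting the table:

    -- one transaction per record, so an error only affects that record
    BEGIN;
    UPDATE tbl
       SET blob = NULL,
           ref  = 'path/to/file'   -- bucket object path written for this record
     WHERE id = <uuid>;
    COMMIT;

    -- after a batch of such rows, a plain (non-FULL) VACUUM marks the dead
    -- tuples and their TOAST space as reusable without an exclusive lock:
    VACUUM (ANALYZE) tbl;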
We have a multi-tenant architecture (100s of schemas
with identical architecture, all inheriting from public)
and we are performing the task on one table per schema.
So? TOASTed data is kept in separate TOAST tables; unless those bytea
columns are selected, you won't even touch them.
I cannot understand what you are trying to achieve here.
Years ago, when I made the mistake of going for a coffee and
letting my developers "improvise", the result was a design
similar to what you are trying to achieve. Years later, I
am seriously considering moving those data back into
PostgreSQL.
The "related operations" I was talking about are backups and
database maintenance when needed, cluster/replica
management, etc. With a smaller database size they would be
easier in timing and effort, right?
Ok, but you'll lose replica functionality for those blobs,
which means you don't care about them, correct me if I am
wrong.
I'm not saying I don't care about them, quite the opposite: they are
protected with Object Versioning and soft deletion, which should
ensure good protection against e.g. ransomware, if someone
manages to get in there (and if that happens, we'll have bigger
troubles than this).
PostgreSQL has become very popular because of people who care about
their data.
Yeah, it's always been famous for its robustness, and that's why I
chose PostgreSQL more than 10 years ago, and, in spite of how a
"normal" user treats his PC, we never had corruption (only where
the FS/disk was failing, but that's not PG's fault).
We are mostly talking about costs here. To call things by their
name, I'm moving bytea contents (85% of the total data) to
files in Google Cloud Storage buckets, which cost a fraction
of the disks holding my database (on GCE, to be clear).
May I ask the size of the bytea data (uncompressed)?
Single records vary from 150 kB to 80 MB; the grand total is more
than 8.5 TB in a circa 10 TB data footprint.
This data is not accessed frequently (just by the owner when he
needs to), so there is no need to keep it on expensive hardware.
Over the years I've read that keeping many big bytea
fields in a database is not recommended, but I might have
misunderstood this.
Ok, I assume those are unimportant data, but let me ask:
what is the longevity or expected legitimacy of those? I
haven't worked with those, just reading:
https://cloud.google.com/storage/pricing#storage-pricing
Would you choose e.g. "Anywhere Cache storage"?
Absolutely not, this is *not* unimportant data. We are using
Standard Storage, at $0.02/GB/month plus operations, which compared
to $0.17/GB/month for an SSD (or even more for the Hyperdisks we
are using) is a good price drop.
How about hosting your data in your own storage and spending
$0/GB/month?
If we could host on our own hardware I wouldn't be here talking. Maybe
we would have a 10-node full-mesh multimaster architecture with
barman backups on 2 separate SANs.
But we are a small company that has to balance performance,
consistency, security and, last but not least, costs. And margins
are tightening.
Another way would have been to move these tables to a different
tablespace on cheaper storage, but that would still have been
3 times the bucket cost.
Can you actually mount those Cloud Storage buckets under a
supported FS in Linux and just move them to tablespaces
backed by this storage?
Never tried for tablespaces; I mounted this via FUSE and did some
simple operations in the past, but I'm not sure it can handle
database operations in terms of I/O bandwidth.
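For reference, a rough sketch of what that tablespace idea would look like
(bucket name, mount point and tablespace name are made up; a gcsfuse mount
is not a fully POSIX filesystem, so it is likely unsuitable for real
database I/O):

    -- sketch only: mount the bucket first (shell, not SQL):
    --   gcsfuse my-blob-bucket /mnt/gcs-blobs
    -- then, in theory, a tablespace could live on that mount point:
    CREATE TABLESPACE cheap_blobs LOCATION '/mnt/gcs-blobs/pg_tblspc';
    ALTER TABLE tbl SET TABLESPACE cheap_blobs;  -- rewrites the whole table, heavy I/O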
Why are you considering getting the data back into database tables?
Because now, if we need to migrate from cloud to on-premise, or
just upgrade or move the specific server which holds those
data, I will have an extra headache. Also, this is a single
point of failure, or at best a cause of fragmented
technology introduced just for the sake of keeping things out
of the DB.
This is managed as a hierarchical disk structure, so the
calling server may be literally anywhere; it just needs an
account (or a service account) to get in there.
And you are locked into a proprietary solution, and at their mercy
for any future increases in cost.
Since we cannot host on our own hardware, the only thing to do is keep an
eye on costs and migrate (yeah, more work) when it becomes too
expensive. Every solution is proprietary if you want to run in the
cloud, even the VMs where PostgreSQL is running.
The problem is: this is generating
BIG table bloat, as you may imagine.
Running a VACUUM FULL on an ex-22 GB table on a
standalone test server is almost immediate.
If I had only one server, I'd process one table at a time,
with a nightly script, and issue a VACUUM FULL on tables
that have already been processed.
But I'm in a logical replication architecture (we are
using a multimaster system called pgEdge, but I don't
think it makes a big difference, since it's based on
logical replication), and I'm building a test cluster.
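For the single-server case, such a nightly pass could be generated from
psql; this is only a sketch, assuming the offloaded table has the same
placeholder name, docs_blob, in every tenant schema:

    -- build one VACUUM FULL per tenant schema and run them with \gexec;
    -- VACUUM FULL takes an ACCESS EXCLUSIVE lock, so only in a quiet window.
    SELECT format('VACUUM FULL %I.%I', schemaname, tablename)
    FROM   pg_tables
    WHERE  tablename = 'docs_blob'                         -- placeholder name
    AND    schemaname NOT IN ('pg_catalog', 'information_schema')
    \gexec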
So you use pgEdge, but you're willing to lose all the benefits of
multi-master, since your binary data won't be replicated
...
I don't think I need it to be replicated, since this data
cannot be "edited": either it's there or it's been
deleted. Buckets have protections against data deletion or
events like ransomware attacks and such.
Also, multi-master was an absolute requirement one year ago
because of a project we were building, but it has been
abandoned and now simple logical replication would be
enough; but let's do one thing at a time.
Multi-master is cool, you can configure your pooler / clients
to take advantage of it for a fully load-balanced architecture,
but if it's not a strict requirement, you can live without it, as
so many of us do, and employ other means of load balancing the
reads.
That's what we are doing, and it's a really cool feature, but I've
experienced (maybe because it uses the old pglogical extension) that
the replication is a bit fragile, especially when dealing with
those bytea fields: when I ingest big loads, say 25-30 GB or
more, it has happened that replication broke. Recreating a
replica from scratch with "normal size" tables is not a big
deal, since it can be achieved automatically (they normally fit
in shared memory and can be transferred by the replicator), but
you can imagine the effort and the downtime needed to create a
base backup, transfer it to the replica, rebuild the DB and
restart a 10 TB database (at the moment we are running a
2-node cluster).
Break this into batches; use modern techniques for robust data
loading, in smaller transactions, if you have to.
Normally it's run via COPY commands; I can throttle COPY or break it
into batches. At the moment, while the schema is offline, we
disconnect replication for the bytea tables, feed them, wait for
checkpoints to return to normal and then resume replication on those
tables before putting the schema back online. This is safe, even if
far from optimized. It's a migration tool, it won't be used forever,
just to move customers from their current architecture to the new
cloud one.
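Roughly, that per-table pause/resume could look like the sketch below,
assuming spock's replication-set functions (function names and arguments
should be verified against the installed spock version; the set and table
names are placeholders):

    -- take the blob table out of the replication set before the big load:
    SELECT spock.repset_remove_table('default', 'tenant1.docs_blob');

    -- ... run the COPY-based load in batches ...

    -- put it back; synchronize_data controls whether existing rows are
    -- copied to subscribers on re-add:
    SELECT spock.repset_add_table('default', 'tenant1.docs_blob', synchronize_data := true);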
I've been instructed to issue
VACUUM FULL on both nodes, nightly, but before
proceeding I read in the docs that VACUUM FULL can disrupt
logical replication, so I'm a bit concerned about how to
proceed. Rows are cleared one at a time (one transaction,
one row, to confine errors to the record that caused them).
Mind sharing the specific doc?
Obviously I can't find it with a quick search; I'll search
deeper, I don't think I dreamed it :-).
pgEdge is based on the old pglogical, the old 2ndQuadrant
extension, not the native logical replication we have had
since PostgreSQL 10. But I might be mistaken.
I don't know about this; it keeps running on the latest PG
versions (we are about to upgrade to 17.4, if I'm not
wrong), but I'll ask.
I've read about extensions like
pg_squeeze, but I wonder whether they too might be dangerous
for replication.
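For completeness, this is roughly how pg_squeeze would be invoked (a sketch
only; it rewrites the table via logical decoding with only a brief exclusive
lock at the end, but whether it coexists safely with pgEdge/spock would need
testing; schema and table names are placeholders):

    -- requires shared_preload_libraries = 'pg_squeeze' and wal_level = logical
    CREATE EXTENSION IF NOT EXISTS pg_squeeze;

    -- one-off, ad-hoc rewrite of a bloated table:
    SELECT squeeze.squeeze_table('tenant1', 'docs_blob');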
What's pgEdge's take on that, I mean the bytea thing you are
trying to achieve here?
They are positive; they're the ones who suggested doing VACUUM
FULL on both nodes... I'm quite new to replication, so I'm
looking for some advice here.
As I told you, pgEdge logical replication (old 2ndQuadrant
BDR) != native logical replication. You may look here:
https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla
PostgreSQL and focus on standard physical and logical
replication.
No, multimaster is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first
step, actually. We are planning to return to plain PostgreSQL,
or Cloud SQL for PostgreSQL, using logical replication (which
seems the more reliable of the two). We created a test case for
both options, and they seem OK for now, even if I still have
to do adequate stress tests. And when I do the migration, I'd
like to migrate plain data only and leave the blobs where they
are.
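If/when you land on vanilla PostgreSQL, the native equivalent is a
publication on the source and a subscription on the target; a minimal
sketch with placeholder names:

    -- on the source node:
    CREATE PUBLICATION app_pub FOR TABLES IN SCHEMA tenant1;

    -- on the target node; copy_data = true performs the initial table sync:
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=source-host dbname=appdb user=replicator'
        PUBLICATION app_pub
        WITH (copy_data = true);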
As you wish. But this design has inherent data infrastructure
fragmentation, as you understand.
Personally I like to let the DB take care of the data, and I
take care of the DB, not a plethora of extra systems that we
need to keep connected and consistent.
We followed this idea when the application (old version) was on
customer premises, so backups and operations were simple, and getting
into trouble (e.g. a customer deleting a directory from their PC)
happened only a very few times, just when they launched Disk Cleanup
on Windows :-)
Now we host a full cloud solution, so we got rid of many potential
problems generated by the end user, but bumped into others, as you
can certainly imagine. We have to keep it consistent, fast and
reliable, while keeping an eye on costs.
You are right, but the more I work with this solution, the
more I have the impression of dealing with something heavy, hard
to maintain because of these rarely accessed files that make up most
of my data. Maybe it's just my impression; maybe I need some
expertise in an area that's still quite new to me.
At the moment this seems a good compromise between stability and
costs. Maybe in a year I'll be in your position (considering
moving everything back), but for now we are moving forward in
that way.
That being said, the original question :-)
Would VACUUM FULL be a risky operation? Does it have to be done on all
nodes, obviously in a low-traffic and low-access window (at night)?
Thanks for your help.
Moreno.-