Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL

That's an unanswerable question, as I would not use Windows.  😁

Seriously, though: since it's an image-heavy database full of PDF and TIFF files, I'd do what I did on Linux when I needed to migrate/upgrade a 6TB (including indices) db from PG 9.6 to PG 14, which took four hours:
pg_dump -Fd -Z1 --jobs=16
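
Spelled out a bit more, the dump-and-restore pair might look like this; the directory path, database name, and job count below are placeholders rather than anything from the original post (parallel pg_dump requires the directory format, hence -Fd):

  # Parallel, lightly compressed dump into a directory (required for --jobs)
  pg_dump -Fd -Z1 --jobs=16 -f /path/to/dumpdir mydb

  # Parallel restore into the new cluster (target database must already exist,
  # or add --create and point -d at a maintenance database such as postgres)
  pg_restore --jobs=16 -d mydb /path/to/dumpdir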


On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home@xxxxxxxxx> wrote:
What would you use for backup if PG is hosted on Windows?

On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:
Hmm... that was a few years ago, back when v12 was new.  It took about a month (mainly because they didn't want me running exports during "office hours").

There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was able to add indices on date columns and create by-month views, roughly as sketched below.  (We migrated the dozen or so relatively small UPDATE tables on cut-over day.  On that same day, I migrated the current and previous months' data in those 120 tables.)
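
In the original account those date indices and by-month views were created on the Oracle source; the same idea in PostgreSQL syntax, with hypothetical table and column names, looks roughly like this:

  # Illustrative only: index the date column, then expose one month as a view
  psql -d livedb -c "CREATE INDEX IF NOT EXISTS orders_order_date_idx ON orders (order_date);"
  psql -d livedb -c "CREATE VIEW orders_2025_04 AS
                       SELECT * FROM orders
                       WHERE order_date >= DATE '2025-04-01'
                         AND order_date <  DATE '2025-05-01';"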

I made separate cron jobs to:
- export views from Oracle into COPY-style tab-separated flat files, 
- lz4-compress views that had finished exporting, and
- scp files that were finished compressing, to an AWS EC2 VM.

These jobs pipelined, so there was always a job exporting, always a job ready to compress tsv files, and another job ready to scp the lz4 files.  When there was nothing for a step to do, the job would sleep for a couple of minutes, then check if there was more work to do.
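
As a rough illustration (not the original cron jobs), one stage of such a pipeline could be structured like this; the directory layout and the ".done"/".ready" marker-file convention are assumptions:

  #!/bin/bash
  # Hypothetical "compress" stage: loop forever, compress any finished .tsv
  # export, then signal the scp stage with a ".ready" marker file.
  set -u
  EXPORT_DIR=/data/export          # the exporter writes foo.tsv plus foo.tsv.done
  while true; do
      work_found=no
      for marker in "$EXPORT_DIR"/*.tsv.done; do
          [ -e "$marker" ] || continue           # no matches: glob stays literal
          tsv=${marker%.done}
          lz4 -q "$tsv" "$tsv.lz4" || { echo "lz4 failed on $tsv" >&2; exit 1; }
          touch "$tsv.lz4.ready"                 # tells the scp job this file is complete
          rm -f "$marker" "$tsv"
          work_found=yes
      done
      # Nothing to do: sleep a couple of minutes, then check for more work.
      [ "$work_found" = no ] && sleep 120
  done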

On the AWS EC2 VM, a different cron job waited for files to finish transferring, then loaded them into the correct table.  Just like the source-host jobs, the "load" job would sleep a bit and then check for more work.  I applied indices manually.
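
The loading side might look roughly like the following (again a hedged sketch, not the original job; the file-naming scheme, table mapping, and database name are assumptions):

  #!/bin/bash
  # Hypothetical loader: wait for fully transferred .lz4 files, decompress them,
  # and feed each one to a client-side COPY (\copy) on the target database.
  set -u
  INBOX=/data/inbox                # scp target directory on the EC2 VM
  DB=archive_db
  while true; do
      loaded=no
      for ready in "$INBOX"/*.lz4.ready; do
          [ -e "$ready" ] || continue
          lz4file=${ready%.ready}
          tsv=${lz4file%.lz4}
          table=$(basename "$tsv" .tsv)          # e.g. orders.tsv.lz4 -> orders
          lz4 -dqf "$lz4file" "$tsv"
          psql -d "$DB" -c "\copy $table FROM '$tsv'" \
              || { echo "load failed for $tsv" >&2; exit 1; }
          rm -f "$ready" "$lz4file" "$tsv"
          loaded=yes
      done
      [ "$loaded" = no ] && sleep 120            # nothing yet: nap, then recheck
  done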

The AWS RDS PG12 database was about 4TB.  Snapshots were handled by AWS.  If this had been one of my on-prem systems, I'd have used pgbackrest.  (pgbackrest is impressively fast: takes good advantage of PG's 1GB file max, and globs "small" files into one big file.)
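
For anyone who hasn't used pgbackrest, a minimal setup is roughly the following; the stanza name, paths, retention, and parallelism here are placeholder values, and WAL archiving must already point at pgbackrest:

  # /etc/pgbackrest/pgbackrest.conf -- illustrative values only
  #   [global]
  #   repo1-path=/var/lib/pgbackrest
  #   repo1-retention-full=2
  #   process-max=8
  #   [main]
  #   pg1-path=/var/lib/postgresql/15/main
  #
  # postgresql.conf needs: archive_mode = on
  #                        archive_command = 'pgbackrest --stanza=main archive-push %p'

  pgbackrest --stanza=main stanza-create      # initialize the repository for this cluster
  pgbackrest --stanza=main check              # verify config and WAL archiving
  pgbackrest --stanza=main --type=full backup # first full backup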

On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home@xxxxxxxxx> wrote:
What was the duration, start to finish, of the migration of the 6TB of data?  And what do you use for a quick backup once the PG data has been archived?

Thanks.

On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:
On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535@xxxxxxxxx> wrote:
Hi Team,

We are currently planning a data archival initiative for our production PostgreSQL databases and would appreciate suggestions or insights from the community regarding best practices and proven approaches.

**Scenario:**
- We have a few large tables (several hundred million rows) where we want to archive historical data (e.g., older than 1 year).
- The archived data should be moved to a separate PostgreSQL database (on the same or a different server).
- Our goals are: efficient data movement, minimal downtime, and safe deletion from the source after successful archival.

- PostgreSQL version: 15.12
- Both source and target databases are PostgreSQL.

We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded to a SharePoint or similar storage system. However, our infrastructure team raised concerns around the computational load of large CSV processing and potential security implications with file transfers.

We’d like to understand:
- What approaches have worked well for you in practice?

This is how I migrated 6TB of data from an Oracle database to PostgreSQL, and then implemented quarterly archiving of the PG database (a sketch follows the list):
- COPY (SELECT * FROM live_table WHERE date_fld in some_manageable_date_range) TO STDOUT.
- Compress.
- scp.
- COPY archive_table FROM STDIN.
- Index.
- DELETE FROM live_table WHERE date_fld in some_manageable_date_range.  (This step I only did in the PG archive process.)
 
(Naturally, the Oracle migration used Oracle-specific commands.)
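
A hedged sketch of one pass of that PG-to-PG archive flow, for a single hypothetical table and date range (table, host, path, and database names are placeholders, not the original script):

  #!/bin/bash
  # Illustrative one-table pass: export a date slice, compress, ship, load, index, delete.
  set -euo pipefail
  TABLE=orders DATE_COL=order_date
  START='2024-01-01' END='2024-04-01'
  SRC_DB=livedb ARCH_HOST=archive.example.com ARCH_DB=archivedb
  F=/tmp/${TABLE}_${START}.tsv
  BASE=$(basename "$F")

  # 1. COPY the slice out of the live table (client-side, default tab-separated text).
  psql -d "$SRC_DB" -c "\copy (SELECT * FROM $TABLE WHERE $DATE_COL >= '$START' AND $DATE_COL < '$END') TO '$F'"

  # 2. Compress, 3. ship to the archive host.
  lz4 -q "$F" "$F.lz4"
  scp "$F.lz4" "$ARCH_HOST:/data/inbox/"

  # 4. Load into the archive table, 5. index it (date column at minimum).
  ssh "$ARCH_HOST" "lz4 -dqf /data/inbox/$BASE.lz4 /data/inbox/$BASE && psql -d $ARCH_DB -c \"\\copy ${TABLE}_archive FROM '/data/inbox/$BASE'\""
  ssh "$ARCH_HOST" "psql -d $ARCH_DB -c 'CREATE INDEX IF NOT EXISTS ${TABLE}_archive_${DATE_COL}_idx ON ${TABLE}_archive ($DATE_COL)'"

  # 6. Only after the load succeeded: delete the slice from the live table.
  psql -d "$SRC_DB" -c "DELETE FROM $TABLE WHERE $DATE_COL >= '$START' AND $DATE_COL < '$END'"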

- Are there specific tools or strategies you’d recommend for ongoing archival?

I write generic bash loops to which you pass an array that contains the table name, PK, date column and date range.

Given a list of tables, the first script did the COPY out, the lz4, and the scp.  Once that finished successfully, another script dropped the indices on the relevant archive table, ran the COPY in, and then ran the CREATE INDEX statements.  A third script did the deletes.

This works even when the live database tables are all connected via FK.  You just need to carefully order the tables in your script.
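
A hedged sketch of what such a generic driver might look like (the per-table records, field order, and names below are illustrative; the original scripts aren't shown in this thread):

  #!/bin/bash
  # Illustrative driver: one "table|pk|date_column|start|end" record per line,
  # ordered to respect the FK dependencies mentioned above.
  set -euo pipefail
  SRC_DB=livedb
  TABLES=(
      "order_items|order_item_id|created_at|2024-01-01|2024-04-01"
      "orders|order_id|order_date|2024-01-01|2024-04-01"
  )

  for rec in "${TABLES[@]}"; do
      IFS='|' read -r table pk date_col start end <<<"$rec"   # pk unused in this stage
      out=/tmp/${table}_${start}.tsv
      # Export the slice; check the return code before moving on.
      psql -d "$SRC_DB" -c "\copy (SELECT * FROM $table WHERE $date_col >= '$start' AND $date_col < '$end') TO '$out'" \
          || { echo "export failed for $table" >&2; exit 1; }
      lz4 -q "$out" "$out.lz4" && rm -f "$out"
      # scp/load/index/delete stages would follow, as in the earlier sketches
  done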
 
- Any performance or consistency issues we should watch out for?

My rules for scripting are "bite-sized pieces" and "check those return codes!".
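
In bash terms that usually boils down to something like this (illustrative):

  # Fail fast, and make a failure anywhere in a pipeline count as a failure.
  set -euo pipefail

  # Or check explicitly where you want a custom action:
  if ! psql -d livedb -c "\copy orders TO '/tmp/orders.tsv'"; then
      echo "export failed; not proceeding to the delete step" >&2
      exit 1
  fi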
 
Your insights or any relevant documentation/pointers would be immensely helpful.
 
Index support above all else.  When deleting from a table that has no date field of its own but is linked by foreign key to a table which _does_ have one, don't hesitate to join on that table.

And DELETE of bite-sized chunks is faster than people give it credit for.
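
For example, a batched delete that takes its date bound from the FK parent might look like this (table and column names are purely illustrative):

  #!/bin/bash
  # Illustrative batched delete: the child table has no date column, so the date
  # bound comes from its FK parent; repeat until a pass deletes nothing.
  set -euo pipefail
  DB=livedb START='2024-01-01' END='2024-04-01'
  while : ; do
      deleted=$(psql -d "$DB" -At -c "
          WITH doomed AS (
              SELECT oi.order_item_id
              FROM order_items oi
              JOIN orders o ON o.order_id = oi.order_id
              WHERE o.order_date >= '$START' AND o.order_date < '$END'
              LIMIT 50000),
          del AS (
              DELETE FROM order_items
              WHERE order_item_id IN (SELECT order_item_id FROM doomed)
              RETURNING 1)
          SELECT count(*) FROM del;")
      echo "deleted $deleted rows in this chunk"
      if [ "$deleted" -eq 0 ]; then break; fi
  done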

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


