Keith Fiske: Auto-archiving and Data Retention Management in Postgres with pg_partman

You could be saving money every month on databases costs with a smarter data
retention policy. One of the primary reasons, and a huge benefit of partitioning
is using it to automatically archive your data. For example, you might have a
huge log table. For business purposes, you need to keep this data for 30 days.
This table grows continually over time and keeping all the data makes database
maintenance challenging. With time-based partitioning, you can simply archive
off data older than 30 days.

The nature of most relational databases means that deleting large volumes of
data can be very inefficient and that space is not immediately, if ever,
returned to the file system. PostgreSQL does not return the space it reserves to
the file system when normal deletion operations are run except under very
specific conditions:

  1. the page(s) at the end of the relation are completely emptied
  2. a VACUUM FULL/CLUSTER is run against the relation (exclusively locking it
    until complete)

If you find yourself needing that space back more immediately, or without
intrusive locking, then partitioning can provide a much simpler means of
removing old data: drop the table. The removal is nearly instantaneous (barring
any transactions locking the table) and immediately returns the space to the
file system. pg_partman, the Postgres extension for partitioning, provides a
very easy way to manage this for time and integer based partitioning.

pg_partman daily partition example

Recently pg_partman 5.1 was released
that includes new features such as list partitioning for single value integers,
controlled maintenance run ordering, and experimental support for numeric
partitioning. This new version also includes several bug fixes, so please update
to the latest release when possible! All examples were done using this latest
version.

https://github.com/pgpartman/pg_partman

First lets get a simple, time-based daily partition set going

CREATE TABLE public.time_stuff
    (col1 int
        , col2 text default 'stuff'
        , col3 timestamptz NOT NULL DEFAULT now() )
    PARTITION BY RANGE (col3);
SELECT partman.create_parent('public.time_stuff', 'col3', '1 day');
d+ time_stuff
                                             Partitioned table "public.time_stuff"
 Column |           Type           | Collation | Nullable |    Default    | Storage  | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
 col1   | integer                  |           |          |               | plain    |             |              |
 col2   | text                     |           |          | 'stuff'::text | extended |             |              |
 col3   | timestamp with time zone |           | not null | now()         | plain    |             |              |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240408 FOR VALUES FROM ('2024-04-08 00:00:00-04') TO ('2024-04-09 00:00:00-04'),
            time_stuff_p20240409 FOR VALUES FROM ('2024-04-09 00:00:00-04') TO ('2024-04-10 00:00:00-04'),
            time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
            time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
            time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
            time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
            time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
            time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
            time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
            time_stuff_default DEFAULT

Setting data retention policies

This partition set was created on April 12, 2024, so a default setup will create
4 partitions before and 4 partitions after. The first setting to configure for
retention, and the only one that is required, is the retention column in
the part_config table. For this example, we’ll set a retention of 2 days.
We’re also going to increase the premake value just to see that normal
maintenance is working as well.

UPDATE partman.part_config SET retention = '2 days', premake = 6 WHERE parent_table = 'public.time_stuff';

By default, pg_partman also does not create new child tables if there is no data
in the partition set, so lets add some data in as well.

INSERT INTO public.time_stuff (col1, col3)
VALUES (generate_series(1,10), CURRENT_TIMESTAMP);
SELECT * FROM partman.part_config
WHERE parent_table = 'public.time_stuff';

-[ RECORD 1 ]--------------+-----------------------------------
parent_table               | public.time_stuff
control                    | col3
partition_interval         | 1 day
partition_type             | range
premake                    | 6
automatic_maintenance      | on
template_table             | partman.template_public_time_stuff
retention                  | 2 days
retention_schema           |
retention_keep_index       | t
retention_keep_table       | t
epoch                      | none
constraint_cols            |
optimize_constraint        | 30
infinite_time_partitions   | f
datetime_string            | YYYYMMDD
jobmon                     | t
sub_partition_set_full     | f
undo_in_progress           | f
inherit_privileges         | f
constraint_valid           | t
ignore_default_data        | t
default_table              | t
date_trunc_interval        |
maintenance_order          |
retention_keep_publication | f
maintenance_last_run       |

In pg_partman, retention management is handled at the same time as new partition
creation. So a simple call to run_maintenance_proc() will handle both.

CALL partman.run_maintenance_proc();
d+ time_stuff
                                             Partitioned table "public.time_stuff"
 Column |           Type           | Collation | Nullable |    Default    | Storage  | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
 col1   | integer                  |           |          |               | plain    |             |              |
 col2   | text                     |           |          | 'stuff'::text | extended |             |              |
 col3   | timestamp with time zone |           | not null | now()         | plain    |             |              |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
            time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
            time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
            time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
            time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
            time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
            time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
            time_stuff_p20240417 FOR VALUES FROM ('2024-04-17 00:00:00-04') TO ('2024-04-18 00:00:00-04'),
            time_stuff_p20240418 FOR VALUES FROM ('2024-04-18 00:00:00-04') TO ('2024-04-19 00:00:00-04'),
            time_stuff_default DEFAULT

Now you can see the two partitions older than 2 days ago have been removed and
two new partitions have been created to include 6 days ahead. There are some
other more advanced options for retention available in pg_partman as well.
You’ll see above that the retention_keep_table option is set to true by
default. This means that while the child tables are no longer part of the
retention set, those tables do still exist in the database. pg_partman tries to
keep all default options set in a manner to reduce accidental data loss.

dt public.time_stuff*
                     List of relations
 Schema |         Name         |       Type        | Owner
--------+----------------------+-------------------+-------
 public | time_stuff           | partitioned table | keith
 public | time_stuff_default   | table             | keith
 public | time_stuff_p20240408 | table             | keith
 public | time_stuff_p20240409 | table             | keith
 public | time_stuff_p20240410 | table             | keith
 public | time_stuff_p20240411 | table             | keith
 public | time_stuff_p20240412 | table             | keith
 public | time_stuff_p20240413 | table             | keith
 public | time_stuff_p20240414 | table             | keith
 public | time_stuff_p20240415 | table             | keith
 public | time_stuff_p20240416 | table             | keith
 public | time_stuff_p20240417 | table             | keith
 public | time_stuff_p20240418 | table             | keith

Dropping tables and indexes

If you’d like these tables to actually be dropped, you can set
the retention_keep_table to false. Or if you’d like to keep the tables live in
the database, but don’t need the indexes taking up space anymore, you can
leave retention_keep_table set to true, but set retention_keep_index false
instead. In the example below, I have reset the partition set back to its
original state after running create_parent() and then running this update.

UPDATE partman.part_config
SET retention = '2 days', premake = 6, retention_keep_table = false
WHERE parent_table = 'public.time_stuff';

CALL partman.run_maintenance_proc();

Now if we look at the tables that actually exist, we can see the oldest two
tables are gone.

dt public.time*
                     List of relations
 Schema |         Name         |       Type        | Owner
--------+----------------------+-------------------+-------
 public | time_stuff           | partitioned table | keith
 public | time_stuff_default   | table             | keith
 public | time_stuff_p20240410 | table             | keith
 public | time_stuff_p20240411 | table             | keith
 public | time_stuff_p20240412 | table             | keith
 public | time_stuff_p20240413 | table             | keith
 public | time_stuff_p20240414 | table             | keith
 public | time_stuff_p20240415 | table             | keith
 public | time_stuff_p20240416 | table             | keith
 public | time_stuff_p20240417 | table             | keith
 public | time_stuff_p20240418 | table             | keith

Retention outside the database

Another scenario is if you don’t need the data live in the database, but you
still want to keep a backup of it outside of the database. In this case, we’re
going to use the retention_schema option which detaches the child tables from
the partition set and then moves them to the schema named in this option. Again,
the partition set has been reset to the initial state
after create_parent() and then we run this:

CREATE SCHEMA old_tables;

UPDATE partman.part_config
SET retention = '2 days', retention_schema = 'old_tables'
WHERE parent_table = 'public.time_stuff';

CALL partman.run_maintenance_proc();

Now we can see that the old tables are no longer in the partition set, but are
now in the old_tables schema.

d+ time_stuff
                                             Partitioned table "public.time_stuff"
 Column |           Type           | Collation | Nullable |    Default    | Storage  | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------------+----------+-------------+--------------+-------------
 col1   | integer                  |           |          |               | plain    |             |              |
 col2   | text                     |           |          | 'stuff'::text | extended |             |              |
 col3   | timestamp with time zone |           | not null | now()         | plain    |             |              |
Partition key: RANGE (col3)
Partitions: time_stuff_p20240410 FOR VALUES FROM ('2024-04-10 00:00:00-04') TO ('2024-04-11 00:00:00-04'),
            time_stuff_p20240411 FOR VALUES FROM ('2024-04-11 00:00:00-04') TO ('2024-04-12 00:00:00-04'),
            time_stuff_p20240412 FOR VALUES FROM ('2024-04-12 00:00:00-04') TO ('2024-04-13 00:00:00-04'),
            time_stuff_p20240413 FOR VALUES FROM ('2024-04-13 00:00:00-04') TO ('2024-04-14 00:00:00-04'),
            time_stuff_p20240414 FOR VALUES FROM ('2024-04-14 00:00:00-04') TO ('2024-04-15 00:00:00-04'),
            time_stuff_p20240415 FOR VALUES FROM ('2024-04-15 00:00:00-04') TO ('2024-04-16 00:00:00-04'),
            time_stuff_p20240416 FOR VALUES FROM ('2024-04-16 00:00:00-04') TO ('2024-04-17 00:00:00-04'),
            time_stuff_default DEFAULT
dt old_tables.*
                 List of relations
   Schema   |         Name         | Type  | Owner
------------+----------------------+-------+-------
 old_tables | time_stuff_p20240408 | table | keith
 old_tables | time_stuff_p20240409 | table | keith

To store these tables “offline” outside of the database, we can use a python
script provided by pg_partman to dump all tables in a given schema. It’s not
tied in any way to the partition configuration or the partition set, so this
script can be used to dump any tables in any schema.

$ python3 dump_partition.py -c"host=localhost" --schema=old_tables
DROP TABLE IF EXISTS"old_tables"."time_stuff_p20240409"
DROP TABLE IF EXISTS"old_tables"."time_stuff_p20240408

$ ls -l old*
-rw-rw-r-- 1 keith keith  168 Apr 12 18:17 old_tables.time_stuff_p20240408.hash
-rw-rw-r-- 1 keith keith 1410 Apr 12 18:17 old_tables.time_stuff_p20240408.pgdump
-rw-rw-r-- 1 keith keith  168 Apr 12 18:17 old_tables.time_stuff_p20240409.hash
-rw-rw-r-- 1 keith keith 1410 Apr 12 18:17 old_tables.time_stuff_p20240409.pgdump

By default it creates dump files in the custom dump format as well as providing
a SHA-512 hash of the dump file to provide long-term data integrity checks. This
backup option can either be run as part of a regularly scheduled script or as a
one off backup.

Summary

Keeping data that doesn’t need to actually exist inside the database is a key
part of keeping it running efficiently. Hopefully this has provided a guide to
using both basic and advanced retention management options available in
pg_partman.

PostgreSQL