Rodrigo Rosenfeld Rosas

Upgrading PostgreSQL from 9.6 to 10 with minimal downtime using pglogical

Fri, 10 Nov 2017 15:00:00 +0000

Once PostgreSQL 10 was released I wanted to upgrade our 9.6 cluster to the newest version. However, it would require a lot of coordination effort to get a maintenance window to perform the migration the way I was used to: put the application in maintenance mode, get a new dump and restore it to the new cluster and switch off the maintenance mode.

That means the application wouldn't be available for an hour or so, maybe more. After reading once more about pglogical, I decided to finally give it a try, which allowed me to switch from 9.6 to 10 in just a few seconds.

How it works - a higher level view

pglogical implements logical replication, which allows replicating databases among different versions, which is not possible with the binary replication mechanism provided by PostgreSQL itself. Well, PG 10 added some support to logical replication, but since we want to replicate from 9.6, we'd need to resort to some external extension.

A required condition from pglogical is that all tables being replicated must have a primary key. It doesn't need to be a single column, but a primary key must exist. Superuser access must also be provided for both databases for the replication agents. DDL replication is not supported. Truncate cascades are not replicated. Nothing fancy, after all. It should allow us to replicate most databases.

You should pay special attention to the primary key requirement though, specially if you're using the ActiveRecord Ruby gem to manage the database migrations in older databases as the schema_migrations table didn't have a primary key in the earlier days. If that's your case:

1
alter table schema_migrations add primary key (version);

The idea is to install a PostgreSQL package with support for the pglogical extension, then create the new PG 10 cluster and restore the schema only in the new cluster. The current cluster should be stopped and restarted using the pglogical-enabled installed PostgreSQL. The clusters should be reachable to it other through TCP/IP. You'll need to tell the provider (the 9.6 database being upgraded) the IP and port for the subscriber (the new PG 10 database) and vice-versa. The pglogical extension is created in both databases, postgresql.conf and pg_hba.conf are changed to enable logical replication and both databases are restarted. Finally, some pglogical statements are issued to create the provider, subscriber and subscription, which starts the replication. Once the replication is finished, you may change the port in the new cluster to match the old one, stop the old cluster and restart the new one. Finally it would be a good idea to restart the applications as well, specially if you're using some custom types such as row types, as they will most likely have different OIDs and if you have registered those row types it won't work as expected until you reboot the application. This would be the case if you're using DB.register_row_type using the Sequel Ruby gem, for example.

The final switch can happen in as quickly as a few seconds, which means minimal downtime.

How it works - hands on

We use Docker to run PostgreSQL in our servers (besides the apps), so this article also uses it to demonstrate how the process works, but it should be easy to apply the instructions to other kind of set-ups. The advantage of Docker as demonstration tool is that these procedures should be easy to replicate as is and it also takes care of creating and running the databases as well.

We assume the PostgreSQL client is installed in the host too for this article.

Prepare the images and start-up script

Create the following Dockerfiles in sub-directories pg96 and pg10 (look at the instructions inside the Dockerfiles in order to replicate in your own environment if you're not running PostgreSQL in a Docker container):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# pg96/Dockerfile
FROM postgres:9.6

RUN apt-get update && apt-get install -y wget gnupg
RUN echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ jessie-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list \
  && wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | apt-key add - \
  && apt-get update \
  && apt-get install -y postgresql-9.6-pglogical

RUN echo "host    replication          postgres                172.18.0.0/16   trust" >> /usr/share/postgresql/9.6/pg_hba.conf.sample
RUN echo "host    replication          postgres                ::1/128         trust" >> /usr/share/postgresql/9.6/pg_hba.conf.sample
RUN echo "shared_preload_libraries = 'pglogical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "wal_level = 'logical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_wal_senders = 20" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_replication_slots = 20" >> /usr/share/postgresql/postgresql.conf.sample
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# pg10/Dockerfile
FROM postgres:10

RUN rm /etc/apt/trusted.gpg && apt-get update && apt-get install -y wget
RUN echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ stretch-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list \
  && wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | apt-key add - \
  && apt-get update \
  && apt-get install -y postgresql-10-pglogical

RUN echo "host    replication          postgres                172.18.0.0/16   trust" >> /usr/share/postgresql/10/pg_hba.conf.sample
RUN echo "host    replication          postgres                ::1/128         trust" >> /usr/share/postgresql/10/pg_hba.conf.sample
RUN echo "shared_preload_libraries = 'pglogical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "wal_level = 'logical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_wal_senders = 20" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_replication_slots = 20" >> /usr/share/postgresql/postgresql.conf.sample

Let's assume both servers will run in the same machine with IP 10.0.1.10. The 9.6 instance is running on port 5432 and the new cluster will be running initially (before the switch) in port 5433.

1
2
cd pg96 && docker build . -t postgresql-pglogical:9.6 && cd -
cd pg10 && docker build . -t postgresql-pglogical:10 && cd -

This is not a tutorial on Docker, but if you're actually using Docker, it would be a good idea to push those images to your private registry.

The first step is to stop the old 9.6 cluster and start the pglogical enabled cluster with the old data (taking a backup before is always a good idea by the way). Suppose your cluster data is located at "/var/lib/postgresql/9.6/main/" and that your config files are located at "/etc/postgresql/9.6/main/". If "/etc/postgresql/9.6" and "/var/lib/postgresql/9.6" do not exist, don't worry, the script will create a new cluster for you (in case you want to try with new dbs, first, which is a good idea by the way, and map some temp directories).

Create the following script at "/sbin/pg-scripts/start-pg" and make it executable. It will run the database from the container.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#!/bin/bash
version=$1
net=$2
setup_db(){
  pg_createcluster $version main -o listen_addresses='*' -o wal_level=logical \
        -o max_wal_senders=10 -o max_worker_processes=10 -o max_replication_slots=10 \
        -o hot_standby=on -o max_wal_senders=10 -o shared_preload_libraries=pglogical -- -A trust
  pghba=/etc/postgresql/$version/main/pg_hba.conf
  echo -e "host\tall\tappuser\t$net\ttrust" >> $pghba
  echo -e "host\treplication\tappuser\t$net\ttrust" >> $pghba
  echo -e "host\tall\tpostgres\t172.17.0.0/24\ttrust" >> $pghba
  echo -e "host\treplication\tpostgres\t172.17.0.0/24\ttrust" >> $pghba
  pg_ctlcluster $version main start
  psql -U postgres -c '\du' postgres|grep -q appuser || createuser -U postgres -l -s appuser
  pg_ctlcluster $version main stop
}
[ -d /var/lib/postgresql/$version/main ] || setup_db
exec pg_ctlcluster --foreground $version main start

This script will take care of creating a new cluster if one doesn't already exist. Although not really required for the replication to work, it also takes care of creating a new "appuser" database superuser authenticated with "trust" for simplicity sake. It might be useful if you decide to use this script for spawning new databases for testing purposes. Adapt the script to suite your needs in that case, changing the user name or the authentication methods.

Run the containers

Let's run the 9.6 cluster in port 5432 (feel free to run it in another port and use a temporary directory in the mappings if you just want to give it a try):

1
2
3
4
5
6
7
docker run --rm -v /sbin/pg-scripts:/pg-scripts -v /var/lib/postgresql:/var/lib/postgresql \
    -v /etc/postgresql:/etc/postgresql -p 5432:5432 postgres-pglogical:9.6 \
    /pg-scripts/start-pg 9.6 10.0.1.0/24
# since we're running in the foreground with the --rm option, run this in another terminal:
docker run --rm -v /sbin/pg-scripts:/pg-scripts -v /var/lib/postgresql:/var/lib/postgresql \
    -v /etc/postgresql:/etc/postgresql -p 5433:5432 postgres-pglogical:10 \
    /pg-scripts/start-pg 10 10.0.1.0/24

The first argument to start-pg is the PG version and the second and last argument is the net used to create pg_hba.conf if it doesn't exist, to allow "appuser" to connect from using the "trust" authentication method.

If you're curious about how to run a Docker container as a systemd service, let me know in the comments section below and I may complement this article once I find some time, but it's not hard. There are plenty of documents explaining that in the internet, but our own service unit file is a bit different from what I've seen in most tutorials, as it tries to check that the port is indeed accepting connections when starting the service and it doesn't pull the image from the registry if it is available locally already.

Edit PostgreSQL configuration

Once you make sure the old cluster is running file with the postgresql-pglogical container, it's time to update your postgresql.conf file and restart the container. Use the following configuration as a start-point for both 9.6 and 10 clusters:

1
2
3
4
5
wal_level = logical
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'

For pg_hba.conf, include the following lines (change the network settings if you're not using Docker, or if you're running the containers in another net than the default one):

1
2
host    all     postgres        172.17.0.0/24   trust
host    replication     postgres        172.17.0.0/24   trust

Restart the servers and we should be ready for starting the replication.

Replicating the database

Set up the provider

In the PG 9.6 database:

1
2
3
4
5
6
7
8
# take a dump from the schema that we'll use to restore in PG 10
pg_dump -Fc -s -h 10.0.1.10 -p 5432 -U appuser mydb > mydb-schema.dump
psql -h 10.0.1.10 -p 5432 -c 'create extension pglogical;' -U appuser mydb
psql -h 10.0.1.10 -p 5432 -c "select pglogical.create_node(node_name := 'provider', dsn := 'host=10.0.1.10 port=5432 dbname=mydb');" -U appuser mydb
psql -h 10.0.1.10 -p 5432 -c "select pglogical.replication_set_add_all_tables('default', ARRAY['public']);" -U appuser mydb

# I couldn't get sequences replication to work, so I'll suggest another method just before switching the database
# psql -h 10.0.1.10 -p 5432 -c "select pglogical.replication_set_add_all_sequences('default', ARRAY['public']);" -U appuser mydb

This mark all tables and sequences from the public schema to be replicated.

Set up the subscriber and subscription

In the PG 10 database:

1
2
3
4
5
6
7
# create and restore the schema of the database
createdb -U appuser -h 10.0.1.10 -p 5433 mydb
pg_restore -s -h 10.0.1.10 -p 5433 -U appuser -d mydb mydb-schema.dump
# install the pglogical extension and setup the subscriber and subscription
psql -h 10.0.1.10 -p 5433 -c 'create extension pglogical;' -U appuser mydb
psql -h 10.0.1.10 -p 5433 -c "select pglogical.create_node(node_name := 'subscriber', dsn := 'host=10.0.1.10 port=5433 dbname=mydb');" -U appuser mydb
psql -h 10.0.1.10 -p 5433 -c "select pglogical.create_subscription(subscription_name := 'subscription', provider_dsn := 'host=10.0.1.10 port=5432 dbname=mydb');" -U appuser mydb

From now on you can follow the status of the replication with

1
select pglogical.show_subscription_status('subscription');

Once the initialization is over and the databases are synced and replicating (this may take quite a while depending on your database size) you may start the switch.

Replicating the sequence values

At this point the replication database is almost all set. I couldn't figure out how to replicate the sequence values, so, if you're using serial integer primary key columns relying on sequences, then you'll also want to set proper values to the sequences otherwise you won't be able to insert new records while relying on the serial sequence next value. Here's how you can do that. Just to be sure, it's inserting a 5000 gap so that you have enough time to stop the old server after gererating the set-value statements in case your database is very write intensive. You should probably review that gap value depending on how quickly your database might grow up between running those scripts and stopping the server.

1
2
3
4
psql -h 10.0.1.10 -p 5432 -U appuser -c "select string_agg('select ''select setval(''''' || relname || ''''', '' || last_value + 5000 || '')'' from ' || relname, ' union ' order by relname) from pg_class where relkind ='S';" -t -q -o set-sequences-values-generator.sql mydb
psql -h 10.0.1.10 -p 5432 -U appuser -t -q -f set-sequences-values-generator.sql -o set-sequences-values.sql mydb
# set the new sequence values in the new database (port 5433 in this example):
psql -h 10.0.1.10 -p 5433 -U appuser -f set-sequences-values.sql mydb

Final switch steps

Then, basically, you should change the port for the PG10 cluster and set it to 5432 (or whatever was the port the old cluster was using). Then stop the 9.6 cluster (Ctrl+C in the example above) and restart the new cluster. Finally, it's a good idea to also restart the apps using the database, just in case they are relying on some custom types whose conversion rules would depend on the row type OID.

This assumes your apps are able to gracefully handle disconnections for the connections in the pool by using some connection validation before issuing any SQL statements. Otherwise, it's probably a good idea to restart the apps whenever you restart the database after tweaking "postgresql.conf" and "pg_hba.conf".

Clean-up

Once everything is running fine with the new database, you might want to clean things up. If that's the case:

1
2
3
select pglogical.drop_subscription('subscription');
select pglogical.drop_node('subscriber');
drop extension pglogical;

I hope that helps you getting your database upgraded with minimal downtime.

comments powered byDisqus