High availability and disaster recovery with Docker and Postgres part I

In this series of articles I'll look at how to run a dockerized postgresql cluster with high availability and disaster recovery in mind.

The basic strategy outlined here will allow for asynchronous streaming replication and automatic failover of the master db.

For a quick refresher on how postgres does high availability, go here.

The ingredients needed to build this stack are postgres, patroni (a python library from Zalando), etcd, haproxy, and of course docker.

Some of the topics that will be covered include:

  • Streaming replication.
  • High availability.
  • Automatic failover.
  • Load balancing read/writes.
  • Logical and physical backup strategies.
  • Automated testing of backups.

To understand everything we're going to cover here, you should already have some experience working with docker and postgres.

You can find the code for all the examples described in these articles here.

The Stack

Part I of this guide will be based on the configuration file I use for testing and development - docker-stack.test.yml.

A note on the configuration file format

The docker stack api and the docker-compose library use the same yaml configuration standard, although not every option in the specification is supported by both tools.

Please note I will be using the docker stack version 3.x api exclusively in this article. You can browse the full specification here.

The file name "docker-stack.test.yml" is an arbitrary choice as unlike the docker-compose library, the docker stack api does not have defined behaviour based on naming conventions.
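
For instance, the file is always passed explicitly on deployment, so any name will do (the stack name "pg-cluster" below is just an illustrative choice):

# deploy (or update) a stack from an arbitrarily named file
$ docker stack deploy -c docker-stack.test.yml pg-cluster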

The configuration file is written in a declarative style that I think is quite intuitive, even for beginners.

The basic format is as follows:

version: '3.3'
services:
    server:
    app:
    db:
volumes:
    data:
secrets:
    cert:
configs:
    nginx.conf:
networks:
    stack:

This type of configuration is designed to be deployed on a cluster of host machines referred to collectively as a swarm. The nodes that comprise the swarm are physical or virtual machines running docker. You can find more detailed information about swarms and nodes here.
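
For reference, bootstrapping a swarm and joining nodes looks roughly like this (the token and manager address below are placeholders):

# on the first machine
$ docker swarm init

# on each additional machine, using the join token printed above
$ docker swarm join --token <worker-token> <manager-ip>:2377

# list the nodes from a manager
$ docker node ls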

Services can be thought of as micro-services, and are intended to be idempotent in nature and horizontally scalable.

Volumes are docker's data persistence layer and are mounted into services as needed.
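
As a minimal, hypothetical snippet (not from this stack's file), a named volume is declared at the top level and mounted into a service like so:

services:
    db:
        image: postgres:alpine
        volumes:
            - data:/var/lib/postgresql/data
volumes:
    data: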

The rest of the configuration api will be touched on as I walk through the test stack's configuration file, starting with the external entrypoint into the stack, the haproxy service.

The Haproxy service

haproxy:
    image: haproxy:alpine
    ports:
        - "5000:5000"
        - "5001:5001"
        - "8008:8008"
    configs:
      - haproxy_cfg
    networks:
      - dbs

The haproxy service routes incoming requests from external sources to the master and replica database nodes.

Each haproxy task (i.e. a running container that is an instance of the service) is created from a lightweight alpine linux image.

A service defaults to a single replica for the whole stack, which is the case here. Services can be scaled using the cli or declaratively in the configuration file:

# service with multiple replicas specified
  redis:
    image: redis:alpine
    deploy:
      replicas: 5
      restart_policy:
        condition: on-failure

Service tasks can be constrained to particular hosts or be allocated randomly which is the default behaviour.

# mysql service will only be deployed on hosts/nodes labeled 'db'
  mysql:
    image: mysql:5.7
    deploy:
      placement:
        constraints:
          - node.labels.type == db

For more information on service configuration check out the service api docs.

Ports:

The following ports are exposed on the host and forwarded to running haproxy containers:

  • 5000: the master db
  • 5001: the replica dbs
  • 8008: the internal api (used by patroni)

This means external requests can be made to port 5000 for db read/writes, port 5001 for db reads (the replicas are hot-standbys) and port 8008 for access to the master server’s patroni api.
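
For example, assuming a client that can reach any swarm node and the credentials defined later in the env file, connections could look something like this:

# read/write connection to the current master (via haproxy on port 5000)
$ psql -h <any-swarm-node-ip> -p 5000 -U admin postgres

# read-only connection to one of the hot-standby replicas (port 5001)
$ psql -h <any-swarm-node-ip> -p 5001 -U admin postgres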

The docker swarm load balancer

Docker will expose ports 5000, 5001 and 8008 on all nodes regardless of whether there is a haproxy service running on that node or not. The internal docker swarm load balancer will route the request to a node with a haproxy task running on it.

If for example the db-1 node is running an instance of haproxy but the db-2 node is not, a query to db-2-ip-address:8008/config will still be routed to the haproxy service on db-1 via Docker swarm's internal networking mesh.
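
A quick way to see the routing mesh in action (assuming you can reach the nodes' addresses) is to hit the published api port on a node that isn't running haproxy:

# answered even though this node runs no haproxy task; the mesh forwards it
$ curl -I http://<db-2-ip-address>:8008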

Docker configs and secrets:

The docker swarm secrets api exposes key-value blobs to targeted containers by mounting a secret-value at /run/secrets/secret-key (or at a path of your choice) at runtime. All secrets are encrypted and may be strings or binary content up to 500 kb in size.

Configs use more or less the same api as secrets, making it pretty handy to load secret values and library configurations on the fly when working with replicated services on different nodes.

In this particular case, the haproxy service is declaring that it should have access to the haproxy_cfg configuration file at the default path /haproxy_cfg.
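
If you wanted the file mounted somewhere other than the default path, the long form of the configs entry lets you set a target. A sketch of that syntax (this stack sticks with the default):

# hypothetical long syntax: mount the config at haproxy's usual location
haproxy:
    image: haproxy:alpine
    configs:
      - source: haproxy_cfg
        target: /usr/local/etc/haproxy/haproxy.cfg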

The actual path to the config file in the repo itself is defined later in the yaml file under the "configs" key:

# docker-stack.test.yml#83
configs:
  haproxy_cfg:
    file: config/haproxy.cfg

# used later to configure patroni services
secrets:
  patroni.yml:
    file: patroni.test.yml

You can read more about the configs api and study more complex examples here.

Haproxy configuration file:

# haproxy_cfg (shortened for brevity)
frontend master_postgresql
	bind *:5000
	default_backend backend_master

frontend replicas_postgresql
	bind *:5001
	default_backend backend_replicas

frontend patroni_api
	bind *:8008
	default_backend backend_api

backend backend_master
	option httpchk OPTIONS /master
	server dbnode1 dbnode1:5432 maxconn 100 check port 8008
	server dbnode2 dbnode2:5432 maxconn 100 check port 8008
	server dbnode3 dbnode3:5432 maxconn 100 check port 8008

Three frontend endpoints are exposed for traffic to the master, replicas and api respectively.

The backends are responsible for routing requests to each dbnode using patroni's health check endpoints (GET /master and GET /replica).

If a node responds with 200 ok, the request is routed there.
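
You can run the same checks by hand from inside any container on the dbs network; something along these lines, since patroni's monitoring endpoints answer plain GETs:

# 200 if dbnode1 currently holds the leader lock, 503 otherwise
$ curl -I http://dbnode1:8008/master

# 200 if dbnode1 is running as a replica
$ curl -I http://dbnode1:8008/replica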

Internal DNS and service discovery

Docker swarm mode's internal DNS service assigns hostnames to each service equivalent to its name in the configuration yaml file - in this case haproxy, etcd, dbnode1, dbnode2 and dbnode3.

All stack services can discover each other by means of a default overlay network or by other configured networks.

Hence the "hosts" dbnode1, dbnode2, dbnode3 and etcd are reachable from within a running haproxy container or indeed from any other container in any other node in the stack.

To test this, run an interactive terminal in any running container and curl or ping another service using its name as the host value.

    # dbnode1 can reach dbnode2 via the swarm's overlay network
    $ docker exec -ti $(docker ps -qf name=dbnode1) bash
    > curl -I dbnode2:8008

Network configuration

Although I could have used the default overlay network, I've declared an overlay network called "dbs" which explicitly defines a closed network between each service on this stack that references it.

The network itself must be defined under the top-level networks key:

# docker-stack.test.yml#72
networks:
  dbs:
    external: true

Other stacks can also connect to closed networks as long as they mark them "external: true" in their own configuration files.

For example if I had an admin stack, its yaml file could declare "pg_cluster_dbs" as an external network like so:

# my-admin-app.stack.yml
services:
    pg_admin:
        networks:
            - pg_cluster_dbs
            - admin

networks:
    pg_cluster_dbs:
        external: true
    admin:

The master server can now be added to pgAdmin using "haproxy" as the hostname and "5000" as the port.

It's also possible to connect a standalone container to a stack when running a task. For an example of how to do this refer to the run_tests.sh script.
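
For instance, assuming the dbs overlay was created as attachable, a one-off container can join it and talk to the stack directly:

# hypothetical one-off psql session on the stack's network
$ docker run --rm -it --network dbs postgres:alpine \
    psql -h haproxy -p 5000 -U admin postgres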

You can read more about networking in docker swarm mode in more detail here.

Security

By default the nodes encrypt and authenticate information they exchange via gossip using the AES algorithm.

You can read more about swarm mode's networking and security here.

The Etcd Service

Etcd is a distributed key-value store that implements the raft algorithm (docker swarm mode uses etcd’s raft implementation under the hood also) to form consensus among a cluster of nodes.

etcd:
    image: quay.io/coreos/etcd:v3.1.2
    configs:
      - etcd_cfg
    networks: 
      - dbs
    command: /bin/sh /etcd_cfg

In the context of our postgres cluster we use etcd to:

  • synchronize the current state of the stack if conflicting values exist;
  • agree on a value for the leader of the cluster (the master database);
  • store other configuration information common to all nodes.

The leader election logic is governed by patroni, a python script which runs in an infinite loop. I will cover patroni in more depth in the next section.

Etcd has an http api and a cli (etcdctl) for querying the store. Patroni uses the api to obtain the current state of the cluster and its members.

# Etcdctl example - first open a shell in the etcd container
$ docker exec -ti $(docker ps -qf "name=etcd") ash

# List configuration data for the cluster
$ etcdctl ls service/pg-cluster
> /service/pg-cluster/members
> /service/pg-cluster/initialize
> /service/pg-cluster/config
> /service/pg-cluster/leader
> /service/pg-cluster/optime

# Get the contents of the config key
$ etcdctl get service/pg-cluster/config
> {"postgresql":{"use_pg_rewind":true}}

If you are unfamiliar with etcd this article provides a nice introduction to its core features.

Generally etcd is designed to be run in a cluster, which makes it more resilient to failure, but for simplicity's sake we are running only one instance here.

Configs

The config mount is a simple shell command that starts etcd with the minimum configuration needed for it to communicate with the running patroni clients:

etcd \
-advertise-client-urls http://etcd:2379 \
-listen-client-urls http://0.0.0.0:2379

Full information on configuring etcd and using the etcd cli can be found here.

The Patroni Services

dbnode1:
    image: seocahill/patroni:1.2.5
    secrets: 
      - patroni.yml
    environment:
      - PATRONI_NAME=dbnode1
      - PATRONI_POSTGRESQL_DATA_DIR=data/dbnode1
      - PATRONI_POSTGRESQL_CONNECT_ADDRESS=dbnode1:5432
      - PATRONI_RESTAPI_CONNECT_ADDRESS=dbnode1:8008
    env_file:
      - test.env
    networks:
      - dbs
    entrypoint: patroni
    command: /run/secrets/patroni.yml

The Docker Image

All the nodes use my own patroni image, which is derived from the official postgres image. You can find the source here.

The Dockerfile installs patroni, its various dependencies, some utilities and does some other minor setup work.

Wal-e, a library for shipping wal logs to remote storage buckets, is also installed.

Environment variables and configuration

Docker allows you to declare environment variables in different ways and with different degrees of granularity.

Environment variables specific to each node have been added inline and an env file that contains common settings is also referenced.

Patroni can be configured via the rest api, via environment variables, or by passing a yaml configuration file on initialization.

# sample patroni .env file

# the name of the cluster
PATRONI_SCOPE=pg-cluster

# create an admin user on pg init
PATRONI_admin_OPTIONS=createdb, createrole
PATRONI_admin_PASSWORD=admin

# host and port of etcd service
PATRONI_ETCD_HOST=etcd:2379

# location of password file
PATRONI_POSTGRESQL_PGPASS=/home/postgres/.pgpass

# address patroni will use to connect to local server
PATRONI_POSTGRESQL_LISTEN=0.0.0.0:5432

# replication user and password
PATRONI_REPLICATION_PASSWORD=abcd
PATRONI_REPLICATION_USERNAME=replicator

# address patroni uses to receive incoming api calls
PATRONI_RESTAPI_LISTEN=0.0.0.0:8008

# api basic auth
PATRONI_RESTAPI_PASSWORD=admin
PATRONI_RESTAPI_USERNAME=admin

# patroni needs a superuser to administer postgres
PATRONI_SUPERUSER_PASSWORD=postgres
PATRONI_SUPERUSER_USERNAME=postgres

The complete list of patroni configuration options can be found here.

Running patroni

The patroni command initiates or restarts a patroni instance and takes one argument, a yaml configuration file.

The configuration file supports an extensive array of options. You can view an annotated example here.

Something to bear in mind is that this method of configuration only works on initialization and not on server restart. Use the api to change the configuration of a running patroni instance by sending a PATCH request to the config endpoint:

$ curl -s -XPATCH -u admin:admin -d \
    '{"postgresql":{"parameters":{"max_connections":"101"}}}' \
    http://localhost:8008/config

Replication setup

In order to allow a hot standby to connect to the master and replicate asynchronously, a custom pg_hba configuration must be specified in the patroni configuration yaml file.

# patroni.yml
bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
  pg_hba:
    - host all all 0.0.0.0/0 md5
    - host replication replicator 10.0.0.0/24 md5

The bootstrap key here indicates that this configuration is applied only on initialization.

The dcs key stands for "distributed configuration store"; these are the settings stored centrally in etcd that are common to all patroni instances.

The ip range for the replication user should encompass the ip range defined on the network that connects the dbnodes, in this case the 'dbs' network.

By default docker swarm mode uses the 10.0.0.0/24 ip range but you can use the command docker network inspect dbs to debug the dbs network settings if in doubt.
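
For example, to print just the assigned subnet (run on a manager node):

# show the subnet(s) of the dbs overlay network
$ docker network inspect dbs --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'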

It is also possible to create an overlay network with a specified gateway and subnet by defining it ahead of time in the docker configuration yaml file under the "networks" key e.g.

networks:
  dbs:
    ipam:
      config:
        - subnet: 10.0.0.0/24

How patroni works

Patroni consists of a run loop that sends requests to etcd and performs various tasks such as:

  • bootstrapping a new cluster (if none exists);
  • initializing a postgres server instance on the current node;
  • selecting a leader for the cluster;
  • carrying out a failover procedure if the leader goes down;
  • querying and updating the stack state via etcd with respect to a fixed ttl (30 seconds by default).

Patroni has a rest api that can be used to query and update configurations and has a cli which can show the current state of the cluster, pause and remove nodes, as well as initiate a planned failover or ‘switchover’.
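
A couple of illustrative patronictl invocations, run from inside one of the dbnode containers and assuming the same config file that was passed to patroni above (patronictl ships with the patroni package):

# open a shell in one of the patroni containers
$ docker exec -ti $(docker ps -qf name=dbnode1) bash

# show the members of the cluster and which one is the leader
> patronictl -c /run/secrets/patroni.yml list pg-cluster

# trigger a manual failover to another member (prompts for confirmation)
> patronictl -c /run/secrets/patroni.yml failover pg-cluster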

Patroni also includes ancillary modules that provide various helpful utilities such as the health check scripts utilized by haproxy to direct external requests.

Understanding Patroni

The patroni codebase is under fairly heavy development and you may find it a bit hard to figure out what's going on at first.

My recommendation is to open up the original governor repo in one tab and the high availability decision loop diagram in another, and to walk through it. The code is clean and easy to understand and it will give you a good frame of reference before diving into patroni proper.

This blogpost also provides a nice quick overview of the original design choices and motivation for governor.

Wrap up and part II

Hopefully by now you have a good idea of the libraries and services that make up the stack and their respective roles within it.

Next up we'll provision some hosts and deploy our highly available postgres cluster.