I’m setting up new postgres servers for Atuin! We’re going with a hot replica this time, and making things much more reliable. Atuin has had no outages or database issues in a couple of years, but I don’t want to push my luck.
You might also be interested in the hetzner k3s setup I did for the Atuin api images.
I’m going to be doing a fairly minimal setup to begin with, and tune things later. I’m not sure which options will best suit my workload, so will keep things simple.
Note that the Atuin queries are not complicated. We mostly just store a pretty high volume of data, and read sequentially (ish). There are minimal joins, and minimal complex queries.
We will often have bursts where 10s/100s of thousands of rows need to be written or read as quickly as possible - however this is also pretty sequential and not at all “complex”.
Ideally, we will compress the data as much as possible. While Atuin mostly stores encrypted data (which does not compress well), there is a decent amount of unencrypted data in the form of JSON padding + timestamps etc. Compress them! This will nicely reduce disk usage, but also IO - at the cost of some CPU.
As our queries are simple, the CPU cost is fine.
I’m running this on a couple of hetzner machines I bought in an auction. They have Ryzen CPUs and 4x 1TB NVMe SSDs.
Two of the SSDs will be running a simple RAID mirror, and will store the OS and logs. The other two will have my ZFS filesystem + postgres on them.
This costs me 50% of my storage in mirroring; however, as this is bare metal hardware, there’s a chance that a disk could fail. I’d like to ensure my database can continue running until I fail over to the replica + a technician can replace the disk.
```
# Install zfs
apt install zfsutils-linux

# check version
zfs version
```
Check your disk layout with `lsblk`:
```
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme1n1       259:0    0 953.9G  0 disk
├─nvme1n1p1   259:1    0    32G  0 part
│ └─md0         9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2   259:2    0     1G  0 part
│ └─md1         9:1    0  1022M  0 raid1 /boot
├─nvme1n1p3   259:3    0   128G  0 part
│ └─md2         9:2    0 127.9G  0 raid1 /var
├─nvme1n1p4   259:4    0     1K  0 part
└─nvme1n1p5   259:5    0 792.9G  0 part
  └─md3         9:3    0 792.7G  0 raid1 /
nvme0n1       259:6    0 953.9G  0 disk
nvme3n1       259:7    0 953.9G  0 disk
nvme2n1       259:8    0 953.9G  0 disk
├─nvme2n1p1   259:9    0    32G  0 part
│ └─md0         9:0    0    32G  0 raid1 [SWAP]
├─nvme2n1p2   259:10   0     1G  0 part
│ └─md1         9:1    0  1022M  0 raid1 /boot
├─nvme2n1p3   259:11   0   128G  0 part
│ └─md2         9:2    0 127.9G  0 raid1 /var
├─nvme2n1p4   259:12   0     1K  0 part
└─nvme2n1p5   259:13   0 792.9G  0 part
  └─md3         9:3    0 792.7G  0 raid1 /
```
`/dev/nvme1n1` and `/dev/nvme2n1` are both in use for my OS, but `/dev/nvme0n1` and `/dev/nvme3n1` are left over for my zfs mirror.
Create the mirror:
```
zpool create -o ashift=12 postgres mirror /dev/nvme0n1 /dev/nvme3n1
```
It returned nice and fast!
Checking again with `lsblk` shows some usage:
```
nvme0n1       259:6    0 953.9G  0 disk
├─nvme0n1p1   259:14   0 953.9G  0 part
└─nvme0n1p9   259:15   0     8M  0 part
nvme3n1       259:7    0 953.9G  0 disk
├─nvme3n1p1   259:18   0 953.9G  0 part
└─nvme3n1p9   259:19   0     8M  0 part
```
And `zpool status` reports a healthy mirror:

```
  pool: postgres
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        postgres     ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0

errors: No known data errors
```
The pool is mounted by default at `/postgres`.
Before we can set up datasets, we need to set up postgres. Postgres needs to run its init before we move it to zfs. Annoying, but ok.
Ubuntu 22.04 has a pretty old version in its repos, so install from the postgres apt repository instead:
```
sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
apt update
```
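`apt-key` is deprecated on Ubuntu 22.04, so if you want to avoid the warning, a roughly equivalent setup looks like this (same repository and key URL; the keyring path is my choice, not from the original):

```
# store the PGDG key in a dedicated keyring instead of using apt-key (path is an assumption)
mkdir -p /etc/apt/keyrings
wget -qO /etc/apt/keyrings/pgdg.asc https://www.postgresql.org/media/keys/ACCC4CF8.asc
echo "deb [signed-by=/etc/apt/keyrings/pgdg.asc] http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list
apt update
```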
```
apt install postgresql-16 postgresql-contrib-16
systemctl enable postgresql
systemctl start postgresql
```
Then, stop postgres. Move its data to a temp location, create the datasets, and move things back again. You could probably rsync instead of mv/cp to keep permissions intact.
```
systemctl stop postgresql

# move postgres data to temp
mv /var/lib/postgresql/16/main/pg_wal /tmp/pg_wal
mv /var/lib/postgresql /tmp/postgresql

# create the datasets
zfs create postgres/data -o mountpoint=/var/lib/postgresql
zfs create postgres/wal -o mountpoint=/var/lib/postgresql/16/main/pg_wal

# switcharoo the data back
cp -r /tmp/postgresql/* /var/lib/postgresql
cp -r /tmp/pg_wal/* /var/lib/postgresql/16/main/pg_wal

# sort perms
chmod -R 0700 /var/lib/postgresql
chmod -R 0700 /var/lib/postgresql/16/main/pg_wal
chown -R postgres: /var/lib/postgresql

# start postgres once more
systemctl start postgresql
```
Check all is ok:
```
zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
postgres             600K   922G    24K  /postgres
postgres/db           72K   922G    24K  /postgres/db
postgres/db/base      24K   922G    24K  /postgres/db/base
postgres/db/pg_wal    24K   922G    24K  /postgres/db/pg_wal
```
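It’s also worth confirming that postgres actually came back up on the ZFS-backed directory. A quick sanity check (assuming the Debian-default data directory layout used above) could be:

```
# the postgres data directory should now be backed by a ZFS dataset
df -h /var/lib/postgresql

# and postgres should be running, pointing at the expected data directory
sudo -u postgres psql -c 'SHOW data_directory;'
```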
I’ve read a bunch of articles on running postgres on ZFS. One of them found that some nvme hardware was reporting a successful write when using `fdatasync`, without actually persisting the data: many drives report a successful write once the data has reached their volatile write cache, before it is actually stored. Disable the write cache with:
```
apt install nvme-cli
nvme set-feature -f 6 -v 0 /dev/nvme0n1
nvme set-feature -f 6 -v 0 /dev/nvme3n1
```
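To verify the change took, you can read the feature back; it should now report the volatile write cache (feature 0x6) as disabled:

```
# current value of the volatile write cache feature (0 = disabled)
nvme get-feature -f 6 /dev/nvme0n1
nvme get-feature -f 6 /dev/nvme3n1
```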
I spent a while thinking about the optimal `recordsize`. While I’d get a higher tps running recordsize = postgres block size = 8kb, Atuin is largely sequential + reads/writes large amounts of data. I could change the postgres block size (it’s a compile-time option), but I’ll consider trying that in the future.
To begin with, I’ll try the default of 128k and see how it goes. Lower numbers are potentially faster, but higher numbers tend to get better compression. It depends on a lot of factors though, so I’ll measure things and see.
```
# enable compression
zfs set compression=zstd-3 postgres

# disable access time (so, so many writes...)
zfs set atime=off postgres

# enable improved extended attributes
zfs set xattr=sa postgres

# zfs set recordsize=16k postgres
```
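Once some real data has landed on the pool, it’s easy to check whether the zstd compression is actually paying off:

```
# overall compression ratio for the pool
zfs get compressratio postgres

# on-disk usage vs logical (uncompressed) size per dataset
zfs list -o name,used,logicalused,compressratio
```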
I then set `full_page_writes = off` on the postgres side - zfs cannot write partial pages, so full page writes are pretty redundant.
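For reference, that’s a one-line change in `postgresql.conf` (with the PGDG packages on Ubuntu this usually lives at `/etc/postgresql/16/main/postgresql.conf`):

```
# safe on ZFS: copy-on-write records can't be torn by a partial write
full_page_writes = off
```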
Next up, we’ll give the ARC (ZFS page cache) 75% of the system memory. The remainder will be used by postgres shared_buffers.
```
echo 51539607552 >> /sys/module/zfs/parameters/zfs_arc_max
```
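That magic number is just 48 GiB expressed in bytes - on a 64GB box, that’s the 75% mentioned above:

```
# 48 * 1024^3 = 51539607552 bytes
echo $((48 * 1024 * 1024 * 1024))
```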
To persist across reboots, set this as a zfs module option - on Ubuntu that means a line in a file under `/etc/modprobe.d/` (e.g. `zfs.conf`):
```
options zfs zfs_arc_max=51539607552
```
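If the zfs module gets loaded from the initramfs (it isn’t strictly needed here since the root filesystem is on mdadm, and this is harmless otherwise), rebuild it so the option applies at boot:

```
update-initramfs -u
```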
Not specific to ZFS, but some general postgres tuning:
```
# 25% of 64GB
shared_buffers = 16GB
work_mem = 8MB

# make vacuums/etc faster
maintenance_work_mem = 1GB

# tell the planner how much the ZFS ARC will likely cache
effective_cache_size = 48GB
```
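Note that `shared_buffers` only takes effect after a restart (the other settings can be picked up with a reload):

```
systemctl restart postgresql
```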
There’s a bunch more tuning we can do, but realistically this will help the most. Postgres has a pretty tiny amount of memory configured by default!
At this point, I rebooted to ensure everything was OK + persisted properly.
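A quick post-reboot checklist, covering just the settings configured above:

```
# ARC limit survived the reboot
cat /sys/module/zfs/parameters/zfs_arc_max

# pool imported cleanly and properties are still set
zpool status postgres
zfs get compression,atime,xattr postgres

# postgres came back up
systemctl status postgresql
```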
Chuck load at it
I’d like to ensure my system performs at least mostly ok, so I ran pgbench and got:
```
scaling factor: 50
query mode: simple
number of clients: 20
number of threads: 4
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 2000000/2000000
number of failed transactions: 0 (0.000%)
latency average = 2.863 ms
initial connection time = 10.287 ms
tps = 6984.804317 (without initial connection time)
```
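The exact invocation isn’t shown above, but a run matching those parameters (scale 50, 20 clients, 4 threads, 100k transactions per client; the database name is my own placeholder) would look roughly like:

```
# initialise a pgbench database at scale factor 50 ("bench" is a hypothetical name)
sudo -u postgres createdb bench
sudo -u postgres pgbench -i -s 50 bench

# 20 clients, 4 threads, 100,000 transactions per client
sudo -u postgres pgbench -c 20 -j 4 -t 100000 bench
```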
Not bad! Not that representative of my workload, but at least an indication it’s not totally broken. I’ll tune this some more in the future.
Next I’ll be configuring backups with pgbackrest, and setting up a hot standby we can fail over to should things go wrong!
Then copying the dataset across, and making this new database production 🚀