Building highly available and performant applications often requires shared storage accessible by multiple VMs. Traditionally, shared storage has been made available to VMs in a single zone and was not resilient to a zone failure. Google Cloud’s Hyperdisk Balanced HA multi-writer technology solves this problem by combining cross-zone replication with multi-writer access from VMs across two zones.
This solution opens up new possibilities for applications requiring both resilience and performance, including:
- Host-based clusters: High availability designs that typically use Linux Pacemaker or Windows Server Failover Clustering to assign and move resources among nodes in the cluster. With Hyperdisk Balanced HA multi-writer, those nodes can now be in two different zones and use Persistent Reservations to control access to shared-disk resources.
- Shared-disk filesystems: Shared storage designs supported by filesystems such as GFS2, OCFS2, and VMFS that manage access to shared-disk resources using clustering and/or locking capabilities for data coherence across nodes. These are often combined with host-based clusters.
If your shared storage needs can be met using NFS or SMB protocols, I highly recommend using Filestore (single zone, or HA across three zones) or NetApp Volumes (single zone, or HA across two zones). Both provide check-box simplicity for shared storage requirements in a zone or across zones. But today’s blog is about Hyperdisk multi-writer, so we’ll take the harder path and build something ourselves!
Building a GFS2 Shared Filesystem with Ubuntu
Solutions that use shared disks tend to be complex, so to focus on the core concepts I decided to build a basic GFS2 shared-disk filesystem across two VMs in different zones. With this solution an application running on each of those VMs can read and write data on the same filesystem concurrently, and if a VM (or zone) fails the data remains available in the other. Filesystem coherency is managed by a cluster-wide locking scheme that uses DLM and Corosync to exchange locking information. I added a VM in a third zone to make the locking services resilient.
The infrastructure can be visualized as:
Step-by-Step Instructions
WARNING: These instructions are useful to prove the concept and are NOT suitable for a production deployment. A production deployment would secure Corosync, likely add Pacemaker cluster software for coordinated resource management and fencing to increase availability and fault tolerance, use LVM, and include tuning of the DLM and GFS2 settings for the desired use case.
Prepare the environment and create cloud resources
Assumptions:
- You are a project owner
- Avoid setting public IP addresses on resources; instead:
  - Configure a Public NAT in the region so VMs can pull packages from the Internet
  - Enable IAP for TCP forwarding and create a firewall rule allowing IAP connections to your VPC (a sketch of both is shown after this list)
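For reference, here is a minimal sketch of that networking setup, assuming the `default` VPC and hypothetical resource names (`nat-router`, `nat-config`, `allow-iap-ingress`), and using the PROJECT and REGION variables set below; adapt it to your environment:

```
# Cloud Router + Cloud NAT so VMs without public IPs can download packages
gcloud compute routers create nat-router \
  --project=$PROJECT --region=$REGION --network=default

gcloud compute routers nats create nat-config \
  --project=$PROJECT --region=$REGION --router=nat-router \
  --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges

# Allow IAP TCP forwarding (Google's published IAP range) to reach SSH on the VMs
gcloud compute firewall-rules create allow-iap-ingress \
  --project=$PROJECT --network=default --direction=INGRESS \
  --action=ALLOW --rules=tcp:22 --source-ranges=35.235.240.0/20
```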
Set environment variables for our desired location:
```
PROJECT=p20241007-gfs2-88
REGION=europe-west4
```
The Distributed Lock Manager (DLM) service and Corosync manage locking information between VMs in the cluster. Create a firewall rule to allow all communication between VMs tagged with `gfs2cluster`:

```
gcloud compute firewall-rules create gfs2cluster \
  --project=$PROJECT \
  --direction=INGRESS --allow=all \
  --target-tags=gfs2cluster \
  --source-tags=gfs2cluster
```
Create three Ubuntu VMs, one in each zone of your selected region. These VMs will form the basis of the GFS2 cluster:
```
for ID in a b c; do
  gcloud compute instances create node-$ID \
    --machine-type=c3-highcpu-8 \
    --shielded-secure-boot \
    --network-interface=stack-type=IPV4_ONLY,subnet=default,no-address \
    --tags=gfs2cluster \
    --project=$PROJECT \
    --zone=$REGION-$ID \
    --create-disk=auto-delete=yes,boot=yes,mode=rw,provisioned-iops=3000,provisioned-throughput=140,size=10,image-project=ubuntu-os-cloud,image-family=ubuntu-2404-lts-amd64,type=projects/$PROJECT/zones/$REGION-$ID/diskTypes/hyperdisk-balanced
done
```
Create a HdB-HA-MW disk with replicas in the `b` and `c` zones. We will use these zones for our GFS2 VMs and the VM in zone `a` for distributed locking only.

```
gcloud compute disks create hdb-ha-mw \
  --project=$PROJECT \
  --type=hyperdisk-balanced-high-availability \
  --replica-zones=projects/$PROJECT/zones/$REGION-b,projects/$PROJECT/zones/$REGION-c \
  --size=250GB \
  --access-mode=READ_WRITE_MANY \
  --provisioned-iops=6000 --provisioned-throughput=280
```
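Optionally, confirm the disk was created as a cross-zone, multi-writer disk before attaching it; the describe output should show the two replica zones and the READ_WRITE_MANY access mode:

```
gcloud compute disks describe hdb-ha-mw --project=$PROJECT --region=$REGION
```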
Attach the disk to the `-b` and `-c` VMs.

```
gcloud compute instances attach-disk node-b --disk=hdb-ha-mw --device-name=hdb-ha-mw --zone=$REGION-b --disk-scope=regional --project=$PROJECT
gcloud compute instances attach-disk node-c --disk=hdb-ha-mw --device-name=hdb-ha-mw --zone=$REGION-c --disk-scope=regional --project=$PROJECT
```
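Before formatting anything, it is worth checking on `node-b` and `node-c` that the disk is visible under the device name we assigned:

```
ls -l /dev/disk/by-id/google-hdb-ha-mw
lsblk
```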
Install and configure the VMs
Open a shell to each of the three VMs. Launching an SSH-in-browser tab from the Cloud Console is easy.
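If you prefer the command line, and assuming IAP TCP forwarding is configured as described above, something like this should also work:

```
gcloud compute ssh node-a --project=$PROJECT --zone=$REGION-a --tunnel-through-iap
```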
On `node-a` install the packages needed for distributed locking and cluster communications:

```
sudo apt-get update
sudo apt-get install corosync dlm-controld -y
```
On `node-b` and `node-c` install the packages for GFS2, distributed locking, and cluster communications:

```
sudo apt-get update
sudo apt-get install corosync dlm-controld gfs2-utils linux-modules-extra-$(uname -r) fio -y
```
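As a quick sanity check, you can confirm that the GFS2 kernel module from `linux-modules-extra` loads on `node-b` and `node-c`:

```
sudo modprobe gfs2
lsmod | grep gfs2
```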
Create a Corosync config file based on your VMs. Update the example below, replacing the `node` entries in the `nodelist` with the FQDN and IP address of your VMs. Use `hostname` and `ip addr show` on each VM to get the needed values. Once you have adapted the config, run the commands to apply it on all three VMs:

```
cat > corosync.conf <<EOF
totem {
  version: 2
  cluster_name: gcpcluster
  crypto_cipher: none
  crypto_hash: none
}
nodelist {
  node {
    name: node-a.europe-west4-a.c.p20241007-gfs2-88.internal
    nodeid: 1
    ring0_addr: 10.164.0.7
  }
  node {
    name: node-b.europe-west4-b.c.p20241007-gfs2-88.internal
    nodeid: 2
    ring0_addr: 10.164.0.8
  }
  node {
    name: node-c.europe-west4-c.c.p20241007-gfs2-88.internal
    nodeid: 3
    ring0_addr: 10.164.0.9
  }
}
quorum {
  provider: corosync_votequorum
}
system {
  allow_knet_handle_fallback: yes
}
logging {
  fileline: off
  to_stderr: yes
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  to_syslog: yes
  debug: off
  logger_subsys {
    subsys: QUORUM
    debug: off
  }
}
EOF
sudo cp corosync.conf /etc/corosync/corosync.conf
```
On each VM restart `corosync` and verify that all three nodes are members. Do not continue if the output does not show all three nodes as members:

```
sudo systemctl restart corosync
sudo corosync-quorumtool

Quorum information
------------------
Date:             Fri Oct  4 16:48:01 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1.12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 node-a.europe-west4-a.c.p20241007-gfs2-88.internal (local)
         2          1 node-b.europe-west4-b.c.p20241007-gfs2-88.internal
         3          1 node-c.europe-west4-c.c.p20241007-gfs2-88.internal
```
On `node-b` create a GFS2 volume. The `-t` argument includes the `cluster_name` we configured in the `corosync.conf` file, and `shared` is a unique filesystem name. We set `-j 2` because we are sharing the disk between two VMs. Run this command and answer yes when prompted:

```
sudo mkfs.gfs2 -p lock_dlm -t gcpcluster:shared -j 2 /dev/disk/by-id/google-hdb-ha-mw
```
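A side note on journals: GFS2 needs one journal per node that mounts the filesystem, which is why we created two. If you later want an additional VM to mount it, journals can be added to the mounted filesystem with `gfs2_jadd`; a hypothetical example, run from a node that already has the filesystem mounted at `/shared`:

```
sudo gfs2_jadd -j 1 /shared
```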
On both `node-b` and `node-c` create a mount point and a systemd service that mounts it on boot; a standard entry in `/etc/fstab` is insufficient because of the dependency on DLM.

```
sudo mkdir /shared
cat > gfs2-mount.service <<EOF
[Unit]
Description=Mount GFS2 Filesystem
After=network-online.target
After=dlm.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/mount -t gfs2 -o rw,noatime,nodiratime,rgrplvb /dev/disk/by-id/google-hdb-ha-mw /shared
ExecStop=/bin/umount /shared

[Install]
WantedBy=multi-user.target
EOF
sudo cp gfs2-mount.service /etc/systemd/system/gfs2-mount.service
sudo systemctl enable --now gfs2-mount.service
```
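At this point the filesystem should be mounted on both VMs and DLM should have joined the `shared` lockspace. A quick check, assuming `dlm_tool` is shipped with your `dlm-controld` package:

```
df -h /shared
sudo dlm_tool ls
```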
To ensure all required services start correctly on boot, stop and start all VMs from the Cloud Console or from the CLI using:
```
for ID in a b c; do
  gcloud compute instances stop node-$ID \
    --project=$PROJECT \
    --zone=$REGION-$ID
done

for ID in a b c; do
  gcloud compute instances start node-$ID \
    --project=$PROJECT \
    --zone=$REGION-$ID \
    --async
done
```
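Once the VMs are back up, confirm that the mount service ran cleanly on `node-b` and `node-c`:

```
sudo systemctl status gfs2-mount.service --no-pager
df -h /shared
```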
Mount and use the shared filesystem
Re-open shells to each of the three VMs. From `node-b` relax the filesystem permissions to ease access and then start a simple workload:

```
sudo chmod 777 /shared
cd /shared
watch -n 1 "echo -n $(hostname) ' ' >> date.out; date >> date.out"
```
On `node-c` read the file that is being appended to by `node-b`:

```
cd /shared
tail -f date.out

node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:49 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:50 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:51 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:52 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:53 UTC 2024
```
Let’s test this shared filesystem by running the same workload, on the same file, from `node-c`. This will demonstrate how GFS2’s locking mechanisms ensure data integrity:

```
watch -n 1 "echo -n $(hostname) ' ' >> date.out; date >> date.out"
```
Open another shell to `node-b` and follow the file to observe that writes from both hosts are intermixed:

```
tail -f /shared/date.out

node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:49 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:49 UTC 2024
node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:50 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:50 UTC 2024
node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:51 UTC 2024
node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:51 UTC 2024
```
Now it’s time to test the performance. Stop the `date` commands and use the `fio` command below to start a 4K random-write workload from each VM concurrently. Observe that each VM achieves half of the disk’s provisioned IOPS (3K IOPS in our case) with similar latencies:

```
fio --name=`hostname -s` --size=1G \
  --time_based --runtime=5m --ramp_time=2s --ioengine=libaio --direct=1 \
  --verify=0 --bs=4K --iodepth=2 --rw=randwrite --group_reporting=1

node-b: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=2
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=11.7MiB/s][w=3001 IOPS][eta 00m:00s]
node-b: (groupid=0, jobs=1): err= 0: pid=2501: Mon Oct  7 16:42:21 2024
  write: IOPS=3016, BW=11.8MiB/s (12.4MB/s)(3535MiB/300001msec); 0 zone resets
    slat (nsec): min=1919, max=150331, avg=4637.11, stdev=2043.55
    clat (usec): min=404, max=29166, avg=657.96, stdev=192.52
     lat (usec): min=407, max=29171, avg=662.60, stdev=192.85
    clat percentiles (usec):
     |  1.00th=[  474],  5.00th=[  506], 10.00th=[  529], 20.00th=[  553],
     | 30.00th=[  570], 40.00th=[  586], 50.00th=[  611], 60.00th=[  627],
     | 70.00th=[  652], 80.00th=[  685], 90.00th=[  955], 95.00th=[ 1123],
     | 99.00th=[ 1270], 99.50th=[ 1319], 99.90th=[ 1483], 99.95th=[ 1713],
     | 99.99th=[ 2671]
   bw (  KiB/s): min=11704, max=14024, per=100.00%, avg=12068.29, stdev=300.79, samples=600
   iops        : min= 2926, max= 3506, avg=3017.06, stdev=75.19, samples=600
  lat (usec)   : 500=4.09%, 750=82.75%, 1000=3.81%
  lat (msec)   : 2=9.32%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.32%, sys=1.57%, ctx=596311, majf=0, minf=37
  IO depths    : 1=0.0%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,904999,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
  WRITE: bw=11.8MiB/s (12.4MB/s), 11.8MiB/s-11.8MiB/s (12.4MB/s-12.4MB/s), io=3535MiB (3707MB), run=300001-300001msec

Disk stats (read/write):
  nvme0n2: ios=0/911244, sectors=0/7289952, merge=0/0, ticks=0/594190, in_queue=594190, util=98.94%
```
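If you also want a feel for read performance, a similar job with `--rw=randread` can be run from each VM. The options below simply mirror the write test, with a shorter runtime and a hypothetical job name:

```
fio --name=`hostname -s`-read --size=1G \
  --time_based --runtime=1m --ramp_time=2s --ioengine=libaio --direct=1 \
  --verify=0 --bs=4K --iodepth=2 --rw=randread --group_reporting=1
```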
Use Cloud Monitoring to verify the per-attachment IOPS using the `compute.googleapis.com/instance/disk/write_ops_count` metric, filtered on `device_name=hdb-ha-mw` and aggregated by the system metadata label `name`:
At this point you are free to try other workloads, reboot VMs to simulate failure modes, and in general experiment with the ins and outs of a shared-disk filesystem. When you are done, delete the VMs and the shared disk to clean up (see the example below).
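One possible cleanup sequence, mirroring the resources created above:

```
for ID in a b c; do
  gcloud compute instances delete node-$ID \
    --project=$PROJECT --zone=$REGION-$ID --quiet
done

gcloud compute disks delete hdb-ha-mw \
  --project=$PROJECT --region=$REGION --quiet

gcloud compute firewall-rules delete gfs2cluster \
  --project=$PROJECT --quiet
```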
Conclusion
The ability to replicate data across zones, coupled with the multi-writer access mode, makes Hyperdisk an ideal solution for applications demanding both high availability and performance. This blog post demonstrated a basic GFS2 implementation, but the possibilities extend far beyond that. Hyperdisk Balanced HA multi-writer provides a foundation for building resilient and scalable applications on Google Cloud.
As always, comments are welcome!