Building a shared GFS2 filesystem with Hyperdisk Balanced HA multi-writer

Building highly available and performant applications often requires shared storage accessible by multiple VMs. Traditionally, shared storage was available only to VMs in a single zone and could not tolerate a zone failure. Google Cloud’s Hyperdisk Balanced HA multi-writer technology solves this problem by combining cross-zone replication with multi-writer access from VMs across two zones.

This solution opens up new possibilities for applications requiring both resilience and performance including:

  • Host-based clusters: High availability designs that typically use Linux Pacemaker or Windows Server Failover Clusters to assign and move resources among nodes in the cluster. With Hyperdisk Balanced HA multi-writer, those nodes can now be in two different zones and use Persistent Reservations to control access to shared-disk resources.
  • Shared-disk filesystems: Shared storage designs supported by filesystems such as GFS2, OCFS2 and VMFS, which manage access to shared-disk resources using clustering and/or locking capabilities to keep data coherent across nodes. These are often combined with host-based clusters.

If your shared storage needs can be met using the NFS or SMB protocols, I highly recommend using Filestore (single zone or HA across three zones) or NetApp Volumes (single zone or HA across two zones). Both provide check-box simplicity for shared storage in a zone or across zones. But today's post is about Hyperdisk multi-writer, so we’ll take the harder path and build something ourselves!

Building a GFS2 Shared Filesystem with Ubuntu

Solutions that use shared disks tend to be complex, so to focus on the core concepts I decided to build a basic GFS2 shared-disk filesystem across two VMs in different zones. With this solution, an application running on each of those VMs can read and write data on the same filesystem concurrently, and if a VM (or zone) fails the data is still available from the other. Filesystem coherency is managed by a cluster-wide locking scheme that uses DLM and Corosync to exchange locking information. I added a VM in a third zone to make locking services resilient.
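
The reason for the third VM can be illustrated with majority-quorum arithmetic. The sketch below is a simplified model, not Corosync's implementation: it assumes one vote per node and a plain majority rule with no special modes, and the function names are made up for illustration.

```shell
# Simplified majority-quorum model: one vote per node, no special modes.
# quorum_threshold and is_quorate are illustrative names, not Corosync commands.
quorum_threshold() {
  # Smallest number of votes that is more than half of the expected votes.
  echo $(( $1 / 2 + 1 ))
}

is_quorate() {
  # $1 = expected votes, $2 = votes still present
  [ "$2" -ge "$(quorum_threshold "$1")" ]
}

for nodes in 2 3; do
  if is_quorate "$nodes" $(( nodes - 1 )); then
    echo "$nodes nodes: still quorate after losing one node"
  else
    echo "$nodes nodes: quorum lost after losing one node"
  fi
done
```

Under this model, two voters cannot lose either one without dropping below the threshold of two, while three voters can lose any single node (or zone) and keep a majority.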

The infrastructure can be visualized as:

[Figure: Multi-zone GFS2 design on GCP]

Step by Step Instructions

WARNING: These instructions are useful to prove the concept but are NOT suitable for a production deployment. A production deployment would secure Corosync, likely add Pacemaker cluster software for coordinated resource management and fencing to increase availability and fault tolerance, use LVM, and tune the DLM and GFS2 settings for the desired use case.

Prepare the environment and create cloud resources

Assumptions:

  • You are a project owner
  • Avoid setting public IP addresses on resources, instead:
  1. Set environment variables for our desired location:

    PROJECT=p20241007-gfs2-88
    REGION=europe-west4
    
  2. The Distributed Lock Manager (DLM) service and Corosync exchange locking information between VMs in the cluster. Create a firewall rule to allow all communication between VMs tagged with gfs2cluster:

    gcloud compute firewall-rules create gfs2cluster \
    --project=$PROJECT \
    --direction=INGRESS --allow=all \
    --target-tags=gfs2cluster \
    --source-tags=gfs2cluster
    
  3. Create three Ubuntu VMs, one in each zone of your selected region. These VMs will form the basis of the GFS2 cluster:

    for ID in a b c; do
    gcloud compute instances create node-$ID \
    --machine-type=c3-highcpu-8 \
    --shielded-secure-boot \
    --network-interface=stack-type=IPV4_ONLY,subnet=default,no-address \
    --tags=gfs2cluster \
    --project=$PROJECT \
    --zone=$REGION-$ID \
    --create-disk=auto-delete=yes,boot=yes,mode=rw,provisioned-iops=3000,provisioned-throughput=140,size=10,image-project=ubuntu-os-cloud,image-family=ubuntu-2404-lts-amd64,type=projects/$PROJECT/zones/$REGION-$ID/diskTypes/hyperdisk-balanced
    done
    
  4. Create a Hyperdisk Balanced HA multi-writer (HdB-HA-MW) disk with replicas in the b and c zones. We will use these zones for our GFS2 VMs, and the VM in zone a for distributed locking only.

    gcloud compute disks create hdb-ha-mw \
    --project=$PROJECT \
    --type=hyperdisk-balanced-high-availability \
    --replica-zones=projects/$PROJECT/zones/$REGION-b,projects/$PROJECT/zones/$REGION-c \
    --size=250GB \
    --access-mode=READ_WRITE_MANY \
    --provisioned-iops=6000 --provisioned-throughput=280
    
  5. Attach the disk to the -b and -c VMs.

    gcloud compute instances attach-disk node-b --disk=hdb-ha-mw --device-name=hdb-ha-mw --zone=$REGION-b --disk-scope=regional --project=$PROJECT
    gcloud compute instances attach-disk node-c --disk=hdb-ha-mw --device-name=hdb-ha-mw --zone=$REGION-c --disk-scope=regional --project=$PROJECT 
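
Before moving on, it can help to confirm that each VM actually sees the shared disk. The sketch below assumes that a disk attached with --device-name=hdb-ha-mw appears in the guest as /dev/disk/by-id/google-hdb-ha-mw, which is the path this post uses with mkfs.gfs2; check_shared_disk is an illustrative helper name, not an existing tool.

```shell
# Illustrative helper: report whether a path is a block device and where it resolves.
check_shared_disk() {
  dev=$1
  if [ -b "$dev" ]; then
    echo "OK: $dev -> $(readlink -f "$dev")"
  else
    echo "MISSING: $dev is not a block device on this host" >&2
    return 1
  fi
}

# Run on node-b and node-c after attaching the disk.
if check_shared_disk /dev/disk/by-id/google-hdb-ha-mw; then
  echo "shared disk is visible"
else
  echo "shared disk not found; re-check the attach-disk step"
fi
```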
    

Install and configure the VMs

  1. Open a shell to each of the three VMs. Launching an SSH-in-browser tab from Cloud Console is an easy way to do this.

  2. On node-a install the packages needed for distributed locking and cluster communications:

    sudo apt-get update
    sudo apt-get install corosync dlm-controld -y
    
  3. On node-b and node-c install the packages for GFS2, distributed locking and cluster communications:

    sudo apt-get update
    sudo apt-get install corosync dlm-controld gfs2-utils linux-modules-extra-$(uname -r) fio -y
    
  4. Create a Corosync config file for your VMs. In the template below, replace the node entries in the nodelist with the FQDN and IP address of each of your VMs; use hostname and ip addr show on each VM to get the values. Once you have adapted the config, run the commands on all three VMs to apply it:

    cat > corosync.conf <<EOF
    totem {
        version: 2
        cluster_name: gcpcluster
        crypto_cipher: none
        crypto_hash: none
    }
    
    nodelist {
        node {
                name: node-a.europe-west4-a.c.p20241007-gfs2-88.internal
                nodeid: 1
                ring0_addr: 10.164.0.7
        }
        node {
                name: node-b.europe-west4-b.c.p20241007-gfs2-88.internal
                nodeid: 2
                ring0_addr: 10.164.0.8
        }
        node {
                name: node-c.europe-west4-c.c.p20241007-gfs2-88.internal
                nodeid: 3
                ring0_addr: 10.164.0.9
        }
    }
    
    quorum {
        provider: corosync_votequorum
    }
    
    system {
        allow_knet_handle_fallback: yes
    }
    
    logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        debug: off
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
    }
    EOF
    
    sudo cp corosync.conf /etc/corosync/corosync.conf
    
  5. On each VM restart corosync and verify all three nodes are members. Do not continue if the output does not show all three nodes as members:

    sudo systemctl restart corosync
    
    sudo corosync-quorumtool 
    Quorum information
    ------------------
    Date:             Fri Oct  4 16:48:01 2024
    Quorum provider:  corosync_votequorum
    Nodes:            3
    Node ID:          1
    Ring ID:          1.12
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   3
    Highest expected: 3
    Total votes:      3
    Quorum:           2  
    Flags:            Quorate 
    
    Membership information
    ----------------------
    Nodeid      Votes Name
         1          1 node-a.europe-west4-a.c.p20241007-gfs2-88.internal (local)
         2          1 node-b.europe-west4-b.c.p20241007-gfs2-88.internal
         3          1 node-c.europe-west4-c.c.p20241007-gfs2-88.internal
    
  6. On node-b create a GFS2 filesystem. The -t argument combines the cluster_name we configured in the corosync.conf file with shared, a unique filesystem name. We set -j 2 to create two journals, one for each of the two VMs that will mount the disk. Run this command and answer yes when prompted:

    sudo mkfs.gfs2 -p lock_dlm -t gcpcluster:shared -j 2 /dev/disk/by-id/google-hdb-ha-mw
    
  7. On both node-b and node-c, create a mount point and a systemd service that mounts the filesystem at boot. A standard entry in /etc/fstab is insufficient because the mount depends on DLM.

    sudo mkdir /shared
    
    cat > gfs2-mount.service <<EOF
    [Unit]
    Description=Mount GFS2 Filesystem
    After=network-online.target
    After=dlm.service
    
    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/mount -t gfs2 -o rw,noatime,nodiratime,rgrplvb /dev/disk/by-id/google-hdb-ha-mw /shared
    ExecStop=/bin/umount /shared
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    sudo cp gfs2-mount.service /etc/systemd/system/gfs2-mount.service
    sudo systemctl enable --now gfs2-mount.service
    
  8. To ensure all required services start correctly on boot, stop and start all VMs from the Cloud Console or from the CLI using:

    for ID in a b c; do
    gcloud compute instances stop node-$ID \
    --project=$PROJECT \
    --zone=$REGION-$ID
    done
    
    for ID in a b c; do
    gcloud compute instances start node-$ID \
    --project=$PROJECT \
    --zone=$REGION-$ID \
    --async
    done
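
Two of the manual steps above lend themselves to small helpers: building the nodelist in step 4 and the "do not continue" check in step 5. The following is a rough sketch rather than a tested tool. render_nodelist and quorum_ok are hypothetical names, the node names and addresses are the example values from this post, and the field labels that quorum_ok matches are taken from the sample corosync-quorumtool output shown in step 5, so other Corosync versions may print them differently.

```shell
# Hypothetical helper: print a corosync nodelist block from "name address"
# pairs read on stdin, assigning node IDs 1, 2, 3... in input order.
render_nodelist() {
  nodeid=0
  echo "nodelist {"
  while read -r name addr; do
    [ -n "$name" ] || continue
    nodeid=$(( nodeid + 1 ))
    printf '    node {\n'
    printf '        name: %s\n' "$name"
    printf '        nodeid: %s\n' "$nodeid"
    printf '        ring0_addr: %s\n' "$addr"
    printf '    }\n'
  done
  echo "}"
}

# Hypothetical helper: succeed only if quorumtool-style output on stdin
# reports "Quorate: Yes" and the expected number of nodes ($1).
quorum_ok() {
  awk -v want="$1" '
    $1 == "Nodes:"   { nodes = $2 }
    $1 == "Quorate:" { quorate = $2 }
    END { exit !(quorate == "Yes" && nodes == want) }
  '
}

# Example values from this post; replace with your own FQDNs and addresses.
render_nodelist <<'EOF'
node-a.europe-west4-a.c.p20241007-gfs2-88.internal 10.164.0.7
node-b.europe-west4-b.c.p20241007-gfs2-88.internal 10.164.0.8
node-c.europe-west4-c.c.p20241007-gfs2-88.internal 10.164.0.9
EOF

# On a cluster node with corosync running, the check could be used as:
#   sudo corosync-quorumtool | quorum_ok 3 && echo "all three nodes are members"
```

Generating the block from a single list keeps the node IDs and addresses consistent across the three copies of the file, and the quorum check gives a scriptable stop condition before formatting the disk.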
    

Mount and use the shared filesystem

  1. Re-open shells to each of the three VMs. From node-b relax filesystem permissions to ease access and then start a simple workload:

    sudo chmod 777 /shared
    cd /shared
    watch -n 1 "echo -n $(hostname) ' ' >> date.out; date >> date.out"
    
  2. On node-c read the file that is being appended to by node-b:

    cd /shared
    tail -f date.out
    
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:49 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:50 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:51 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:52 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:20:53 UTC 2024
    
  3. Let’s test this shared filesystem by running the same workload, on the same file, from node-c. This will demonstrate how GFS2’s locking mechanisms ensure data integrity:

    watch -n 1 "echo -n $(hostname) ' ' >> date.out; date >> date.out"
    
  4. Open another shell to node-b and follow the file to observe that writes from both hosts are interleaved:

    tail -f /shared/date.out
    
    node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:49 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:49 UTC 2024
    node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:50 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:50 UTC 2024
    node-c.europe-west4-c.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:51 UTC 2024
    node-b.europe-west4-b.c.p20241007-gfs2-88.internal  Mon Oct  7 16:22:51 UTC 2024
    
  5. Now it’s time to test the performance. Stop the date commands and use the fio command below to start a random-write workload from each VM concurrently. Run it from the /shared directory so the test file lands on the shared filesystem. Observe that each VM achieves half of the disk's provisioned IOPS (3,000 of the 6,000 provisioned in our case) with similar latencies:

    fio --name=$(hostname -s) --size=1G \
    --time_based --runtime=5m --ramp_time=2s --ioengine=libaio --direct=1 \
    --verify=0 --bs=4K --iodepth=2 --rw=randwrite --group_reporting=1
    
    node-b: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=2
    fio-3.36
    Starting 1 process
    Jobs: 1 (f=1): [w(1)][100.0%][w=11.7MiB/s][w=3001 IOPS][eta 00m:00s]
    node-b: (groupid=0, jobs=1): err= 0: pid=2501: Mon Oct  7 16:42:21 2024
    write: IOPS=3016, BW=11.8MiB/s (12.4MB/s)(3535MiB/300001msec); 0 zone resets
    slat (nsec): min=1919, max=150331, avg=4637.11, stdev=2043.55
    clat (usec): min=404, max=29166, avg=657.96, stdev=192.52
     lat (usec): min=407, max=29171, avg=662.60, stdev=192.85
    clat percentiles (usec):
     |  1.00th=[  474],  5.00th=[  506], 10.00th=[  529], 20.00th=[  553],
     | 30.00th=[  570], 40.00th=[  586], 50.00th=[  611], 60.00th=[  627],
     | 70.00th=[  652], 80.00th=[  685], 90.00th=[  955], 95.00th=[ 1123],
     | 99.00th=[ 1270], 99.50th=[ 1319], 99.90th=[ 1483], 99.95th=[ 1713],
     | 99.99th=[ 2671]
    bw (  KiB/s): min=11704, max=14024, per=100.00%, avg=12068.29, stdev=300.79, samples=600
    iops        : min= 2926, max= 3506, avg=3017.06, stdev=75.19, samples=600
    lat (usec)   : 500=4.09%, 750=82.75%, 1000=3.81%
    lat (msec)   : 2=9.32%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
    cpu          : usr=0.32%, sys=1.57%, ctx=596311, majf=0, minf=37
    IO depths    : 1=0.0%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,904999,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2
    
    Run status group 0 (all jobs):
    WRITE: bw=11.8MiB/s (12.4MB/s), 11.8MiB/s-11.8MiB/s (12.4MB/s-12.4MB/s), io=3535MiB (3707MB), run=300001-300001msec
    
    Disk stats (read/write):
    nvme0n2: ios=0/911244, sectors=0/7289952, merge=0/0, ticks=0/594190, in_queue=594190, util=98.94%
    
  6. Use Cloud Monitoring to verify the per-attachment IOPS with the compute.googleapis.com/instance/disk/write_ops_count metric, filtered on device_name=hdb-ha-mw and aggregated by the system metadata label name. [Figure: Cloud Monitoring IOPS metrics]
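
The fio numbers in step 5 can be sanity-checked with a little arithmetic. This sketch assumes the even split between attachments seen in this test (an observation from this run, not a documented guarantee) and converts IOPS at a 4 KiB block size into MiB/s; the function names are illustrative.

```shell
# Illustrative arithmetic only; an even split across attachments is an
# assumption based on the result observed in this post.
per_attachment_iops() {
  # $1 = provisioned disk IOPS, $2 = number of VMs writing concurrently
  echo $(( $1 / $2 ))
}

iops_to_mib_per_s() {
  # $1 = IOPS, $2 = block size in bytes
  awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1048576 }'
}

share=$(per_attachment_iops 6000 2)
echo "expected IOPS per VM: $share"
echo "expected bandwidth per VM at 4 KiB: $(iops_to_mib_per_s "$share" 4096) MiB/s"
```

This works out to 3000 IOPS and roughly 11.7 MiB/s per VM, in line with the 3016 IOPS and 11.8 MiB/s that fio reported on node-b.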

At this point you are free to try other workloads, reboot to simulate failure modes, and in general experiment with the ins and outs of a shared-disk filesystem. When you are done, delete the VMs and the shared disk to clean up.
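
A cleanup along those lines might look like the sketch below. It defaults to a dry run that only prints the gcloud commands; the run wrapper and the DRY_RUN variable are illustrative conventions of this sketch, not gcloud features. The firewall rule is removed as well, since it was created only for this exercise.

```shell
PROJECT=${PROJECT:-p20241007-gfs2-88}
REGION=${REGION:-europe-west4}

# Print the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

cleanup() {
  for ID in a b c; do
    run gcloud compute instances delete "node-$ID" \
      --project="$PROJECT" --zone="$REGION-$ID" --quiet
  done
  # The shared disk is regional, so it is addressed by region rather than zone.
  run gcloud compute disks delete hdb-ha-mw \
    --project="$PROJECT" --region="$REGION" --quiet
  run gcloud compute firewall-rules delete gfs2cluster \
    --project="$PROJECT" --quiet
}

cleanup
```

Review the printed commands first, then run the script again with DRY_RUN=0 to actually delete the resources.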

Conclusion

The ability to replicate data across zones, coupled with the multi-writer access mode, makes Hyperdisk an ideal solution for applications demanding both high availability and performance. This blog post demonstrated a basic GFS2 implementation, but the possibilities extend far beyond. Hyperdisk Balanced HA multi-writer provides a foundation for building resilient and scalable applications on Google Cloud.

As always, comments are welcome!
