Fixing filesystem corruption on a Kubernetes Persistent Volume

I like to break things. Kubernetes is really good at maintaining desired state, so I decided to play whack-a-mole with one of my pods:

# kubectl get pods
NAME                       READY     STATUS    RESTARTS   AGE
etcd-2372957639-s3107   2/2       Running   1          1d

# kubectl delete pods etcd-2372957639-s3107
pod "etcd-2372957639-s3107" deleted

# kubectl get pods
NAME                       READY     STATUS              RESTARTS   AGE
etcd-2372957639-s3107   2/2       Terminating         1          1d
etcd-2372957639-tl5xh   0/2       ContainerCreating   0          4s

After dozens of ‘whacks’ I noticed the replacement pod got stuck with a status of ContainerCreating. Let’s check the logs:

# kubectl logs etcd-2372957639-tl5xh etcd
Error from server (BadRequest): container "etcd" in pod "etcd-2372957639-tl5xh" is waiting to start: ContainerCreating

Oh right: if the container hasn’t started yet, there are no logs to view! Let’s describe the pod and check the Events section:

# kubectl describe pods etcd-2372957639-tl5xh
...
Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  17m           17m             1       default-scheduler                       Normal          Scheduled       Successfully assigned etcd-2372957639-tl5xh to 3-docker2

  17m           3m              15      kubelet, 3-docker2                      Warning         FailedMount     MountVolume.SetUp failed for volume
   "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
   (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
   /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
   /dev/sda contains a file system with errors, check forced.
   /dev/sda: Unattached inode 23
   /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
   (i.e., without -a or -p options)

  1m             1m              1       kubelet, 3-docker2                     Warning         FailedMount     MountVolume.SetUp failed for volume
   "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c")
   with: mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
  Output: mount: /dev/sda is already mounted or
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy

  15m            24s             8       kubelet, 3-docker2                     Warning         FailedMount     Unable to mount volumes for pod  
   "etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list
   of unattached/unmounted volumes=[etcd-vol]

  15m            24s             8       kubelet, 3-docker2                     Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to
   attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]

So the pod was assigned to the 3-docker2 host, but when the kubelet tried to mount the volume, fsck found errors on the ext4 filesystem inside that require manual intervention. Not good.
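
With lots of events scrolling by it can be tedious to spot the interesting ones in the describe output. A small filter that pulls out just the Warning reasons is one way to speed that up (a sketch; the column layout of kubectl describe varies between versions, so this keys off the Type column rather than fixed field positions):

```shell
#!/bin/sh
# warn_reasons: print the Reason that follows each "Warning" type marker
# in kubectl-describe style event lines.
warn_reasons() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i == "Warning") print $(i + 1) }'
}

# Normally you would pipe in live output, e.g.:
#   kubectl describe pods etcd-2372957639-tl5xh | warn_reasons
# Here we feed it sample lines like the events above:
warn_reasons <<'EOF'
  17m  17m  1   default-scheduler   Normal   Scheduled    Successfully assigned ...
  17m  3m   15  kubelet, 3-docker2  Warning  FailedMount  MountVolume.SetUp failed ...
  15m  24s  8   kubelet, 3-docker2  Warning  FailedSync   Error syncing pod ...
EOF
# prints: FailedMount
#         FailedSync
```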

Since the pod is scheduled to the 3-docker2 host, let’s log in there and check the kubelet logs:

# journalctl -r -u kubelet

Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794177   13812 iscsi_util.go:195] iscsi: failed to mount iscsi volume
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 [ext4] to
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0, error 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794443   13812 disk_manager.go:50] failed to attach disk
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794471   13812 iscsi.go:228] iscsi: failed to setup
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794898   13812 nestedpendingoperations.go:262] Operation for
"\"kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd\" (\"fb6d5a06-56b5-11e7-bafb-525400f4fd4c\")" failed. No retries permitted until 2017-06-21 21:31:39.794803785
+0200 CEST (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod
"fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869197   13812 kubelet.go:1549] Unable to mount volumes for pod
"etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount
for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]; skipping pod
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869346   13812 pod_workers.go:182] Error syncing pod fb6d5a06-56b5-11e7-bafb-525400f4fd4c
("etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)"), skipping:
timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]

So, basically the same story, just with more detail from the kubelet’s side.

The device is already presented to the host but not mounted, so we can run fsck manually:

# fsck /dev/sda

fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sda contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 23
Connect to /lost+found<y>? yes
Inode 23 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences:  +(7939--23563)
Fix<y>? yes
Free blocks count wrong for group #5 (16785, counted=32410).
Fix<y>? yes
Free blocks count wrong (391677, counted=439712).
Fix<y>? yes
Inode bitmap differences:  -17
Fix<y>? yes
Free inodes count wrong for group #0 (8125, counted=8126).
Fix<y>? yes
Free inodes count wrong (122141, counted=122142).
Fix<y>? yes

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 18/122160 files (16.7% non-contiguous), 48736/488448 blocks
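
The “could not correct them” wording in the kubelet logs comes from the exit status of the automatic fsck -a the kubelet runs before mounting: fsck returns a bitmask, and (going by the Kubernetes mount code, as far as I can tell) any status with the “errors left uncorrected” bit set aborts the mount. A minimal decoder for that bitmask, per fsck(8):

```shell
#!/bin/sh
# decode_fsck: turn fsck's bitmask exit status into words (see fsck(8)).
# Bit 4 ("errors left uncorrected") is what made the kubelet refuse to
# mount the volume and leave the pod in ContainerCreating.
decode_fsck() {
  s=$1
  if [ "$s" -eq 0 ]; then echo "no errors"; return 0; fi
  [ $((s & 1)) -ne 0 ] && echo "errors corrected"
  [ $((s & 2)) -ne 0 ] && echo "system should be rebooted"
  [ $((s & 4)) -ne 0 ] && echo "errors left uncorrected"
  [ $((s & 8)) -ne 0 ] && echo "operational error"
  return 0
}

# Typical use after a manual run (illustrative, not executed here):
#   fsck -p "$DEV"; decode_fsck $?
decode_fsck 4   # prints: errors left uncorrected
```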

That fixed it. I’m no ext4 expert, but the recovered inode probably ended up as a file in lost+found that is worth inspecting in case we still need it. In any case, the filesystem should be mountable again.
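
One caution if you do this yourself: make sure the device really is unmounted first, because running fsck on a mounted ext4 filesystem can make the corruption worse. A sketch of such a guard (findmnt is part of util-linux; the /dev/sda path in the comment is the device from this host):

```shell
#!/bin/sh
# ensure_unmounted: refuse to proceed if the device is mounted anywhere.
ensure_unmounted() {
  if findmnt -S "$1" >/dev/null 2>&1; then
    echo "$1 is mounted somewhere; not safe to fsck" >&2
    return 1
  fi
  return 0
}

# Hypothetical usage for the device above:
#   ensure_unmounted /dev/sda && fsck /dev/sda
```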

Let’s give it a minute and check whether Kubernetes brought the pod back up:

# kubectl get pods
NAME                       READY     STATUS    RESTARTS   AGE
etcd-2372957639-tl5xh      2/2       Running   1          50m

There we go, our pod is running again!
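
Rather than rerunning kubectl get pods by hand, a small retry helper can do the waiting for you. A sketch (the kubectl line in the comment is illustrative, using this pod’s name):

```shell
#!/bin/sh
# retry: run a command until it succeeds, at most $1 times, sleeping
# $2 seconds between attempts. Returns 1 if all attempts fail.
retry() {
  attempts=$1; delay=$2; shift 2
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
  done
  return 0
}

# Hypothetical usage: wait up to 5 minutes for the pod to come back.
#   retry 60 5 sh -c \
#     'kubectl get pods etcd-2372957639-tl5xh | grep -q Running'
```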

For the paranoid (me included): I wanted to be sure that neither another admin nor Kubernetes itself could corrupt things further while my fsck was running. Checking the event log, I can see that while fsck held the device, the kubelet’s mount attempts failed with a new error because the device was busy:

Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  10m           2m              5       kubelet, 3-docker2      Warning         FailedMount     MountVolume.SetUp failed for volume
  "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
  (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.default-trash-7971f.2484-lun-0
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
  Output: mount: /dev/sdh is already mounted or /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy

A bit of filesystem recovery, container style :-)

As always, comments are welcome!
