I like to break things. Kubernetes is really good at maintaining desired state so I decided to play whack-a-mole with my pod:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-2372957639-s3107 2/2 Running 1 1d
# kubectl delete pods etcd-2372957639-s3107
pod "etcd-2372957639-s3107" deleted
# kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-2372957639-s3107 2/2 Terminating 1 1d
etcd-2372957639-tl5xh 0/2 ContainerCreating 0 4s
After dozens of ‘whacks’ I noticed it got stuck with a status of ContainerCreating
. Lets check the logs:
# kubectl logs etcd-2372957639-tl5xh etcd
Error from server (BadRequest): container "etcd" in pod "etcd-2372957639-tl5xh" is waiting to start: ContainerCreating
Oh yeah, if the container hasn’t started yet it has no logs to view! Lets describe the pod and check the events section:
# kubectl describe pods etcd-2372957639-tl5xh
...
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
17m 17m 1 default-scheduler Normal Scheduled Successfully assigned etcd-2372957639-tl5xh to 3-docker2
17m 3m 15 kubelet, 3-docker2 Warning FailedMount MountVolume.SetUp failed for volume
"kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
(UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
/dev/sda contains a file system with errors, check forced.
/dev/sda: Unattached inode 23
/dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
1m 1m 1 kubelet, 3-docker2 Warning FailedMount MountVolume.SetUp failed for volume
"kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c")
with: mount failed: exit status 32
Mounting command: mount
Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
Output: mount: /dev/sda is already mounted or
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy
15m 24s 8 kubelet, 3-docker2 Warning FailedMount Unable to mount volumes for pod
"etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list
of unattached/unmounted volumes=[etcd-vol]
15m 24s 8 kubelet, 3-docker2 Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to
attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]
So we can see the pod was assigned to the 3-docker2 host, but then when the kubelet tried to mount a volume the ext4
filesystem inside had errors that require manual intervention. Not good.
Since the pod is scheduled to host 3-docker2 lets go there and check out the logs:
# journalctl -r -u kubelet
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794177 13812 iscsi_util.go:195] iscsi: failed to mount iscsi volume
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 [ext4] to
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0, error 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794443 13812 disk_manager.go:50] failed to attach disk
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794471 13812 iscsi.go:228] iscsi: failed to setup
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794898 13812 nestedpendingoperations.go:262] Operation for
"\"kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd\" (\"fb6d5a06-56b5-11e7-bafb-525400f4fd4c\")" failed. No retries permitted until 2017-06-21 21:31:39.794803785
+0200 CEST (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod
"fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869197 13812 kubelet.go:1549] Unable to mount volumes for pod
"etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount
for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]; skipping pod
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869346 13812 pod_workers.go:182] Error syncing pod fb6d5a06-56b5-11e7-bafb-525400f4fd4c
("etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)"), skipping:
timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]
So basically the same story.
The device is already presented to the host but just not mounted so run the fsck
:
# fsck /dev/sda
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sda contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 23
Connect to /lost+found<y>? yes
Inode 23 ref count is 2, should be 1. Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences: +(7939--23563)
Fix<y>? yes
Free blocks count wrong for group #5 (16785, counted=32410).
Fix<y>? yes
Free blocks count wrong (391677, counted=439712).
Fix<y>? yes
Inode bitmap differences: -17
Fix<y>? yes
Free inodes count wrong for group #0 (8125, counted=8126).
Fix<y>? yes
Free inodes count wrong (122141, counted=122142).
Fix<y>? yes
/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 18/122160 files (16.7% non-contiguous), 48736/488448 blocks
So we fixed it. I’m not an expert on ext4
but there is probably a file in lost+found
we might want to check if we need. In any case, the filesystem should be able to be mounted now.
Lets give it a minute and check if Kubernetes got it up:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-2372957639-tl5xh 2/2 Running 1 50m
There we go, our pod is running again!
For the paranoid (me included) I was curious that no other admins or Kubernetes itself could corrupt things even more while I did my fsck
. Checking the event log I see that while my fsck
was active it logged a new error because the device was busy:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
10m 2m 5 kubelet, 3-docker2 Warning FailedMount MountVolume.SetUp failed for volume
"kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
(UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: mount failed: exit status 32
Mounting command: mount
Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.default-trash-7971f.2484-lun-0
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
Output: mount: /dev/sdh is already mounted or /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy
A bit of filesystem recovery container style :-)
As always, comments are welcome!