We are running into intermittent connectivity/DNS problems on a Kubernetes 1.10 cluster running on Ubuntu.
We have been digging through bug reports and the like, and recently narrowed it down to a process holding /run/xtables.lock, which is causing problems for one of the kube-proxy pods.
One of the kube-proxy pods bound to a worker node keeps repeating this error in its logs:
E0920 13:39:42.758280 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 13:46:46.193919 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:05:45.185720 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:11:52.455183 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:38:36.213967 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:44:43.442933 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
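For anyone checking their own cluster, these entries can be pulled straight from the affected kube-proxy pod's logs; the pod name below is one of ours (from the listing further down), so substitute the pod on your worker:

kubectl -n kube-system logs kube-proxy-d7d8g | grep xtables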
These errors started roughly 3 weeks ago and so far we have been unable to correct them. Because the problem is intermittent, we only now tracked it down to this.
We believe this is also causing one of the kube-flannel-ds pods to be stuck in a permanent CrashLoopBackOff state:
NAME                                  READY   STATUS             RESTARTS   AGE
coredns-78fcdf6894-6z6rs              1/1     Running            0          40d
coredns-78fcdf6894-dddqd              1/1     Running            0          40d
etcd-k8smaster1                       1/1     Running            0          40d
kube-apiserver-k8smaster1             1/1     Running            0          40d
kube-controller-manager-k8smaster1    1/1     Running            0          40d
kube-flannel-ds-amd64-sh5gc           1/1     Running            0          40d
kube-flannel-ds-amd64-szkxt           0/1     CrashLoopBackOff   7077       40d
kube-proxy-6pmhs                      1/1     Running            0          40d
kube-proxy-d7d8g                      1/1     Running            0          40d
kube-scheduler-k8smaster1             1/1     Running            0          40d
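For reference, the listing above is simply the output of the standard pod listing; watching it makes the restart churn on the flannel pod obvious:

kubectl -n kube-system get pods
kubectl -n kube-system get pods -w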
Most of the bug reports around /run/xtables.lock seem to indicate that this was resolved back in July 2017, but we are seeing it on a fresh setup. We appear to have the proper chains configured in iptables.
Running fuser /run/xtables.lock returns nothing.
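For completeness, the kind of check we ran on the node looks like the following; lsof is an alternative if fuser is not installed, and since iptables only takes the flock briefly while it runs, an empty result here is not unusual:

sudo fuser -v /run/xtables.lock
sudo lsof /run/xtables.lock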
Does anyone have any insight into this? It is causing a lot of pain.
So, after some more digging we were able to find the reason code using the following command:
kubectl -n kube-system describe pods kube-flannel-ds-amd64-szkxt
The pod name will of course differ between installations, but the termination reason in the output was:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
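The same reason code can also be pulled out directly with a jsonpath query instead of scanning the full describe output (again, substitute your own pod name; this assumes the flannel container is the first container in the pod):

kubectl -n kube-system get pod kube-flannel-ds-amd64-szkxt \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'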
We had missed this reason code before (we had been focusing mostly on the exit code 137); it means out of memory: killed (OOMKilled).
By default, kube-flannel-ds gets a maximum memory allocation of 100Mi, which is apparently too low. Other issues have been logged about changing this default in the reference configuration, but our fix was simply to bump the limit up to 256Mi.
Changing the configuration is one part of the fix; simply issue:
kubectl -n kube-system edit ds kube-flannel-ds-amd64
and change the 100Mi value under limits -> memory to something higher; we used 256Mi.
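For reference, after the edit the container's resources section ends up looking roughly like this; the requests values shown are the upstream flannel defaults as best we recall, and only the limits -> memory line was changed:

resources:
  requests:
    cpu: "100m"
    memory: "50Mi"
  limits:
    cpu: "100m"
    memory: "256Mi"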
By default these pods are only updated OnDelete, so you then need to delete the pod that is stuck in CrashLoopBackOff; it will be recreated with the updated values.
I suppose you could also roll through and delete the ones on the other nodes as well, but we only deleted the one that kept failing.
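Concretely, the delete-and-recreate step amounts to something like this (pod name from the listing above; adjust for your cluster), after which the DaemonSet controller brings up a replacement with the new limit:

kubectl -n kube-system delete pod kube-flannel-ds-amd64-szkxt
kubectl -n kube-system get pods -w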
Here are references to some of the issues that helped us track this down:
https://github.com/coreos/flannel/issues/963
https://github.com/coreos/flannel/issues/1012