Podman pod disappears after a few days, but the process is still running and listening on a given port - RHEL

I am running an Elasticsearch container as a Podman pod using podman play kube and a YAML definition of the pod. The pod is created, the cluster of three nodes comes up, and everything works as expected. But the Podman pod dies after a few days of sitting idle.
The podman pod ps command says:
ERRO[0000] Error refreshing container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: error acquiring lock 1 for container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: file exists
ERRO[0000] Error refreshing container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: error acquiring lock 2 for container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: file exists
ERRO[0000] Error refreshing pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: error retrieving lock 0 for pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: file exists
POD ID NAME STATUS CREATED INFRA ID # OF CONTAINERS
389b0c34313d elasticsearch-pod Created 1 week ago af05fafe31f6 2
What's weird is that the process is still listening when we look for the process ID bound to port 9200 or 9300:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 0 :::9200 :::* LISTEN 1328607/containers-
tcp6 0 0 :::9300 :::* LISTEN 1328607/containers-
The process that is hanging around (and keeping the ports listening) is:
user+ 1339220 0.0 0.1 45452 8284 ? S Jan11 2:19 /bin/slirp4netns --disable-host-loopback --mtu 65520 --enable-sandbox --enable-seccomp -c -e 3 -r 4 --netns-type=path /tmp/run-1002/netns/cni-e4bb2146-d04e-c3f1-9207-380a234efa1f tap0
The only actions I perform on the pod are the usual podman pod stop, podman pod rm, and podman play kube to start it again.
What could be causing this strange behaviour in Podman? What might be causing the lock not to be released properly?
System information:
NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
Red Hat Enterprise Linux release 8.3 (Ootpa)
Podman version:
podman --version
podman version 2.2.1

The workaround that worked for me is to add this configuration file from the Podman repository [1] under /usr/lib/tmpfiles.d/ and /etc/tmpfiles.d/; this prevents systemd from removing Podman's temporary files from the /tmp directory [2]. As stated in [3], CNI additionally leaves network information in /var/lib/cni/networks when the system crashes or containers do not shut down properly. This behaviour affects rootless Podman and has been fixed in a later Podman release [4].
Workaround
First, check the runRoot default directory set for your Podman rootless user:
podman info | grep runRoot
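For example, the output might look like the following (the UID in the path is only illustrative; it should match your rootless user, like the run-1002 seen in the netns path above):
  runRoot: /tmp/run-1002/containers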
Create the tmpfiles configuration file:
sudo vim /usr/lib/tmpfiles.d/podman.conf
Add the following content, replacing /tmp/podman-run-* with your default runRoot directory. For example, if your output is /tmp/run-6695/containers, then use: x /tmp/run-*
# /tmp/podman-run-* directory can contain content for Podman containers that have run
# for many days. This following line prevents systemd from removing this content.
x /tmp/podman-run-*
x /tmp/containers-user-*
D! /run/podman 0700 root root
D! /var/lib/cni/networks
Copy the configuration file from /usr/lib/tmpfiles.d/ to /etc/tmpfiles.d/:
sudo cp -p /usr/lib/tmpfiles.d/podman.conf /etc/tmpfiles.d/
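Optionally, you can check that systemd-tmpfiles sees the new rules (the --cat-config switch is available in recent systemd versions, including the one shipped with RHEL 8; the grep pattern is just an example):
systemd-tmpfiles --cat-config | grep podman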
After you have done all the steps according to your configuration, the error should disappear.
References
[1] https://github.com/containers/podman/blob/master/contrib/tmpfile/podman.conf
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1888988#c9
[3] https://github.com/containers/podman/commit/2e0a9c453b03d2a372a3ab03b9720237e93a067c
[4] https://github.com/containers/podman/pull/8241

Related

Why does the Linux ps command "see" the processes run by K8s pods?

I have a K8s cluster created in the context of the Linux Foundation's CKAD course (LFD259). So it is a "bare metal" cluster created with kubeadm.
I have a metrics-server deployment running on the worker node:
student@master:~$ k get deployments.apps metrics-server -o yaml | grep -A10 args
- args:
- --secure-port=4443
- --cert-dir=/tmp
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls
image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
student@master:~$ k get pod metrics-server-6894588c69-fpvtt -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-server-6894588c69-fpvtt 1/1 Running 0 4d15h 192.168.171.98 worker <none> <none>
student@master:~$
It is my understanding that the pod's process runs inside a container on the worker node. However, I am completely puzzled by the fact that the Linux ps command "sees" it:
student@worker:~$ ps aux | grep kubelet-preferred-address-types
ubuntu 1343092 0.3 0.6 752468 49612 ? Ssl Oct28 20:25 /metrics-server --secure-port=4443 --cert-dir=/tmp --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --kubelet-use-node-status-port --metric-resolution=15s --kubelet-insecure-tls
student 3310743 0.0 0.0 8184 2532 pts/0 S+ 17:39 0:00 grep --color=auto kubelet-preferred-address-types
student@worker:~$
What am I missing?
A container is just a process running on your host with some isolation features enabled. The isolation only works in one way: a container can't see resources on your host, but your host has access to all the resources running in a container.
Because a container is just a process, it shows up in ps (as do any processes that are spawned inside the container).
See e.g.:
"What is a Linux Container?"

Failed to connect to containerd: failed to dial

Just installed Docker CE on Ubuntu 14.04, following the official instructions for the repository-based install.
Installation went successfully, and the daemon is running:
$ ps aux | grep docker
[...] /usr/bin/dockerd --raw-logs [...]
My user is in the docker group:
$ groups
[...] docker
The CLI can't seem to communicate (same with sudo):
$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Is the docker daemon running?
The socket seems to have the correct permissions:
$ ls -l /var/run/docker.sock
srw-rw---- 1 root docker 0 Feb 4 16:21 /var/run/docker.sock
The log does complain about some issues, though:
$ sudo tail -f /var/log/upstart/docker.log
Failed to connect to containerd: failed to dial "/var/run/docker/containerd/docker-containerd.sock": dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout
/var/run/docker.sock is up
time="2018-02-04T16:22:21.031459040+01:00" level=info msg="libcontainerd: started new docker-containerd process" pid=17147
INFO[0000] starting containerd module=containerd revision=89623f28b87a6004d4b785663257362d1658a729 version=v1.0.0
INFO[0000] setting subreaper... module=containerd
containerd: invalid argument
time="2018-02-04T16:22:21.056685023+01:00" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
Any advice to make this work?
Relogging and restarting Docker have already been done, of course.
As @bobbear suggested, and as is actually mentioned in the official docs, one of the prerequisites is:
Version 3.10 or higher of the Linux kernel. The latest version of the kernel available for your platform is recommended.
After having checked my Kernel version:
$ uname -a
Linux [...] 3.2.[...]-generic [...]-Ubuntu [...] x86_64
I searched for candidates:
$ apt-cache search linux-image
And installed my new_kernel:
$ sudo apt-get install \
linux-image-new_kernel \
linux-headers-new_kernel \
linux-image-extra-new_kernel
The same situation happened to me. It is because your Linux kernel version is too low! Check it with the command "uname -r"; if the version is below 3.10 (for example, Debian 7 Wheezy's default is 3.2), then even if you install docker-ce successfully, you still won't be able to start the Docker daemon. That's why! Most answers on the web just tell you to 'restart' this and that, but they don't consider this problem.
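A small pre-flight check along these lines can save the trouble (a sketch; it assumes GNU sort with -V, and the 3.10 threshold is the prerequisite quoted above):
#!/bin/sh
# Compare the running kernel against Docker's minimum supported version (3.10).
required="3.10"
current="$(uname -r | cut -d- -f1)"
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
    echo "Kernel $current is older than $required; upgrade the kernel before installing Docker."
else
    echo "Kernel $current meets the minimum requirement."
fi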

Docker daemon throwing error while starting in Linux RHEL

I am trying to start the dockerd daemon with this command: dockerd &
Then I start getting the error below:
ERRO[0036] libcontainerd: failed to receive event from containerd: rpc error: code = 12 desc = unknown service types.API
This keeps rolling again and again, and I am unable to start any container after that. If I close the session and open a new one, I can see that docker ps is accessible, but I am still unable to start any container. While starting a container I get this error:
docker run hello-world
docker: Error response from daemon: unknown service types.API. ERRO[0000] error waiting for container: context canceled
Please let me know if any logs are needed.
Why do you start the docker daemon using dockerd & and not systemctl start docker.service? This is probably the cause of your problem.
In order to start the daemon at boot, you need to run systemctl enable docker.service. See Getting Started with Containers.
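In concrete terms (standard systemctl invocations; adjust the unit name if your distribution packages it differently):
sudo systemctl start docker.service      # start the daemon now
sudo systemctl enable docker.service     # start it automatically at boot
systemctl status docker.service          # confirm it is active
journalctl -u docker.service -e          # inspect the daemon logs if it fails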
Note that the kernel for Red Hat Enterprise Linux 6 only supports a limited subset of the functionality needed for container support, and I don't think anyone tests either the daemon or container images on that operating system version.

dockerd: Error running deviceCreate (CreatePool) dm_task_run failed

I'm building some CentOS VMs with VMware, with no access to the internet, so I've downloaded and made local repositories, including this one.
Then I installed docker-engine.x86_64, and when starting the Docker daemon, I get the following errors:
[root]# dockerd
DEBU[0000] docker group found. gid: 993
...
...
DEBU[0001] Error retrieving the next available loopback: open /dev/loop-control: no such device
ERRO[0001] There are no more loopback devices available.
ERRO[0001] [graphdriver] prior storage driver "devicemapper" failed: loopback attach failed
DEBU[0001] Cleaning up old mountid : start.
FATA[0001] Error starting daemon: error initializing graphdriver: loopback attach failed
After manually adding the loop module, which controls loop devices, with this command:
insmod /lib/modules/3.10.0-327.36.2.el7.x86_64/kernel/drivers/block/loop.ko
The error changes to:
[graphdriver] prior storage driver "devicemapper" failed: devicemapper: Error running deviceCreate (CreatePool) dm_task_run failed
I've read that it could be because I don't have enough disk space, but I don't think that's it. Any ideas?
[root]# df -k .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/centos-root 51887356 2436256 49451100 5% /
I got the "There are no more loopback devices available" error, which stopped dockerd from running.
I fixed it by ensuring the storage driver was 'overlay':
# /usr/bin/dockerd -D --storage-driver=overlay
This was on Debian Jessie and docker running as a systemd service/unit.
To make it permanent, I created a systemd drop-in:
$ cat /etc/systemd/system/docker.service.d/docker.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --storage-driver=overlay
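For the drop-in to take effect, systemd needs to reload its configuration and Docker needs a restart (a sketch; the grep just checks the value reported by docker info):
sudo systemctl daemon-reload
sudo systemctl restart docker
docker info | grep -i 'storage driver'   # should now report: overlay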

How to identify orphaned veth interfaces and how to delete them?

When I start any container with docker run, I get a new veth interface. After deleting the container, the veth interface that was linked with it should be removed. However, sometimes this fails (often when the container started with errors):
root@hostname /home # ifconfig | grep veth | wc -l
53
root@hostname /home # docker run -d -P axibase/atsd -name axibase-atsd-
28381035d1ae2800dea51474c4dee9525f56c2347b1583f56131d8a23451a84e
Error response from daemon: Cannot start container 28381035d1ae2800dea51474c4dee9525f56c2347b1583f56131d8a23451a84e: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 33359 -j DNAT --to-destination 172.17.2.136:8883 ! -i docker0: iptables: No chain/target/match by that name.
(exit status 1)
root@hostname /home # ifconfig | grep veth | wc -l
55
root@hostname /home # docker rm -f 2838
2838
root@hostname /home # ifconfig | grep veth | wc -l
55
How can I identify which interfaces are linked with existing containers, and how can I remove the extra interfaces that were linked with removed containers?
This approach doesn't work (as root):
ifconfig veth55d245e down
brctl delbr veth55d245e
can't delete bridge veth55d245e: Operation not permitted
For now I identify extra interfaces by transmitted traffic (if there is no activity, it's an extra interface).
UPDATE
root@hostname ~ # uname -a
Linux hostname 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@hostname ~ # docker info
Containers: 10
Images: 273
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 502
Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-53-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 8
Total Memory: 47.16 GiB
Name: hostname
ID: 3SQM:44OG:77HJ:GBAU:2OWZ:C5CN:UWDV:JHRZ:LM7L:FJUN:AGUQ:HFAL
WARNING: No swap limit support
root@hostname ~ # docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
There are three problems here:
Starting a single container should not increase the count of veth interfaces on your system by 2, because when Docker creates a veth pair, one end of the pair is isolated in the container namespace and is not visible from the host.
It looks like you're not able to start a container:
Error response from daemon: Cannot start container ...
Docker should be cleaning up the veth interfaces automatically.
These facts make me suspect that there is something fundamentally wrong in your environment. Can you update your question with details about what distribution you're using, which kernel version, and which Docker version?
How can I identify which interfaces are linked with existing containers, and how can I remove the extra interfaces that were linked with removed containers?
With respect to manually deleting veth interfaces: A veth interface isn't a bridge, so of course you can't delete one with brctl.
To delete a veth interface:
# ip link delete <ifname>
Detecting "idle" interfaces is a thornier problem, because if you just look at traffic you're liable to accidentally delete something that was still in use but that just wasn't seeing much activity.
I think what you would actually want to look for are veth interfaces whose peer is also visible in the global network namespace. You can find the peer of a veth interface using these instructions, and then it would be a simple matter of seeing if that interface is visible, and then deleting one or the other (deleting a veth interface will also remove its peer).
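For example, something along these lines should work (a sketch; vethXXXX is a placeholder interface name, and it assumes ethtool and iproute2 are installed):
# Read the peer's ifindex from the veth statistics, then look that index up.
peer_index=$(ethtool -S vethXXXX | awk '/peer_ifindex/ {print $2}')
ip link show | grep "^${peer_index}: "
# If the peer is listed here (i.e. visible in the host namespace), the pair is
# likely orphaned and can be removed with: ip link delete vethXXXX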
Fixed by upgrading Docker to the latest version.
New version:
root@hostname ~ # docker version
Client:
Version: 1.8.1
API version: 1.20
Go version: go1.4.2
Git commit: d12ea79
Built: Thu Aug 13 02:35:49 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.8.1
API version: 1.20
Go version: go1.4.2
Git commit: d12ea79
Built: Thu Aug 13 02:35:49 UTC 2015
OS/Arch: linux/amd64
Now interfaces are removed together with their containers. Old orphaned interfaces were deleted manually with the following command:
# ip link delete <ifname>
Here is how you can delete them all at once by pattern:
for name in $(ifconfig -a | sed 's/[ \t].*//;/^\(lo\|\)$/d' | grep veth)
do
echo $name
# ip link delete $name # uncomment this
done
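A more direct variant, if you prefer iproute2 over ifconfig (a sketch; ip -o link show type veth lists only veth devices, and names may carry an @ifNN suffix that needs stripping):
for name in $(ip -o link show type veth | awk -F': ' '{print $2}' | cut -d@ -f1)
do
    echo "$name"
    # ip link delete "$name"   # uncomment to actually delete
done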
In my case, all virtual Ethernet network interfaces had been created by Docker. To solve this, I stopped all running Docker containers:
docker stop $(docker ps -q)
And then deleted all networks created by Docker:
docker network rm $(docker network ls -q)
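On newer Docker releases (1.13 and later), unused networks can also be cleaned up in one step:
docker network prune   # prompts for confirmation; removes all unused networks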
