How to know Memory cgroup limit? - linux

We have a Kubernetes cluster and run Jenkins in it. Jenkins restarts roughly every 48 hours. When we check the logs on the worker where Jenkins is deployed, we see this error:
Feb 15 14:52:01 myworker kernel: Memory cgroup out of memory: Kill process 110129 (Computer.thread) score 1972 or sacrifice child
Feb 15 14:52:01 myworker kernel: Killed process 50179 (java), UID 1000, total-vm:17378260kB, anon-rss:8371056kB, file-rss:29676kB, shmem-rss:0kB
where 50179 is the java process for Jenkins.
We set the memory limit for Jenkins in Kubernetes to 8Gi:
resources:
  limits:
    cpu: 3500m
    memory: 8Gi
  requests:
    cpu: "1"
    memory: 4Gi
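As a quick sanity check, the 8Gi limit above can be converted into the kB unit that the kernel uses in its OOM report. This is a minimal sketch that only handles the binary suffixes Ki/Mi/Gi/Ti; the helper name `quantity_to_bytes` is made up for this example and is not part of any Kubernetes client.

```python
# Convert a Kubernetes binary-suffix memory quantity to bytes.
# Only Ki/Mi/Gi/Ti are handled; this is an illustrative subset,
# not a full parser for Kubernetes resource quantities.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_bytes(quantity: str) -> int:
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain number of bytes


limit_bytes = quantity_to_bytes("8Gi")
limit_kib = limit_bytes // 1024  # the kernel log reports kB in units of 1024 bytes
print(limit_bytes, limit_kib)
```

The resulting 8388608 can be compared against the `limit ...kB` value in the kernel's "memory: usage" line to see whether the cgroup limit and the pod limit agree.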
I also checked the New Relic alerts, which we integrated with our pods; memory usage there never goes beyond 5 GB.
Detailed logs below.
Feb 15 14:52:01 myworker kernel: Download metada cpuset=kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d mems_allowed=0
Feb 15 14:52:01 myworker kernel: CPU: 6 PID: 115222 Comm: Download metada Kdump: loaded Tainted: G ------------ T 3.10.0-1160.15.2.el7.x86_64 #1
Feb 15 14:52:01 myworker kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Feb 15 14:52:01 myworker kernel: Call Trace:
Feb 15 14:52:01 myworker kernel: [<ffffffff82581fba>] dump_stack+0x19/0x1b
Feb 15 14:52:01 myworker kernel: [<ffffffff8257c8da>] dump_header+0x90/0x229
Feb 15 14:52:01 myworker kernel: [<ffffffff8209d378>] ? ep_poll_callback+0xf8/0x220
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc1d16>] ? find_lock_task_mm+0x56/0xc0
Feb 15 14:52:01 myworker kernel: [<ffffffff8203caa8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc227d>] oom_kill_process+0x2cd/0x490
Feb 15 14:52:01 myworker kernel: [<ffffffff82040ebc>] mem_cgroup_oom_synchronize+0x55c/0x590
Feb 15 14:52:01 myworker kernel: [<ffffffff82040320>] ? mem_cgroup_charge_common+0xc0/0xc0
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc2b64>] pagefault_out_of_memory+0x14/0x90
Feb 15 14:52:01 myworker kernel: [<ffffffff8257ade6>] mm_fault_error+0x6a/0x157
Feb 15 14:52:01 myworker kernel: [<ffffffff8258f8d1>] __do_page_fault+0x491/0x500
Feb 15 14:52:01 myworker kernel: [<ffffffff8258f975>] do_page_fault+0x35/0x90
Feb 15 14:52:01 myworker kernel: [<ffffffff8258b778>] page_fault+0x28/0x30
Feb 15 14:52:01 myworker kernel: Task in /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d killed as a result of limit of /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d
Feb 15 14:52:01 myworker kernel: memory: usage 8388608kB, limit 8388608kB, failcnt 111634
Feb 15 14:52:01 myworker kernel: memory+swap: usage 8388608kB, limit 9007199254740988kB, failcnt 0
Feb 15 14:52:01 myworker kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 15 14:52:01 myworker kernel: Memory cgroup stats for /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d: cache:20KB rss:8388588KB rss_huge:6144KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8388556KB inactive_file:4KB active_file:0KB unevictable:0KB
Feb 15 14:52:01 myworker kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[root@myworker log]# head messages -n376428 | tail -n 40
Feb 15 14:52:01 myworker kernel: [115493] 1000 115493 2059 462 8 0 969 git
Feb 15 14:52:01 myworker kernel: [115497] 1000 115497 1764 350 8 0 969 git
Feb 15 14:52:01 myworker kernel: [115498] 1000 115498 24351 2784 17 0 969 git-remote-http
Feb 15 14:52:01 myworker kernel: Memory cgroup out of memory: Kill process 115496 (git fetch --tag) score 1972 or sacrifice child
Feb 15 14:52:01 myworker kernel: Killed process 115493 (git), UID 1000, total-vm:8236kB, anon-rss:296kB, file-rss:1552kB, shmem-rss:0kB
Feb 15 14:52:01 myworker containerd: time="2022-02-15T14:52:01.791126760Z" level=info msg="TaskOOM event &TaskOOM{ContainerID:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d,XXX_unrecognized:[],}"
Feb 15 14:52:01 myworker kernel: Download metada invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=969
Feb 15 14:52:01 myworker kernel: Download metada cpuset=kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d mems_allowed=0
Feb 15 14:52:01 myworker kernel: CPU: 6 PID: 115222 Comm: Download metada Kdump: loaded Tainted: G ------------ T 3.10.0-1160.15.2.el7.x86_64 #1
Feb 15 14:52:01 myworker kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Feb 15 14:52:01 myworker kernel: Call Trace:
Feb 15 14:52:01 myworker kernel: [<ffffffff82581fba>] dump_stack+0x19/0x1b
Feb 15 14:52:01 myworker kernel: [<ffffffff8257c8da>] dump_header+0x90/0x229
Feb 15 14:52:01 myworker kernel: [<ffffffff8209d378>] ? ep_poll_callback+0xf8/0x220
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc1d16>] ? find_lock_task_mm+0x56/0xc0
Feb 15 14:52:01 myworker kernel: [<ffffffff8203caa8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc227d>] oom_kill_process+0x2cd/0x490
Feb 15 14:52:01 myworker kernel: [<ffffffff82040ebc>] mem_cgroup_oom_synchronize+0x55c/0x590
Feb 15 14:52:01 myworker kernel: [<ffffffff82040320>] ? mem_cgroup_charge_common+0xc0/0xc0
Feb 15 14:52:01 myworker kernel: [<ffffffff81fc2b64>] pagefault_out_of_memory+0x14/0x90
Feb 15 14:52:01 myworker kernel: [<ffffffff8257ade6>] mm_fault_error+0x6a/0x157
Feb 15 14:52:01 myworker kernel: [<ffffffff8258f8d1>] __do_page_fault+0x491/0x500
Feb 15 14:52:01 myworker kernel: [<ffffffff8258f975>] do_page_fault+0x35/0x90
Feb 15 14:52:01 myworker kernel: [<ffffffff8258b778>] page_fault+0x28/0x30
Feb 15 14:52:01 myworker kernel: Task in /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d killed as a result of limit of /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d
Feb 15 14:52:01 myworker kernel: memory: usage 8388608kB, limit 8388608kB, failcnt 111634
Feb 15 14:52:01 myworker kernel: memory+swap: usage 8388608kB, limit 9007199254740988kB, failcnt 0
Feb 15 14:52:01 myworker kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 15 14:52:01 myworker kernel: Memory cgroup stats for /system.slice/containerd.service/kubepods-burstable-pod1840326e_dca6_4e8c_a55a_f4fb9a7c95fa.slice:cri-containerd:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d: cache:20KB rss:8388588KB rss_huge:6144KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8388556KB inactive_file:4KB active_file:0KB unevictable:0KB
Feb 15 14:52:01 myworker kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 15 14:52:01 myworker kernel: [41710] 1000 41710 285 1 4 0 969 tini
Feb 15 14:52:01 myworker kernel: [50179] 1000 50179 4344565 2100159 4662 0 969 java
Feb 15 14:52:01 myworker kernel: [115497] 1000 115497 1764 350 8 0 969 git
Feb 15 14:52:01 myworker kernel: [115498] 1000 115498 24351 2784 17 0 969 git-remote-http
Feb 15 14:52:01 myworker kernel: Memory cgroup out of memory: Kill process 110129 (Computer.thread) score 1972 or sacrifice child
Feb 15 14:52:01 myworker kernel: Killed process 50179 (java), UID 1000, total-vm:17378260kB, anon-rss:8371056kB, file-rss:29676kB, shmem-rss:0kB
Feb 15 14:52:03 myworker containerd: time="2022-02-15T14:52:03.132654815Z" level=info msg="Finish piping stderr of container \"7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d\""
Feb 15 14:52:03 myworker containerd: time="2022-02-15T14:52:03.132676088Z" level=info msg="Finish piping stdout of container \"7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d\""
Feb 15 14:52:03 myworker containerd: time="2022-02-15T14:52:03.134738144Z" level=info msg="TaskExit event &TaskExit{ContainerID:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d,ID:7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d,Pid:41710,ExitStatus:137,ExitedAt:2022-02-15 14:52:03.134458495 +0000 UTC,XXX_unrecognized:[],}"
Feb 15 14:52:03 myworker containerd: time="2022-02-15T14:52:03.248040140Z" level=info msg="shim disconnected" id=7fc11e70ccd4fd078b8d243f2710ecc1404955bf52a5cb05eb54f2917086420d
The only problem I can see is that we tell Kubernetes to allow up to 8 GB, but the memory cgroup might have a lower limit, so when usage goes somewhere beyond 5 GB the pod is killed and restarted.
What is the best way to find out the memory cgroup limit? And is there a way to know which pods/processes are using this cgroup?
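One way to check the limit directly is to read it from the cgroup filesystem. The sketch below assumes cgroup v1 (typical for a 3.10 kernel like the one in the log), where `memory.limit_in_bytes` holds the limit and `cgroup.procs` lists the PIDs in the group. The function names are placeholders for illustration, and on the host the pod's own cgroup directory has to be located first.

```python
from pathlib import Path


def read_memcg_limit(cgroup_dir: str) -> int:
    """Return the cgroup v1 memory limit in bytes for the given cgroup directory."""
    return int(Path(cgroup_dir, "memory.limit_in_bytes").read_text().strip())


def read_memcg_pids(cgroup_dir: str) -> list:
    """Return the PIDs that belong to the cgroup (cgroup.procs has one PID per line)."""
    text = Path(cgroup_dir, "cgroup.procs").read_text()
    return [int(pid) for pid in text.split()]


# Assumption: inside a container, this path usually shows the container's own cgroup.
demo_dir = Path("/sys/fs/cgroup/memory")
if (demo_dir / "memory.limit_in_bytes").exists():
    print(read_memcg_limit(str(demo_dir)) / 1024**3, "GiB")
    print(read_memcg_pids(str(demo_dir)))
```

The guard around the demo keeps the script harmless on machines that use cgroup v2, where the files have different names.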
Questions:
Q1: What kind of cluster do you use? Minikube, kubeadm or managed by cloud GKE, EKS, AKS?
A1: kubeadm
Q2: Which version of kubernetes do you use?
A2: v1.21.3
Q3: When did the Jenkins pod restarts begin?
A3: The issue may have been there from the beginning, but we started noticing it recently when we moved more jobs to the Kubernetes-based Jenkins.
Q4: Could you paste the output of kubectl describe pod for the Jenkins pod?
A4:
# kubectl describe pod -n jenkins jenkins-jenkins-instance
Name: jenkins-jenkins-instance
Namespace: jenkins
Priority: 0
Node: myworker/192.168.X.X
Start Time: Sun, 13 Mar 2022 15:12:19 +0000
Labels: app=jenkins-operator
jenkins-cr=jenkins-instance
Annotations: <none>
Status: Running
IP: 192.168.113.152
IPs:
IP: 192.168.113.152
Controlled By: Jenkins/jenkins-instance
Containers:
jenkins-master:
Container ID: containerd://70e68b7b069404f825b53e9d8f0dac22c595074e5bdc4659cae5248e25af8e00
Image: jenkins/jenkins:lts
Image ID: docker.io/jenkins/jenkins@sha256:b414f82151b865d3efd49ec27a944f624188d09fec58700cddfbe6bae2450f77
Ports: 8080/TCP, 50000/TCP
Host Ports: 0/TCP, 0/TCP
Command:
bash
-c
/var/jenkins/scripts/init.sh && exec /sbin/tini -s -- /usr/local/bin/jenkins.sh
State: Running
Started: Sun, 13 Mar 2022 15:12:20 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 3500m
memory: 8Gi
Requests:
cpu: 1
memory: 4Gi
Liveness: http-get http://:http/login delay=100s timeout=5s period=10s #success=1 #failure=12
Readiness: http-get http://:http/login delay=80s timeout=1s period=10s #success=1 #failure=10
Environment:
COPY_REFERENCE_FILE_LOG: /var/lib/jenkins/copy_reference_file.log
NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME: IAD.Prod
NEW_RELIC_METADATA_KUBERNETES_NODE_NAME: (v1:spec.nodeName)
NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME: jenkins (v1:metadata.namespace)
NEW_RELIC_METADATA_KUBERNETES_POD_NAME: jenkins-jenkins-instance (v1:metadata.name)
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME: master
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME: jenkins/jenkins:lts
JAVA_OPTS: -XX:MinRAMPercentage=50.0 -XX:MaxRAMPercentage=80.0 -Djenkins.install.runSetupWizard=false -Djava.awt.headless=true
JENKINS_HOME: /var/lib/jenkins
Mounts:
/var/jenkins/init-configuration from init-configuration (ro)
/var/jenkins/operator-credentials from operator-credentials (ro)
/var/jenkins/scripts from scripts (ro)
/var/lib/jenkins from jenkins-home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fc57k (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
jenkins-home:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: jenkins-operator-scripts-jenkins-instance
Optional: false
init-configuration:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: jenkins-operator-init-configuration-jenkins-instance
Optional: false
operator-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: jenkins-operator-credentials-jenkins-instance
Optional: false
kube-api-access-fc57k:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Q5: How do you deploy Jenkins?
A5: We are using Jenkins-operator to deploy jenkins.
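To relate the describe output to the kernel log, here is a rough back-of-the-envelope calculation. It assumes a 4 KiB page size (the usual x86_64 default, not stated in the question) and that the JVM derives its maximum heap from the 8Gi container limit via `-XX:MaxRAMPercentage=80.0`. Treat the result as an estimate, not a diagnosis.

```python
GIB = 1024**3
PAGE_SIZE = 4096  # assumption: 4 KiB pages

container_limit = 8 * GIB     # memory limit from the pod spec
max_ram_percentage = 80.0     # from JAVA_OPTS in the describe output
java_rss_pages = 2100159      # rss column for PID 50179 in the OOM task list

max_heap = container_limit * max_ram_percentage / 100
java_rss = java_rss_pages * PAGE_SIZE

print(f"max heap ~ {max_heap / GIB:.2f} GiB")
print(f"java RSS ~ {java_rss / GIB:.2f} GiB")
# RSS minus the heap ceiling is a lower bound on memory held outside the heap.
print(f"non-heap ~ {(java_rss - max_heap) / GIB:.2f} GiB (at least)")
```

If these assumptions hold, the heap alone may grow to about 6.4 GiB while the process was resident at about 8 GiB when it was killed, so part of the usage would come from outside the heap (metaspace, thread stacks, direct buffers and so on), which heap-only monitoring may not show.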

Related

The server froze unexpectedly [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
The server crashed. While analyzing the logs I found these memory messages.
Could they be related to the server freezing?
[JovaTricolo@xxx log]$ grep -R "Feb 11" messages | egrep -v "audit" | egrep -r "Warning|warning|error|erro|panic|boot|memory"
Feb 11 11:19:56 xxx rsyslogd-2036: error starting up disk queue, using pure in-memory mode [try http://www.rsyslog.com/e/2036 ]
Feb 11 11:19:56 xxx kernel: init_memory_mapping: 0000000000000000-00000000bd2f0000
Feb 11 11:19:56 xxx kernel: init_memory_mapping: 0000000100000000-0000002040000000
Feb 11 11:19:56 xxx kernel: bootmap [0000000000100000 - 0000000000307fff] pages 208
Feb 11 11:19:56 xxx kernel: (8 early reservations) ==> bootmem [0000000000 - 1040000000]
Feb 11 11:19:56 xxx kernel: bootmap [0000001040035000 - 0000001040234fff] pages 200
Feb 11 11:19:56 xxx kernel: (8 early reservations) ==> bootmem [1040000000 - 2040000000]
Feb 11 11:19:56 xxx kernel: Reserving 137MB of memory at 48MB for crashkernel (System RAM: 132096MB)
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 000000000009c000 - 0000000000100000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000bd2f0000 - 00000000bd31c000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000bd31c000 - 00000000bd35b000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000bd35b000 - 00000000c0000000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000c0000000 - 00000000e0000000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000e0000000 - 00000000f0000000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000f0000000 - 00000000fe000000
Feb 11 11:19:56 xxx kernel: PM: Registered nosave memory: 00000000fe000000 - 0000000100000000
Feb 11 11:19:56 xxx kernel: please try 'cgroup_disable=memory' option if you don't want memory cgroups
Feb 11 11:19:56 xxx kernel: Initializing cgroup subsys memory
Feb 11 11:19:56 xxx kernel: Freeing initrd memory: 15695k freed
Feb 11 11:19:56 xxx kernel: Non-volatile memory driver v1.3
Feb 11 11:19:56 xxx kernel: crash memory driver: version 1.1
Feb 11 11:19:56 xxx kernel: Freeing unused kernel memory: 1252k freed
Feb 11 11:19:56 xxx kernel: Freeing unused kernel memory: 1051k freed
Feb 11 11:19:56 xxx kernel: Freeing unused kernel memory: 1734k freed
Feb 11 11:19:56 xxx kernel: EXT4-fs (dm-13): warning: checktime reached, running e2fsck is recommended
Feb 11 11:19:56 xxx kernel: EXT4-fs (dm-14): warning: checktime reached, running e2fsck is recommended
Feb 11 11:19:56 xxx kernel: EXT4-fs (dm-15): warning: checktime reached, running e2fsck is recommended
Feb 11 11:19:56 xxx kernel: EXT4-fs (dm-16): warning: maximal mount count reached, running e2fsck is recommended
Feb 11 11:50:05 xxx snmpd[24795]: Warning: no access control information configured.#012 It's unlikely this agent can serve any useful purpose in this state.#012 Run "snmpconf -g basic_setup" to help you configure the snmpd.conf file for this agent.
[JovaTricolo@xxx log]$
Is there any solution for this?

Pop OS / Dell XPS 9310 -- battery drained overnight on suspend

My laptop is suspending on lid close successfully, but if I don't have it plugged in overnight, the battery is drained by the morning.
I'm including logs from a short suspend I ran just now. I can suspend it overnight and look at the logs afterward, but is there anything immediately suspicious here? I validated that all suspend-related targets are loaded via sudo systemctl status sleep.target suspend.target hibernate.target hybrid-sleep.target
Apr 11 22:09:29 pop-os systemd[1]: Reached target Sleep.
Apr 11 22:09:29 pop-os systemd[1]: Starting Suspend...
Apr 11 22:09:29 pop-os kernel: [ 44.986190] PM: suspend entry (s2idle)
Apr 11 22:09:29 pop-os systemd-sleep[3730]: Suspending system...
Apr 11 22:09:29 pop-os kernel: [ 44.991600] Filesystems sync: 0.005 seconds
Apr 11 22:09:57 pop-os kernel: [ 44.994638] Freezing user space processes ... (elapsed 0.002 seconds) done.
Apr 11 22:09:57 pop-os kernel: [ 44.996920] OOM killer disabled.
Apr 11 22:09:57 pop-os kernel: [ 44.996921] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Apr 11 22:09:57 pop-os kernel: [ 44.998055] printk: Suspending console(s) (use no_console_suspend to debug)
Apr 11 22:09:57 pop-os kernel: [ 45.315954] psmouse serio1: Failed to disable mouse on isa0060/serio1
Apr 11 22:09:57 pop-os kernel: [ 46.377203] ACPI: EC: interrupt blocked
Apr 11 22:09:57 pop-os kernel: [ 72.605807] ACPI: EC: interrupt unblocked
Apr 11 22:09:57 pop-os kernel: [ 73.107660] pcieport 10000:e0:06.0: can't derive routing for PCI INT A
Apr 11 22:09:57 pop-os kernel: [ 73.107666] nvme 10000:e1:00.0: PCI INT A: no GSI
Apr 11 22:09:57 pop-os kernel: [ 73.114494] nvme nvme0: 8/0/0 default/read/poll queues
Apr 11 22:09:57 pop-os kernel: [ 73.363725] OOM killer enabled.
Apr 11 22:09:57 pop-os kernel: [ 73.363728] Restarting tasks ...
Apr 11 22:09:57 pop-os kernel: [ 73.364154] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
Apr 11 22:09:57 pop-os kernel: [ 73.367166] done.
Apr 11 22:09:57 pop-os touchegg[1000]: libinput error: event0 - Lid Switch: client bug: event processing lagging behind by 1279ms, your system is too slow
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): EDID vendor "SHP", prod id 5370
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Printing DDC gathered Modelines:
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Modeline "3840x2400"x0.0 592.50 3840 3888 3920 4000 2400 2403 2409 2469 -hsync -vsync (148.1 kHz eP)
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Modeline "3840x2400"x0.0 474.00 3840 3888 3920 4000 2400 2403 2409 2469 -hsync -vsync (118.5 kHz e)
Apr 11 22:09:57 pop-os systemd-sleep[3730]: System resumed.
Apr 11 22:09:57 pop-os bluetoothd[961]: Controller resume with wake event 0x0
Apr 11 22:09:57 pop-os kernel: [ 73.413202] PM: suspend exit
Apr 11 22:09:57 pop-os systemd[1]: systemd-suspend.service: Succeeded.
Apr 11 22:09:57 pop-os systemd[1]: Finished Suspend.
Apr 11 22:09:57 pop-os systemd[1]: Stopped target Sleep.
Apr 11 22:09:57 pop-os systemd[1]: Reached target Suspend.
Apr 11 22:09:57 pop-os systemd[1]: Stopped target Suspend.
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.3461] manager: sleep: wake requested (sleeping: yes enabled: yes)
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.3461] device (wlp113s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Apr 11 22:09:57 pop-os ModemManager[1079]: <info> [sleep-monitor] system is resuming
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.4258] manager: NetworkManager state is now DISCONNECTED
The hardware on this system only supports s2idle sleep, and not deep sleep for less energy consumption (details on different sleep states here https://www.kernel.org/doc/Documentation/power/states.txt).
pop-os:$~ sudo cat /sys/power/mem_sleep
[s2idle]
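The kernel marks the currently selected suspend variant in `/sys/power/mem_sleep` with square brackets, so a file containing `s2idle [deep]` would mean that both variants are supported and deep is active. A small sketch of reading that format (the function name is invented for this example):

```python
def parse_mem_sleep(content: str):
    """Return (supported_modes, active_mode) from /sys/power/mem_sleep content."""
    supported, active = [], None
    for token in content.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets that mark the active mode
            active = token
        supported.append(token)
    return supported, active


print(parse_mem_sleep("[s2idle]"))       # the output shown above
print(parse_mem_sleep("s2idle [deep]"))  # hypothetical machine that also offers deep
```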
I found this thread: https://www.dell.com/community/XPS/XPS-13-9310-Ubuntu-deep-sleep-missing/td-p/7734008 It suggests changing the disk management from RAID (Dell's default) to AHCI via the Dell BIOS.
So far this has worked as a solution! I've lost only 10% battery overnight, and can go 3 days idling in suspend without a charge.
(Before this, I did try enabling hibernate through these instructions from System76 https://support.system76.com/articles/enable-hibernation/. This does not work great, because the Killer wifi driver does not load on wake from hibernate.)
With hybrid suspend, the machine's state is written to swap space and then suspend-to-RAM (aka sleep) is invoked. This keeps power use minimal.
The reason for doing so: waking up from hibernate is slower than waking up from sleep. To make sure the system state is not lost, it is stored in swap space, and then sleep is invoked, which uses minimal power and does not shut the machine off. The state also stays in RAM, so if the battery does not die, wake-up happens from RAM, which is faster.
Read More : https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate
If you do not want your battery to drain or die, switch your lid-close action from sleep/suspend to hibernate. Hibernate has zero power consumption. Follow the steps below.
$ grep HandleLidSwitch /etc/systemd/logind.conf
HandleLidSwitch=suspend
If the line is commented out, uncomment it by removing the "#" and change the option to hibernate.
HandleLidSwitch=hibernate
If you are new to Linux, please use gedit command to edit the file.
sudo gedit /etc/systemd/logind.conf
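The same edit can also be scripted. The sketch below works on the file's text only and leaves reading and writing `/etc/systemd/logind.conf` (which needs root) to the caller; the function name and the sample text are assumptions for illustration.

```python
import re


def set_lid_switch(conf_text: str, action: str = "hibernate") -> str:
    """Return conf_text with HandleLidSwitch set to `action`.

    An existing (possibly commented-out) HandleLidSwitch line is replaced;
    if there is none, the setting is appended at the end of the text.
    """
    # Matches "HandleLidSwitch=" only, not HandleLidSwitchDocked= and similar keys.
    pattern = re.compile(r"^#?[ \t]*HandleLidSwitch=.*$", re.MULTILINE)
    new_line = f"HandleLidSwitch={action}"
    if pattern.search(conf_text):
        return pattern.sub(new_line, conf_text, count=1)
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


sample = "[Login]\n#HandleLidSwitch=suspend\n#HandleLidSwitchDocked=ignore\n"
print(set_lid_switch(sample))
```

The change takes effect only once systemd-logind re-reads its configuration, for example after a reboot.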

Kafka broker crash every day - OOM killer

I have a cluster of 3 Kafka brokers, version 0.10.2.1. Each broker has its own host with 2 CPUs / 16G RAM. In addition, we use Docker to wrap the broker process.
The problem is as follows:
Almost every day at the same time, all of our Kafka clients fail for 10 minutes.
At first I thought it was related to "Kafka No broker in ISR for partition".
But after a while I discovered that the broker simply crashes due to the OOM killer.
I also played with Xmx and Xms before I discovered that it was the OOM killer. I tried:
-Xmx2048M -Xms2048M
-Xmx4096M -Xms2048M
The behavior was the same for both.
In addition, we currently don't have a ulimit set:
>> ulimit
unlimited
less kern.log
LOGS:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761019] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761022] run-parts cpuset=/ mems_allowed=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761026] CPU: 1 PID: 12266 Comm: run-parts Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761027] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761029] 0000000000000286 000000004811d7da ffff880036967af0 ffffffff813f7583
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761031] ffff880036967cc8 ffff880439f2f000 ffff880036967b60 ffffffff8120ad5e
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761033] ffffffff81cd2dc7 0000000000000000 ffffffff81e67760 0000000000000206
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761036] Call Trace:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761041] [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761044] [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761048] [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761049] [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761052] [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761054] [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761056] [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761060] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761063] [<ffffffff81391bcc>] ? apparmor_file_alloc_security+0x5c/0x220
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761066] [<ffffffff811ed05a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761070] [<ffffffff81347bd3>] ? security_file_alloc+0x33/0x50
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761073] [<ffffffff810caf11>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761074] [<ffffffff810805a0>] _do_fork+0x80/0x360
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761076] [<ffffffff81080929>] SyS_clone+0x19/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761080] [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
And ....
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.954463] Out of memory: Kill process 16123 (java) score 134 or sacrifice child
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] Killed process 16123 (java) total-vm:11977548kB, anon-rss:2035780kB, file-rss:67848kB
Any suggestions on how to approach this?
We found the problem.
First, adding more RAM to the machine also solved the problem, but that is an "expensive solution".
The problem was as follows:
Since I was working with the EC2 Ubuntu distribution, the daily crontabs ran on every machine in my cluster at exactly the same time. One of the scripts was mlocate, which apparently took too many resources.
I assume that since the whole Kafka cluster had some issues with I/O and memory, the brokers tried to use more memory and the OOM killer then killed them.
When 2 of my 3 brokers were down, some services were down.
So the solution was:
Change the crontab to run at a different hour of the day on each broker.
Disable mlocate
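The first fix, running the daily jobs at a different hour on each broker, could look roughly like this. The sketch assumes the stock Ubuntu `/etc/crontab` entry that starts `cron.daily` at 06:25 (which matches the timestamps in the log) and simply shifts the hour by the broker's index. The broker numbering and the one-hour spacing are arbitrary choices for illustration.

```python
def staggered_daily_entry(broker_index: int, base_hour: int = 6, minute: int = 25) -> str:
    """Build an /etc/crontab line that runs cron.daily at a per-broker hour."""
    hour = (base_hour + broker_index) % 24  # wrap around midnight
    return (
        f"{minute} {hour} * * * root "
        "test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )"
    )


for index in range(3):  # three brokers, as in the question
    print(f"broker {index}: {staggered_daily_entry(index)}")
```

With this spacing no two brokers run their daily jobs in the same hour, so a heavy job such as mlocate would hit only one broker at a time.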
I faced the same issue too; the resources below helped me out:
https://docs.confluent.io/current/kafka/deployment.html
How to decide Kafka Cluster size
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
And please make sure that the swap is enabled on all the brokers.

INFO: task nginx:22992 blocked for more than 120 seconds

I'm running an Ubuntu VM on Azure.
2 days ago my server was down. I found this inside my syslog:
Dec 11 06:45:28 myservice kernel: [4525694.437314] INFO: task nginx:22992 blocked for more than 120 seconds.
Dec 11 06:45:28 myservice kernel: [4525694.442895] Not tainted 3.16.0-29-generic #39-Ubuntu
Dec 11 06:45:28 myservice kernel: [4525694.447905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 06:45:28 myservice kernel: [4525694.453525] nginx D ffff8801bb633840 0 22992 22990 0x00000000
Dec 11 06:45:28 myservice kernel: [4525694.453531] ffff8801a0a7bd60 0000000000000082 ffff8801a0ebf010 0000000000013840
Dec 11 06:45:28 myservice kernel: [4525694.453534] ffff8801a0a7bfd8 0000000000013840 ffff8801a0ebf010 ffff8801b88d8d10
Dec 11 06:45:28 myservice kernel: [4525694.453536] ffff8801b88d8d14 ffff8801a0ebf010 00000000ffffffff ffff8801b88d8d18
Dec 11 06:45:28 myservice kernel: [4525694.453539] Call Trace:
Dec 11 06:45:28 myservice kernel: [4525694.453547] [<ffffffff817858a9>] schedule_preempt_disabled+0x29/0x70
Dec 11 06:45:28 myservice kernel: [4525694.453551] [<ffffffff81787e45>] __mutex_lock_slowpath+0xd5/0x1f0
Dec 11 06:45:28 myservice kernel: [4525694.453562] [<ffffffff81787f7f>] mutex_lock+0x1f/0x30
Dec 11 06:45:28 myservice kernel: [4525694.453580] [<ffffffffc048fe90>] cifs_strict_writev+0xf0/0x250 [cifs]
Dec 11 06:45:28 myservice kernel: [4525694.453585] [<ffffffff811e0991>] new_sync_write+0x81/0xb0
Dec 11 06:45:28 myservice kernel: [4525694.453588] [<ffffffff811e1177>] vfs_write+0xb7/0x1f0
Dec 11 06:45:28 myservice kernel: [4525694.453592] [<ffffffff811ffdcb>] ? set_close_on_exec+0x4b/0x60
Dec 11 06:45:28 myservice kernel: [4525694.453595] [<ffffffff811e1d26>] SyS_write+0x46/0xb0
Dec 11 06:45:28 myservice kernel: [4525694.453598] [<ffffffff8178a1ad>] system_call_fastpath+0x1a/0x1f
"Google" told me it probably has something to do with high disk I/O rates. But my Azure monitoring showed very low disk read/write values in the problematic time range, and also low CPU and low memory usage.
Another guess for this issue was faulty hardware.
How can I check if this really was the reason - and if it was: how can I solve this problem when my VM is in the cloud? Migrate to a new VM ?!
I also have a very old nginx version which I want to update - but I don't think this is the reason for this issue, is it?

docker exec: rpc error: code = 2 desc = oci runtime error: exec failed

Every time I try to run:
$ docker exec
I get the error message:
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 16\""
Session 1 (works as expected):
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
alpine latest baa5d63471ea 7 weeks ago 4.8 MB
hello-world latest c54a2cc56cbb 5 months ago 1.85 kB
$ docker run --rm --name alpine -it alpine sh
/ # pwd
/
Session 2:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7bd39b37aee2 alpine "sh" 22 seconds ago Up 21 seconds alpine
$ docker exec -it alpine sh
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 16\""
$ docker exec -it 7bd39b37aee2 sh
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:83: executing setns process caused \"exit status 16\""
/var/log/syslog shows some warnings, but I was able neither to understand the root cause nor to find matching answers.
Thanks for any hint.
= = = = = = = = = = = = = = = = = = = = = = = = =
$ docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 2
Server Version: 1.13.0-rc3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 4
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-53-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.487 GiB
Name: pb7tt6ts
ID: YQ4G:ETTP:5VCM:PAJD:F3KB:O7JN:AZOF:VLTI:SKH4:BTSR:KP7D:NXIZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
= = =
/var/log/syslog docker restart and steps above
= = =
Dec 13 14:28:09 pb7tt6ts systemd[1]: Stopping Docker Socket for the API.
Dec 13 14:28:09 pb7tt6ts systemd[1]: Starting Docker Socket for the API.
Dec 13 14:28:09 pb7tt6ts systemd[1]: Listening on Docker Socket for the API.
Dec 13 14:28:09 pb7tt6ts systemd[1]: Starting Docker Application Container Engine...
Dec 13 14:28:09 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:09.291301057+01:00" level=info msg="libcontainerd: new containerd process, pid: 1448"
Dec 13 14:28:10 pb7tt6ts kernel: [25908.125394] audit: type=1400 audit(1481635690.357:28): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="docker-default" pid=1466 comm="apparmor_parser"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.370364923+01:00" level=info msg="[graphdriver] using prior storage driver: aufs"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.387915069+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.388367650+01:00" level=warning msg="Your kernel does not support swap memory limit."
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.388465142+01:00" level=warning msg="Your kernel does not support cgroup rt period"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.388508739+01:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.389419384+01:00" level=info msg="Loading containers: start."
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.397339748+01:00" level=info msg="Firewalld running: false"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.628011070+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.743703578+01:00" level=info msg="Loading containers: done."
Dec 13 14:28:10 pb7tt6ts kernel: [25908.510718] aufs au_opts_verify:1597:dockerd[1462]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.808510166+01:00" level=info msg="Daemon has completed initialization"
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.808575966+01:00" level=info msg="Docker daemon" commit=4d92237 graphdriver=aufs version=1.13.0-rc3
Dec 13 14:28:10 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:10.820562161+01:00" level=info msg="API listen on /var/run/docker.sock"
Dec 13 14:28:10 pb7tt6ts systemd[1]: Started Docker Application Container Engine.
Dec 13 14:28:10 pb7tt6ts console-kit-daemon[3106]: console-kit-daemon[3106]: GLib-CRITICAL: Source ID 226 was not found when attempting to remove it
Dec 13 14:28:10 pb7tt6ts console-kit-daemon[3106]: GLib-CRITICAL: Source ID 226 was not found when attempting to remove it
Dec 13 14:28:16 pb7tt6ts kernel: [25914.206672] aufs au_opts_verify:1597:dockerd[1460]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 14:28:16 pb7tt6ts kernel: [25914.388393] aufs au_opts_verify:1597:dockerd[1460]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 14:28:16 pb7tt6ts kernel: [25914.492197] aufs au_opts_verify:1597:dockerd[1460]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <warn> [1481635696.7320] device (vethff6f844): failed to find device 35 'vethff6f844' with udev
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7340] manager: (vethff6f844): new Veth device (/org/freedesktop/NetworkManager/Devices/46)
Dec 13 14:28:16 pb7tt6ts systemd-udevd[1614]: Could not generate persistent MAC address for vethff6f844: No such file or directory
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <warn> [1481635696.7345] device (veth13c2a1d): failed to find device 36 'veth13c2a1d' with udev
Dec 13 14:28:16 pb7tt6ts systemd-udevd[1615]: Could not generate persistent MAC address for veth13c2a1d: No such file or directory
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7417] manager: (veth13c2a1d): new Veth device (/org/freedesktop/NetworkManager/Devices/47)
Dec 13 14:28:16 pb7tt6ts kernel: [25914.509027] device veth13c2a1d entered promiscuous mode
Dec 13 14:28:16 pb7tt6ts kernel: [25914.509240] IPv6: ADDRCONF(NETDEV_UP): veth13c2a1d: link is not ready
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7632] devices added (path: /sys/devices/virtual/net/vethff6f844, iface: vethff6f844)
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7632] device added (path: /sys/devices/virtual/net/vethff6f844, iface: vethff6f844): no ifupdown configuration found.
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7639] devices added (path: /sys/devices/virtual/net/veth13c2a1d, iface: veth13c2a1d)
Dec 13 14:28:16 pb7tt6ts NetworkManager[1343]: <info> [1481635696.7640] device added (path: /sys/devices/virtual/net/veth13c2a1d, iface: veth13c2a1d): no ifupdown configuration found.
Dec 13 14:28:16 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:16.965015836+01:00" level=warning msg="Your kernel does not support swap memory limit."
Dec 13 14:28:16 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:16.965090775+01:00" level=warning msg="Your kernel does not support cgroup rt period"
Dec 13 14:28:16 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:16.965117179+01:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Dec 13 14:28:17 pb7tt6ts kernel: [25914.808163] eth0: renamed from vethff6f844
Dec 13 14:28:17 pb7tt6ts acvpnagent[2339]: Function: tableCallbackHandler File: RouteMgr.cpp Line: 1723 Invoked Function: recv Return Code: 11 (0x0000000B) Description: unknown
Dec 13 14:28:17 pb7tt6ts NetworkManager[1343]: <info> [1481635697.0599] devices removed (path: /sys/devices/virtual/net/vethff6f844, iface: vethff6f844)
Dec 13 14:28:17 pb7tt6ts acvpnagent[2339]: A new network interface has been detected.
Dec 13 14:28:17 pb7tt6ts NetworkManager[1343]: <info> [1481635697.0600] device (vethff6f844): driver 'veth' does not support carrier detection.
Dec 13 14:28:17 pb7tt6ts acvpnagent[2339]: Function: logInterfaces File: RouteMgr.cpp Line: 2105 Invoked Function: logInterfaces Return Code: 0 (0x00000000) Description: IP Address Interface List: 192.168.178.24 172.17.0.1 9.145.68.34 FE80:0:0:0:D8B4:C1E0:F8E4:DB77 FE80:0:0:0:42:44FF:FEC9:5D85 FE80:0:0:0:60A9:A1FF:FEED:F31C
Dec 13 14:28:17 pb7tt6ts NetworkManager[1343]: <info> [1481635697.0604] device (veth13c2a1d): link connected
Dec 13 14:28:17 pb7tt6ts NetworkManager[1343]: <info> [1481635697.0605] device (docker0): link connected
Dec 13 14:28:17 pb7tt6ts kernel: [25914.823988] IPv6: ADDRCONF(NETDEV_CHANGE): veth13c2a1d: link becomes ready
Dec 13 14:28:17 pb7tt6ts kernel: [25914.824039] docker0: port 1(veth13c2a1d) entered forwarding state
Dec 13 14:28:17 pb7tt6ts kernel: [25914.824061] docker0: port 1(veth13c2a1d) entered forwarding state
Dec 13 14:28:18 pb7tt6ts acvpnagent[2339]: Function: tableCallbackHandler File: RouteMgr.cpp Line: 1723 Invoked Function: recv Return Code: 11 (0x0000000B) Description: unknown
Dec 13 14:28:18 pb7tt6ts avahi-daemon[1217]: Joining mDNS multicast group on interface veth13c2a1d.IPv6 with address fe80::60a9:a1ff:feed:f31c.
Dec 13 14:28:18 pb7tt6ts avahi-daemon[1217]: New relevant interface veth13c2a1d.IPv6 for mDNS.
Dec 13 14:28:18 pb7tt6ts avahi-daemon[1217]: Registering new address record for fe80::60a9:a1ff:feed:f31c on veth13c2a1d.*.
Dec 13 14:28:32 pb7tt6ts kernel: [25929.850840] docker0: port 1(veth13c2a1d) entered forwarding state
Dec 13 14:28:36 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:36.704565159+01:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 16\\\"\"\n"
Dec 13 14:28:36 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:36.705362948+01:00" level=error msg="Handler for POST /v1.25/exec/8a78f29ef71d4c3ab982a8dd7a4a325e280766072dea7337860874a72c42f42c/resize returned error: rpc error: code = 2 desc = containerd: process not found for container"
Dec 13 14:28:46 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:46.921880770+01:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 16\\\"\"\n"
Dec 13 14:28:46 pb7tt6ts dockerd[1436]: time="2016-12-13T14:28:46.922576933+01:00" level=error msg="Handler for POST /v1.25/exec/5ad25668cac553118b8c702f02c69b427436eb67d1488d4170641bcacfdad50b/resize returned error: rpc error: code = 2 desc = containerd: process not found for container"
As recommended, I reverted to a stable release of Docker and installed docker-engine 1.12.4:
$ docker info
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 3
Server Version: 1.12.4
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 11
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-53-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.487 GiB
Name: pb7tt6ts
ID: YQ4G:ETTP:5VCM:PAJD:F3KB:O7JN:AZOF:VLTI:SKH4:BTSR:KP7D:NXIZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Still no success, but the error is different:
$ docker exec -it alpine sh
rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 17\\\"\"\n"
Corresponding /var/log/syslog, covering service docker start (21:00), docker run ... (21:01), and docker exec ... (21:02):
Dec 13 21:00:01 pb7tt6ts systemd[1]: Starting Docker Socket for the API.
Dec 13 21:00:01 pb7tt6ts systemd[1]: Listening on Docker Socket for the API.
Dec 13 21:00:01 pb7tt6ts systemd[1]: Starting Docker Application Container Engine...
Dec 13 21:00:01 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:01.468921183+01:00" level=info msg="libcontainerd: new containerd process, pid: 8686"
Dec 13 21:00:02 pb7tt6ts kernel: [49419.124965] audit: type=1400 audit(1481659202.536:37): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="docker-default" pid=8700 comm="apparmor_parser"
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.550070413+01:00" level=info msg="[graphdriver] using prior storage driver \"aufs\""
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.572067603+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.572336166+01:00" level=warning msg="Your kernel does not support swap memory limit."
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.572799562+01:00" level=info msg="Loading containers: start."
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.579465999+01:00" level=info msg="Firewalld running: false"
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.779165187+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.903085523+01:00" level=info msg="Loading containers: done."
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.903179108+01:00" level=info msg="Daemon has completed initialization"
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.903208197+01:00" level=info msg="Docker daemon" commit=1564f02 graphdriver=aufs version=1.12.4
Dec 13 21:00:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:00:02.923282443+01:00" level=info msg="API listen on /var/run/docker.sock"
Dec 13 21:00:02 pb7tt6ts systemd[1]: Started Docker Application Container Engine.
Dec 13 21:01:01 pb7tt6ts kernel: [49477.834789] aufs au_opts_verify:1597:dockerd[8692]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 21:01:01 pb7tt6ts kernel: [49477.896566] aufs au_opts_verify:1597:dockerd[8692]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 21:01:01 pb7tt6ts kernel: [49478.080340] aufs au_opts_verify:1597:dockerd[8692]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 21:01:01 pb7tt6ts kernel: [49478.192100] aufs au_opts_verify:1597:dockerd[8682]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <warn> [1481659261.6125] device (veth2b5b07c): failed to find device 47 'veth2b5b07c' with udev
Dec 13 21:01:01 pb7tt6ts systemd-udevd[8810]: Could not generate persistent MAC address for vethc2e4873: No such file or directory
Dec 13 21:01:01 pb7tt6ts kernel: [49478.196917] device vethc2e4873 entered promiscuous mode
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6215] manager: (veth2b5b07c): new Veth device (/org/freedesktop/NetworkManager/Devices/63)
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <warn> [1481659261.6222] device (vethc2e4873): failed to find device 48 'vethc2e4873' with udev
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6241] manager: (vethc2e4873): new Veth device (/org/freedesktop/NetworkManager/Devices/64)
Dec 13 21:01:01 pb7tt6ts systemd-udevd[8809]: Could not generate persistent MAC address for veth2b5b07c: No such file or directory
Dec 13 21:01:01 pb7tt6ts kernel: [49478.211913] IPv6: ADDRCONF(NETDEV_UP): vethc2e4873: link is not ready
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6454] devices added (path: /sys/devices/virtual/net/veth2b5b07c, iface: veth2b5b07c)
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6454] device added (path: /sys/devices/virtual/net/veth2b5b07c, iface: veth2b5b07c): no ifupdown configuration found.
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6507] devices added (path: /sys/devices/virtual/net/vethc2e4873, iface: vethc2e4873)
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.6507] device added (path: /sys/devices/virtual/net/vethc2e4873, iface: vethc2e4873): no ifupdown configuration found.
Dec 13 21:01:01 pb7tt6ts kernel: [49478.557310] eth0: renamed from veth2b5b07c
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.9915] devices removed (path: /sys/devices/virtual/net/veth2b5b07c, iface: veth2b5b07c)
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.9916] device (veth2b5b07c): driver 'veth' does not support carrier detection.
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.9919] device (vethc2e4873): link connected
Dec 13 21:01:01 pb7tt6ts NetworkManager[1343]: <info> [1481659261.9937] device (docker0): link connected
Dec 13 21:01:01 pb7tt6ts kernel: [49478.573434] IPv6: ADDRCONF(NETDEV_CHANGE): vethc2e4873: link becomes ready
Dec 13 21:01:01 pb7tt6ts kernel: [49478.573503] docker0: port 1(vethc2e4873) entered forwarding state
Dec 13 21:01:01 pb7tt6ts kernel: [49478.573527] docker0: port 1(vethc2e4873) entered forwarding state
Dec 13 21:01:03 pb7tt6ts avahi-daemon[1217]: Joining mDNS multicast group on interface vethc2e4873.IPv6 with address fe80::d02a:ecff:fea8:662c.
Dec 13 21:01:03 pb7tt6ts avahi-daemon[1217]: New relevant interface vethc2e4873.IPv6 for mDNS.
Dec 13 21:01:03 pb7tt6ts avahi-daemon[1217]: Registering new address record for fe80::d02a:ecff:fea8:662c on vethc2e4873.*.
Dec 13 21:01:17 pb7tt6ts kernel: [49493.628038] docker0: port 1(vethc2e4873) entered forwarding state
Dec 13 21:02:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:02:02.072027206+01:00" level=error msg="Error running exec in container: rpc error: code = 13 desc = invalid header field value \"oci runtime error: exec failed: container_linux.go:247: starting container process caused \\\"process_linux.go:83: executing setns process caused \\\\\\\"exit status 17\\\\\\\"\\\"\\n\""
Dec 13 21:02:02 pb7tt6ts dockerd[8675]: time="2016-12-13T21:02:02.072759152+01:00" level=error msg="Handler for POST /v1.24/exec/00c0dcac7a178129a17cd9eb833d154d428f2a6efbcd0f421ab3c5c54e52a236/resize returned error: rpc error: code = 2 desc = containerd: process not found for container"
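In a syslog this noisy, it can help to reduce the output to just the failing exec attempts and their setns exit status. The following is a minimal sketch, assuming the log lines keep the same shape as the ones above; the function name setns_failures is hypothetical and not part of Docker.

```shell
# Hypothetical helper (not part of Docker): reads syslog text on stdin and
# prints "<timestamp> exit status <N>" for every line that reports a failed
# setns during docker exec.
setns_failures() {
  # Keep only the lines that mention the setns failure ...
  grep 'executing setns process caused' |
    # ... then keep the leading "Mon DD HH:MM:SS" timestamp and the numeric status.
    sed -n 's/^\([A-Z][a-z]\{2\} [ 0-9][0-9] [0-9:]\{8\}\).*exit status \([0-9][0-9]*\).*$/\1 exit status \2/p'
}

# Example (assumes the log lives at /var/log/syslog, as in the question):
#   setns_failures < /var/log/syslog
```

This only summarizes what dockerd already logged; it does not explain what the individual exit status values mean.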
This comment from the linked issue appears to identify the root cause:
I think I found the root reason. It's nothing to do with Docker.
Actually docker exec always fails because of Symantec AutoProtect
running on my system. It loads a custom kernel module that adds some
file operation hooks, which affects the result of setns.
$ lsmod | grep symev
symev_custom_dkms_x86_64 72166 2 symap_custom_dkms_x86_64
The workaround is to disable Symantec AutoProtect and reboot.
sudo update-rc.d autoprotect disable
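To confirm whether the module is still loaded before and after the reboot, the lsmod check from the comment can be wrapped in a small function. This is a sketch under assumptions: the function name has_symantec_module is hypothetical, and the symev/symap name prefixes are taken from the lsmod output quoted above, so other Symantec versions may name their modules differently.

```shell
# Hypothetical helper: reads `lsmod`-style text on stdin, prints every module
# whose name starts with "symev" or "symap" (prefixes taken from the quoted
# lsmod output), and returns 0 if at least one was found, 1 otherwise.
has_symantec_module() {
  # $1 is the module name column; `exit !found` yields 0 on a match, 1 if none.
  awk '$1 ~ /^sym(ev|ap)/ { print $1; found = 1 } END { exit !found }'
}

# Example:
#   lsmod | has_symantec_module && echo "Symantec module still loaded"
```

If the function still reports a module after the reboot, the workaround has not taken effect and docker exec is likely to keep failing in the same way.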
