I was trying to debug a failed test job in a CircleCI workflow which had a config similar to this:
integration_tests:
  docker:
    - image: my-group/our-custom-image:latest
    - image: postgres:9.6.11
      environment:
        POSTGRES_USER: user
        POSTGRES_PASSWORD: password
        POSTGRES_DB: db
  steps:
    - stuff && things
When I ran the job with SSH debugging and SSH'ed to where the CircleCI app told me, I found myself in a strange maze of twisty little namespaces, all alike. I ran ps awwwx and could see processes from both Docker containers:
root@90c93bcdd369:~# ps awwwx
PID TTY STAT TIME COMMAND
1 pts/0 Ss 0:00 /dev/init -- /bin/sh
6 pts/0 S+ 0:00 /bin/sh
7 pts/0 Ss+ 0:00 postgres
40 ? Ssl 0:02 /bin/circleci-agent ...
105 ? Ss 0:00 postgres: checkpointer process
106 ? Ss 0:00 postgres: writer process
107 ? Ss 0:00 postgres: wal writer process
108 ? Ss 0:00 postgres: autovacuum launcher process
109 ? Ss 0:00 postgres: stats collector process
153 pts/1 Ss+ 0:00 bash "stuff && things"
257 pts/1 Sl+ 0:31 /path/to/our/application
359 pts/2 Ss 0:00 -bash
369 pts/2 R+ 0:00 ps awwwx
It seems like they somehow "merged" the cgroup namespaces of the two Docker containers into a third namespace, which is where the shell they dropped me into lives: PID 7 is running from one Docker container, and PID 257 is the application running inside the my-group/our-custom-image:latest container.
The cgroup view from /proc also looks like there is some kind of merging of cgroups going on:
root@90c93bcdd369:~# cat /proc/7/cgroup
12:devices:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
11:blkio:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
10:memory:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
9:hugetlb:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
8:perf_event:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
7:net_cls,net_prio:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
6:rdma:/
5:cpu,cpuacct:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
4:cpuset:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
3:pids:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
2:freezer:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
1:name=systemd:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/d8fc5294708fd4cf91fa405d6462571e1dc56413b55a6b6e5790b8f158fee632
0::/system.slice/containerd.service
root@90c93bcdd369:~# cat /proc/257/cgroup
12:devices:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
11:blkio:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
10:memory:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
9:hugetlb:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
8:perf_event:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
7:net_cls,net_prio:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
6:rdma:/
5:cpu,cpuacct:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
4:cpuset:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
3:pids:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
2:freezer:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
1:name=systemd:/docker/94376e7880579c6bde0622017594fcdb8d5767788bb4790c0f014db282198577/90c93bcdd3693a918adddf62939c5b31e86868864edabe7347a268149e797f43
0::/system.slice/containerd.service
Is there a standard Docker feature, or some cgroup feature being used to produce this magic? Or is this some custom, proprietary CircleCI feature?
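For comparison, the closest standard Docker construct I'm aware of that produces a similar shared process view is joining another container's PID (and network) namespace; a rough sketch on a plain Docker host, with made-up names, would be:
### Hypothetical reproduction on a plain Docker host (container names are made up)
$ docker run -d --name db postgres:9.6.11
$ docker run -it --pid=container:db --network=container:db my-group/our-custom-image:latest sh
### Inside that shell, ps shows processes from both containers, while
### /proc/<pid>/cgroup still reports each process's own container cgroup
I can't tell from the inside whether that is what CircleCI is actually doing, hence the question.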
Is there a simple way I can completely pause an unprivileged docker container from the inside while retaining the ability to unpause/exec it from the outside?
TL;DR
For a Linux container the answer is definitely yes, because these two are essentially equivalent:
From host:
docker pause [container-id]
From the container:
kill -SIGSTOP [process(es)-id]
or, even shorter
kill -SIGSTOP -1
Mind that:
If your process ID (PID) is 1, you fall under an edge case, because PID 1, the init process, has a specific meaning and behaviour in Linux.
Some processes might spawn child workers, as in the NGINX example below.
And those two are also equivalent:
From host:
docker unpause [container-id]
From the container:
kill -SIGCONT [process(es)-id]
or, even shorter
kill -SIGCONT -1
Also mind that, in some edge cases, this won't behave exactly like docker pause: SIGSTOP itself cannot be caught or ignored, but the suspension is observable, and SIGCONT can be caught and handled, so a process (or a supervisor watching it) may notice that it is being suspended and resumed and react to it.
In those cases, you will have to:
either be a privileged user, because the cgroup freezer interface lives under a path that is read-only by default in Docker; even then this will probably lead you to a dead end, because once everything is frozen you will not be able to exec into the container anymore;
or run your container with the --init flag, so that PID 1 is just a wrapper process initialised by Docker and you no longer need to stop PID 1 itself in order to pause the processes running inside your container.
You can use the --init flag to indicate that an init process should be used as the PID 1 in the container. Specifying an init process ensures the usual responsibilities of an init system, such as reaping zombie processes, are performed inside the created container.
The default init process used is the first docker-init executable found in the system path of the Docker daemon process. This docker-init binary, included in the default installation, is backed by tini.
This is definitely possible for Linux containers, and is more or less explained in the documentation, which points out that docker pause [container-id] uses a mechanism equivalent to sending the SIGSTOP signal to the processes running in your container.
The docker pause command suspends all processes in the specified containers. On Linux, this uses the freezer cgroup. Traditionally, when suspending a process the SIGSTOP signal is used, which is observable by the process being suspended. With the freezer cgroup the process is unaware, and unable to capture, that it is being suspended, and subsequently resumed. On Windows, only Hyper-V containers can be paused.
See the freezer cgroup documentation for further details.
Source: https://docs.docker.com/engine/reference/commandline/pause/
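For reference, on a host running cgroup v1 with the default cgroupfs driver, that freezer can be driven by hand from the host (not from inside an unprivileged container); a minimal sketch, with a placeholder container ID:
### Run as root on the Docker host; cgroup v1 and the cgroupfs driver are assumed
$ sudo sh -c 'echo FROZEN > /sys/fs/cgroup/freezer/docker/<container-id>/freezer.state'
### ...has the same effect as docker pause, and
$ sudo sh -c 'echo THAWED > /sys/fs/cgroup/freezer/docker/<container-id>/freezer.state'
### ...has the same effect as docker unpause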
Here is an example with an NGINX Alpine container:
### For now, we are on the host machine
$ docker run -p 8080:80 -d nginx:alpine
f444eaf8464e30c18f7f83bb0d1bd07b48d0d99f9d9e588b2bd77659db520524
### Testing if NGINX answers, successful
$ curl -I -m 1 http://localhost:8080/
HTTP/1.1 200 OK
Server: nginx/1.19.0
Date: Sun, 28 Jun 2020 11:49:33 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 26 May 2020 15:37:18 GMT
Connection: keep-alive
ETag: "5ecd37ae-264"
Accept-Ranges: bytes
### Jumping into the container
$ docker exec -ti f7a2be0e230b9f7937d90954ef03502993857c5081ab20ed9a943a35687fbca4 ash
### This is the container, now, let's see the processes running
/ # ps -o pid,vsz,rss,tty,stat,time,ruser,args
PID VSZ RSS TT STAT TIME RUSER COMMAND
1 6000 4536 ? S 0:00 root nginx: master process nginx -g daemon off;
29 6440 1828 ? S 0:00 nginx nginx: worker process
30 6440 1828 ? S 0:00 nginx nginx: worker process
31 6440 1828 ? S 0:00 nginx nginx: worker process
32 6440 1828 ? S 0:00 nginx nginx: worker process
49 1648 1052 136,0 S 0:00 root ash
55 1576 4 136,0 R 0:00 root ps -o pid,vsz,rss,tty,stat,time,ruser,args
### Now let's send the SIGSTOP signal to the workers of NGINX, as docker pause would do
/ # kill -SIGSTOP 29 30 31 32
### Running ps again just to observe the T (stopped) state of the processes
/ # ps -o pid,vsz,rss,tty,stat,time,ruser,args
PID VSZ RSS TT STAT TIME RUSER COMMAND
1 6000 4536 ? S 0:00 root nginx: master process nginx -g daemon off;
29 6440 1828 ? T 0:00 nginx nginx: worker process
30 6440 1828 ? T 0:00 nginx nginx: worker process
31 6440 1828 ? T 0:00 nginx nginx: worker process
32 6440 1828 ? T 0:00 nginx nginx: worker process
57 1648 1052 136,0 S 0:00 root ash
63 1576 4 136,0 R 0:00 root ps -o pid,vsz,rss,tty,stat,time,ruser,args
/ # exit
### Back on the host to confirm NGINX doesn't answer anymore
$ curl -I -m 1 http://localhost:8080/
curl: (28) Operation timed out after 1000 milliseconds with 0 bytes received
$ docker exec -ti f7a2be0e230b9f7937d90954ef03502993857c5081ab20ed9a943a35687fbca4 ash
### Sending the SIGCONT signal as docker unpause would do
/ # kill -SIGCONT 29 30 31 32
/ # ps -o pid,vsz,rss,tty,stat,time,ruser,args
PID VSZ RSS TT STAT TIME RUSER COMMAND
1 6000 4536 ? S 0:00 root nginx: master process nginx -g daemon off;
29 6440 1828 ? S 0:00 nginx nginx: worker process
30 6440 1828 ? S 0:00 nginx nginx: worker process
31 6440 1828 ? S 0:00 nginx nginx: worker process
32 6440 1828 ? S 0:00 nginx nginx: worker process
57 1648 1052 136,0 S 0:00 root ash
62 1576 4 136,0 R 0:00 root ps -o pid,vsz,rss,tty,stat,time,ruser,args 29 30 31 32
/ # exit
### Back on the host to confirm NGINX is back
$ curl -I http://localhost:8080/
HTTP/1.1 200 OK
Server: nginx/1.19.0
Date: Sun, 28 Jun 2020 11:56:23 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 26 May 2020 15:37:18 GMT
Connection: keep-alive
ETag: "5ecd37ae-264"
Accept-Ranges: bytes
For the cases where the meaningful process is PID 1, and is therefore protected by the Linux kernel, you might want to try the --init flag when running your container, so that Docker creates a wrapper process as PID 1 and your application ends up with a regular PID that the signals can reach.
$ docker run -p 8080:80 -d --init nginx:alpine
e61e9158b2aab95007b97aa50bc77fff6b5c15cf3b16aa20a486891724bec6e9
$ docker exec -ti e61e9158b2aab95007b97aa50bc77fff6b5c15cf3b16aa20a486891724bec6e9 ash
/ # ps -o pid,vsz,rss,tty,stat,time,ruser,args
PID VSZ RSS TT STAT TIME RUSER COMMAND
1 1052 4 ? S 0:00 root /sbin/docker-init -- /docker-entrypoint.sh nginx -g daemon off;
7 6000 4320 ? S 0:00 root nginx: master process nginx -g daemon off;
31 6440 1820 ? S 0:00 nginx nginx: worker process
32 6440 1820 ? S 0:00 nginx nginx: worker process
33 6440 1820 ? S 0:00 nginx nginx: worker process
34 6440 1820 ? S 0:00 nginx nginx: worker process
35 1648 4 136,0 S 0:00 root ash
40 1576 4 136,0 R 0:00 root ps -o pid,vsz,rss,tty,stat,time,ruser,args
See how nginx: master process nginx -g daemon off;, which was PID 1 in the previous run, is now PID 7?
This allows us to run kill -SIGSTOP -1 and be sure all the meaningful processes are stopped, while still not being locked out of the container.
While digging on this, I found this blog post that seems like a good read on the topic: https://major.io/2009/06/15/two-great-signals-sigstop-and-sigcont/
Also related is this extract from the ps manual page about process state codes:
Here are the different values that the s, stat and state output
specifiers (header "STAT" or "S") will display to describe the state
of a process:
D uninterruptible sleep (usually IO)
I Idle kernel thread
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by
its parent
For BSD formats and when the stat keyword is used, additional
characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom
IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL
pthreads do)
+ is in the foreground process group
Source https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES
Running the docker pause command from inside is not possible for an unprivileged container; it would need access to the Docker daemon, for example by mounting its socket.
You would need to build a custom solution. The basic idea: bind mount a folder from the host, and inside it create a file which acts as a lock. When you want to pause from inside the container, you create the file and actively wait/sleep for as long as it exists. As soon as the host deletes the file at the mounted path, your code resumes. That is a rather naive approach because of the active waiting, but it should do the trick; a minimal sketch follows below.
You can also look into inotify to avoid the active waiting.
https://lwn.net/Articles/604686/
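A minimal sketch of that lock-file idea (the image name and all paths are made up):
### On the host: start the container with a shared folder
$ mkdir -p /tmp/pause-flags
$ docker run -d --name myapp -v /tmp/pause-flags:/flags my-image
### Inside the container: "pause" by creating the flag file and waiting while it exists
/ # touch /flags/paused
/ # while [ -f /flags/paused ]; do sleep 1; done
### On the host: "unpause" by removing the flag file, which lets the loop exit
$ rm /tmp/pause-flags/paused
### With inotify-tools installed inside the container, the sleep loop could be
### replaced by: inotifywait -e delete /flags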
We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm commands to run tests/lint and build the apps for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand.
The script is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:
stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                allowMissing         : false,
                alwaysLinkToLastBuild: false,
                keepAll              : true,
                reportDir            : 'coverage',
                reportFiles          : 'index.html',
                reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 would complete (the smaller of the two), whilst app2 would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).
We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.
We're pretty stumped with how to investigate this further, particularly as we have the awkwardness of not easily being able to install whatever we like into the pod for further investigation (no easy root access)... but any suggested theories / next steps would be welcomed!
** EDIT **
It seems idleMinutes wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(
Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.
Doing some more digging around, I happened to run kubectl top pods in time to catch one of the node pods going crazy: node-thk0r-5vpzk was using 3131Mi at that particular moment, and we'd set the limit on the pod to 3Gi.
Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.
This also (potentially) explains why the problem became way more common after I introduced the idleMinutes behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to look into the memory usage a bit using --detectLeaks to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
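In practice the --maxWorkers change was a one-liner in the build script, something along these lines (the exact worker count here is illustrative; pick whatever fits your pod's memory limit):
### Cap the number of Jest workers so the pod stays under its memory limit
npm test -- --coverage --maxWorkers=2
### Separately, to hunt for leaking tests:
npm test -- --detectLeaks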
Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...
Trying to stop Apache2 service, but get PID error:
#service apache2 stop
[FAIL] Stopping web server: apache2 failed!
[....] There are processes named 'apache2' running which do not match your pid file which are left untouched in the name of safety, Please review the situation by hand. ... (warning).
Trying to kill those processes:
#kill -9 $(ps aux | grep apache2 | awk '{print $2}')
but they get respawned:
#ps aux | grep apache2
root 19279 0.0 0.0 4080 348 ? Ss 05:10 0:00 runsv apache2
root 19280 0.0 0.0 4316 648 ? S 05:10 0:00 /bin/sh /usr/sbin/apache2ctl -D FOREGROUND
root 19282 0.0 0.0 91344 5424 ? S 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
www-data 19284 0.0 0.0 380500 2812 ? Sl 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
www-data 19285 0.0 0.0 380500 2812 ? Sl 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
And though the processes are running, I can't connect to the server on port 80. /var/log/apache2/error.log.1 has no new messages when I do the kill -9.
Everything worked perfectly before I tried to restart.
Running on Debian: Linux adara 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
UPD:
also tried apache2ctl:
#/usr/sbin/apache2ctl -k stop
AH00526: Syntax error on line 76 of /etc/apache2/apache2.conf:
PidFile takes one argument, A file for logging the server process ID
Action '-k stop' failed.
The Apache error log may have more information.
but there is no pid file in /var/run/apache2
I'm new to Linux; it looks like it has something to do with the startup scripts, but I can't figure out what exactly.
Below is the command to find out which process is listening on port 80:
lsof -i tcp:80
Kill the process using its PID. Restart the system once to check whether there is any startup script that uses port 80 and is preventing you from starting your service.
For startup scripts you can check:
/etc/init.d/ or /etc/rc.local or crontab -e
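For example (the PID below is taken from your ps output and is only illustrative):
### See what is listening on port 80
lsof -i tcp:80
### Kill it by PID; try a plain TERM before resorting to -9
kill 19282
kill -9 19282
### Then look for anything that starts or respawns apache2 at boot
ls /etc/init.d/
cat /etc/rc.local
crontab -e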
You can also try the official Apache documentation for stop/restart operations:
link
I created two databases, "my_new_database" and "albums", neither of which I can DELETE. I believe I am still in "ADMIN" party mode. To demonstrate my issue I'll just post some info below.
First, here is output showing CouchDB running (started using the SysV init script via service):
$ ps aux | grep couch
couchdb 2939 0.0 0.2 108320 1528 ? S 20:45 0:00 /bin/sh -e /usr/bin/couchdb -a /etc/couchdb/default.ini -a /etc/couchdb/local.ini -b -r 0 -p /var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb 2950 0.0 0.1 108320 732 ? S 20:45 0:00 /bin/sh -e /usr/bin/couchdb -a /etc/couchdb/default.ini -a /etc/couchdb/local.ini -b -r 0 -p /var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb 2951 4.8 2.3 362168 14004 ? Sl 20:45 0:00 /usr/lib64/erlang/erts-5.8.5/bin/beam -Bd -K true -A 4 -- -root /usr/lib64/erlang -progname erl -- -home /usr/local/var/lib/couchdb -- -noshell -noinput -sasl errlog_type error -couch_ini /etc/couchdb/default.ini /etc/couchdb/local.ini /etc/couchdb/default.ini /etc/couchdb/local.ini -s couch -pidfile /var/run/couchdb/couchdb.pid -heart
couchdb 2959 0.0 0.0 3932 304 ? Ss 20:45 0:00 heart -pid 2951 -ht 11
ec2-user 2963 0.0 0.1 103424 828 pts/1 S+ 20:45 0:00 grep couch
Here is a listing of the ".couch" databases I have (shown for user ownership and permissions):
$ ls -lat /var/lib/couchdb
-rw-r--r-- 1 couchdb couchdb 23 Oct 11 20:45 couch.uri
drwxr-xr-x 3 couchdb couchdb 4096 Oct 11 19:35 .
-rw-r--r-- 1 couchdb couchdb 79 Oct 11 19:35 database2.couch
-rwxrwxrwx 1 couchdb couchdb 79 Oct 11 19:00 my_new_database.couch
-rw-r--r-- 1 couchdb couchdb 4182 Oct 4 21:52 albums.couch
-rw-r--r-- 1 couchdb couchdb 79 Oct 4 21:42 albums-backup.couch
-rw-r--r-- 1 couchdb couchdb 4185 Oct 4 21:30 _users.couch
drwxr-xr-x 18 root root 4096 Oct 4 20:58 ..
drwxr-xr-x 2 root root 4096 Oct 4 18:34 .delete
Here is my first attempt to DELETE
$ curl -X DELETE http://127.0.0.1:5984/my_new_database
{"error":"unauthorized","reason":"You are not a server admin."}
And my second attempt using an authenticated user.
$ curl -X DELETE http://brian:brian@127.0.0.1:5984/my_new_database
{"error":"error","reason":"eacces"}
The username/password of brian/brian was added to the [admin] section of /etc/couchdb/local.ini
Here is the output of my "_users" file. The "key" and "id" fields confuse me.
$ curl -X GET http://brian:brian@127.0.0.1:5984/_users/_all_docs
{"total_rows":1,"offset":0,"rows":[
{"id":"_design/_auth","key":"_design/_auth","value":{"rev":"1-c44fb12a2676d481d235523092e0cec4"}}
]}
Have you restarted your CouchDB after you added the user to local.ini? If so, has the password in the file been hashed, or is it still readable?
Generally your file permissions look OK, so I can't tell what exactly causes the error. For a quick fix you can simply delete the .couch file, though.
This question is really old, but since I got bitten by this today and this is where Google led me, I thought I'd share my solution for others that stumble here. In my case, my Couch lib directory (/usr/local/var/lib/couchdb for me) had a directory called .delete that was owned by root. Changing the owner to couchdb let me delete databases again.
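In terms of commands, the fix was roughly this, run as root (the path below matches the listing in the question; in my install the data directory was /usr/local/var/lib/couchdb instead):
### Check who owns the .delete directory inside the CouchDB data directory
ls -la /var/lib/couchdb/.delete
### Hand it back to the couchdb user, then retry the delete
chown -R couchdb:couchdb /var/lib/couchdb/.delete
curl -X DELETE http://brian:brian@127.0.0.1:5984/my_new_database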