child_process didn't receive SIGTERM inside docker container - node.js

I'm writing tests for an Electron application in TypeScript.
Inside the application a listener is registered for SIGTERM:
process.on('SIGTERM', async () => {
console.log('before exit');
await this.exit(); // some inner function; execution never reaches this line anyway
});
Locally everything works fine, but on CI, when the app runs inside a Docker container, it looks like it never receives SIGTERM.
To start the application I'm using child_process.spawn:
import type { ChildProcess } from 'child_process';
let yarnStart: ChildProcess = spawn('yarn', ['start'], { shell: true });
// 'start' is just script in package.json
I tried to kill the application three different ways and none of them worked. The application never received SIGTERM (no before exit was logged), and after manually stopping the CI build, ps aux in the final step still showed my process.
// 1-way
yarnStart.kill('SIGTERM');
// 2-way
process.kill(yarnStart.pid, 'SIGTERM');
// 3-way
import { execSync } from 'child_process';
execSync(`kill -15 ${yarnStart.pid}`);
Why can't Node.js properly deliver SIGTERM inside a Docker container?
The only difference: locally I have Debian 9 (stretch), while the image is based on Debian 10 (buster). Same Node.js version, 12.14.1. I will try to build a container based on stretch to see how it behaves, but I'm skeptical that this will help.
UPD
There is a difference in how the processes are started (because the CI runs scripts inside the container, every instruction runs through /bin/sh -c).
When you execute ps aux you see:
//locally
myuser 101457 1.3 0.1 883544 58968 pts/8 Sl+ 10:32 0:00 /usr/bin/node /usr/share/yarn/bin/yarn.js start
myuser 101468 1.6 0.2 829316 69456 pts/8 Sl+ 10:32 0:00 /usr/bin/node /usr/share/yarn/lib/cli.js start
myuser 101479 1.6 0.2 829576 69296 pts/8 Sl+ 10:32 0:00 /usr/bin/node /usr/share/yarn/lib/cli.js start:debug
myuser 101490 0.2 0.0 564292 31140 pts/8 Sl+ 10:32 0:00 /usr/bin/node /home/myuser/myrepo/electron-app/node_modules/.bin/electron -r ts-node/register ./src/main.ts
myuser 101497 143 1.4 9215596 485132 pts/8 Sl+ 10:32 0:35 /home/myuser/myrepo/node_modules/electron/dist/electron -r ts-node/register ./src/main.ts
//container
root 495 0.0 0.0 2392 776 ? S 09:05 0:00 /bin/sh -c yarn start
root 496 1.0 0.2 893240 74336 ? Sl 09:05 0:00 /usr/local/bin/node /opt/yarn-v1.22.5/bin/yarn.js start
root 507 1.7 0.2 885588 68652 ? Sl 09:05 0:00 /usr/local/bin/node /opt/yarn-v1.22.5/lib/cli.js start
root 518 0.0 0.0 2396 712 ? S 09:05 0:00 /bin/sh -c yarn start:debug
root 519 1.7 0.2 885336 68608 ? Sl 09:05 0:00 /usr/local/bin/node /opt/yarn-v1.22.5/lib/cli.js start:debug
root 530 0.0 0.0 2396 780 ? S 09:05 0:00 /bin/sh -c electron -r ts-node/register ./src/main.ts
root 531 0.3 0.0 554764 32080 ? Sl 09:05 0:00 /usr/local/bin/node /opt/ci/jobfolder/job_id_423/electron-app/node_modules/.bin/electron -r ts-node/register ./src/main.ts
root 538 140 1.5 9072388 520824 ? Sl 09:05 0:26 /opt/ci/jobfolder/job_id_423/node_modules/electron/dist/electron -r ts-node/register ./src/main.ts
And actually, killing the process with
// 1-way
yarnStart.kill('SIGTERM');
works, but it kills only /bin/sh -c yarn start and its child process /usr/local/bin/node /opt/yarn-v1.22.5/bin/yarn.js start; the application that was actually spawned is still hanging.

Running commands through /bin/sh -c comes with a load of problems, one of them notably being that you'll never see a signal in your application. And the descendant processes create their own /bin/sh -c wrappers.
I found a solution in https://stackoverflow.com/a/33556110/4577788
I also found an alternative solution (though it didn't work for me, maybe because of specifics of how Node.js executes commands).
You can kill all the processes belonging to the same process tree using the process group ID. More detailed info can be found here: https://stackoverflow.com/a/15139734/4577788
When I try to execute execSync('kill -- -942'); or execSync('kill -- "-942"');, the error kill: illegal number - occurs. I didn't find out why it occurs or how to fix it.

Related

NPM test getting mysteriously stuck when run in Jenkins

We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm commands to run tests/lint and build the apps for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand.
The script is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:
stage('Frontend - App1') {
agent {
kubernetes {
label 'node'
defaultContainer 'jnlp'
yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
idleMinutes 30
}
}
environment {
CI = 'true'
NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
}
steps {
dir('frontend/app1') {
container('node') {
sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
sh 'chmod u+rw ~/.npmrc'
sh '../../infrastructure/scripts/ci/build-frontend.sh'
}
publishHTML(target: [
allowMissing : false,
alwaysLinkToLastBuild: false,
keepAll : true,
reportDir : 'coverage',
reportFiles : 'index.html',
reportName : "Coverage Report (app1)"
])
junit 'testing/junit.xml'
stash includes: 'build/**/*', name: 'app1-build'
stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
}
}
}
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 would complete (the smaller of the two), whilst app2 would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).
We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.
We're pretty stumped with how to investigate this further, particularly as we have the awkwardness of not easily being able to install whatever we like into the pod for further investigation (no easy root access)... but any suggested theories / next steps would be welcomed!
** EDIT **
It seems idleMinutes wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(
Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.
Doing some more digging around, I happened to run a kubectl top pods in time to catch one of the node pods going crazy - you can see that node-thk0r-5vpzk is using 3131Mi at this particular moment in time, and we'd set the limit on the pod to be 3Gi:
Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.
This also (potentially) explains why the problem became way more common after I introduced the idleMinutes behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to look into the memory usage a bit using --detectLeaks to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
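For reference, a sketch of that worker cap; with CRA/react-app-rewired the flag is usually passed on the command line (npm test -- --maxWorkers=2), but as a standalone Jest config it would look like this (the value 2 is illustrative, tune it to your pod's memory limit):

```javascript
// jest.config.js (sketch; values are illustrative)
module.exports = {
  maxWorkers: 2,       // cap the number of parallel jest-worker processes
  // detectLeaks: true, // optional: experimental flag to hunt leaking tests
};
```
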
Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...

Killing subprocess from inside a Docker container kills the entire container

On my Windows machine, I started a Docker container from docker compose. My entrypoint is a Go filewatcher that runs a task of a taskmanager on every filechange. The executed task builds and runs the Go program.
But before I can build and run the program again after filechanges I have to kill the previous running version. But every time I kill the app process, the container is also gone.
The goal is to kill only the svc1 process with PID 74 in this example. I tried pkill -9 svc1 and kill $(pgrep svc1). But every time the parent processes are killed too.
The commandline output from inside the container:
root@bf073c39e6a2:/app/cmd/svc1# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 2.5 0.0 104812 2940 ? Ssl 13:38 0:00 /go/bin/watcher
root 13 0.0 0.0 294316 7576 ? Sl 13:38 0:00 /go/bin/task de
root 74 0.0 0.0 219284 4908 ? Sl 13:38 0:00 /svc1
root 82 0.2 0.0 18184 3160 pts/0 Ss 13:38 0:00 /bin/bash
root 87 0.0 0.0 36632 2824 pts/0 R+ 13:38 0:00 ps -aux
root@bf073c39e6a2:/app/cmd/svc1# ps -afx
PID TTY STAT TIME COMMAND
82 pts/0 Ss 0:00 /bin/bash
88 pts/0 R+ 0:00 \_ ps -afx
1 ? Ssl 0:01 /go/bin/watcher -cmd /go/bin/task dev -startcmd
13 ? Sl 0:00 /go/bin/task dev
74 ? Sl 0:00 \_ /svc1
root@bf073c39e6a2:/app/cmd/svc1# pkill -9 svc1
root@bf073c39e6a2:/app/cmd/svc1#
Switching to the containerlog:
task: Failed to run task "dev": exit status 255
2019/08/16 14:20:21 exit status 1
"dev" is the name of the task in the taskmanger.
The Dockerfile:
FROM golang:stretch
RUN go get -u -v github.com/radovskyb/watcher/... \
&& go get -u -v github.com/go-task/task/cmd/task
WORKDIR /app
COPY ./Taskfile.yml ./Taskfile.yml
ENTRYPOINT ["/go/bin/watcher", "-cmd", "/go/bin/task dev", "-startcmd"]
I expect that only the process with the target PID is killed, and not the parent process that spawned it.
You can use a process manager like supervisord and configure it to re-execute your script or command even after you kill its process, which will keep your container up and running.

rsync daemon behaving erratically

I'm running an rsync daemon (providing a mirror for the SaneSecurity signatures).
rsync is started like this (from runit):
/usr/bin/rsync -v --daemon --no-detach
And the config contains:
use chroot = no
munge symlinks = no
max connections = 200
timeout = 30
syslog facility = local5
transfer logging = no
log file = /var/log/rsync.log
reverse lookup = no
[sanesecurity]
comment = SaneSecurity ClamAV Mirror
path = /srv/mirror/sanesecurity
read only = yes
list = no
uid = nobody
gid = nogroup
But what I'm seeing is a lot of "lingering" rsync processes:
# ps auxwww|grep rsync
root 423 0.0 0.0 4244 1140 ? Ss Oct30 0:00 runsv rsync
root 2529 0.0 0.0 11156 2196 ? S 15:00 0:00 /usr/bin/rsync -v --daemon --no-detach
nobody 4788 0.0 0.0 20536 2860 ? S 15:10 0:00 /usr/bin/rsync -v --daemon --no-detach
nobody 5094 0.0 0.0 19604 2448 ? S 15:13 0:00 /usr/bin/rsync -v --daemon --no-detach
root 5304 0.0 0.0 11156 180 ? S 15:15 0:00 /usr/bin/rsync -v --daemon --no-detach
root 5435 0.0 0.0 11156 180 ? S 15:16 0:00 /usr/bin/rsync -v --daemon --no-detach
root 5797 0.0 0.0 11156 180 ? S 15:19 0:00 /usr/bin/rsync -v --daemon --no-detach
nobody 5913 0.0 0.0 20536 2860 ? S 15:20 0:00 /usr/bin/rsync -v --daemon --no-detach
nobody 6032 0.0 0.0 20536 2860 ? S 15:21 0:00 /usr/bin/rsync -v --daemon --no-detach
root 6207 0.0 0.0 11156 180 ? S 15:22 0:00 /usr/bin/rsync -v --daemon --no-detach
nobody 6292 0.0 0.0 20544 2744 ? S 15:23 0:00 /usr/bin/rsync -v --daemon --no-detach
root 6467 0.0 0.0 11156 180 ? S 15:25 0:00 /usr/bin/rsync -v --daemon --no-detach
root 6905 0.0 0.0 11156 180 ? S 15:29 0:00 /usr/bin/rsync -v --daemon --no-detach
(it's currently 15:30)
So there's processes (not even having dropped privileges!) hanging around since 15:10, 15:13 and the like.
And what are they doing?
Let's check:
# strace -p 5304
strace: Process 5304 attached
select(4, [3], NULL, [3], {25, 19185}^C
strace: Process 5304 detached
<detached ...>
# strace -p 5797
strace: Process 5797 attached
select(4, [3], NULL, [3], {48, 634487}^C
strace: Process 5797 detached
<detached ...>
This happened with both the rsync shipped with Ubuntu Xenial and one installed from a PPA (currently rsync 3.1.2-1~ubuntu16.04.1york0).
One process is created for each connection. Before a client selects the module, the process does not know whether it should drop privileges.
You can easily create such a process:
nc $host 873
You will notice that the connection is not closed after 30 s, because that timeout is just a disk-I/O timeout. The rsync client has a --contimeout option, but a corresponding server-side option seems to be missing.
In the end, I resorted to invoking rsync from (x)inetd instead of running it standalone.
service rsync
{
disable = no
socket_type = stream
wait = no
user = root
server = /usr/bin/timeout
server_args = -k 60s 60s /usr/bin/rsync --daemon
log_on_failure += USERID
flags = IPv6
}
As an additional twist, I wrapped the rsync invocation with timeout, adding another safeguard against long-running processes.

How does 'kill -STOP and kill -CONT' work?

I'm facing an issue.
We have a cleanup script used to delete old files; sometimes we need to stop it and start it again later, as with the processes below. We use kill -STOP $pid and kill -CONT $pid in check.sh to control clean.sh, where $pid is all the PIDs of clean.sh (here, 23939 and 25804):
root 4321 0.0 0.0 74876 1184 ? Ss 2015 0:25 crond
root 23547 0.0 0.0 102084 1604 ? S 2015 0:00 \_ crond
root 23571 0.0 0.0 8728 972 ? Ss 2015 0:00 \_ /bin/bash -c bash /home/test/sbin/check.sh >>/home/test/log/check.log 2>&1
root 23577 0.0 0.0 8732 1092 ? S 2015 0:00 \_ bash /home/test/sbin/check.sh
root 23939 0.0 0.0 8860 1192 ? S 2015 0:45 \_ bash /home/test/bin/clean.sh 30
root 25804 0.0 0.0 8860 620 ? S 2015 0:00 \_ bash /home/test/bin/clean.sh 30
root 25805 0.0 0.0 14432 284 ? T 2015 0:00 \_ ls -d ./455bb4cba6142427156d2b959b8b0986/120x60/ ./455bb4cba6142427156d2b959b8b0986/80x
root 25808 0.0 0.0 3816 432 ? S 2015 0:00 \_ wc -l
Once check.sh had stopped clean.sh and then, hours later, started it again, a strange thing happened: after the stop and continue, the child process ls -d ... was still stopped.
Could you tell me whether this is caused by wrong use of the signals? And how can I fix it?
OK, it seems my description was not clear; my bad English...
I'm not sure of the reason, but there is a way to solve it:
kill -CONT $pid
pkill -CONT -P $pid
This will continue the child processes as well.

How to see a terminal output from a previously closed terminal

I connect to a remote server using SSH.
I was compiling using cmake and then make. It's not common to have a progress percentage in the compilation process, but this time it had one. I was watching the compilation until my internet connection failed, so PuTTY closed the session and I had to connect to my server again. I thought all the progress was lost, but first I made sure by checking the process list with the ps aux command, and I noticed that the processes related to the compilation were still running:
1160 tty1 Ss+ 0:00 /sbin/mingetty tty1
2265 ? Ss 0:00 sshd: root@pts/1
2269 pts/1 Ss 0:00 -bash
2353 pts/1 S+ 0:00 make
2356 pts/1 S+ 0:00 make -f CMakeFiles/Makefile2 all
2952 ? S 0:00 pickup -l -t fifo -u
3085 ? Ss 0:00 sshd: root@pts/0
3089 pts/0 Ss 0:00 -bash
3500 pts/1 S+ 0:01 make -f src/compiler/CMakeFiles/hphp_analysis.dir/bui
3509 pts/1 S+ 0:00 /bin/sh -c cd /root/hiphop/hiphop-php/src/compiler &&
3510 pts/1 S+ 0:00 /usr/bin/g++44 -DNO_JEMALLOC=1 -DNO_TCMALLOC=1 -D_GNU
3511 pts/1 R+ 0:03 /usr/libexec/gcc/x86_64-redhat-linux6E/4.4.4/cc1plus
3512 pts/0 R+ 0:00 ps ax
I would like to know whether it is possible to watch the current progress of the compilation from the previously closed terminal's output, something similar to cat /dev/vcsa1.
As per the comment above, you should have used screen.
As it is, you could try to peek at the file descriptors used by sshd and the shell that you started, but I don't think this will get you very far.