ps command CPU utilization for gunicorn always reads 0% - Linux

I am charting CPU/memory usage for some gunicorn workers I have running with multiple Django apps, and I am getting the info from the ps command.
I am running this:
/bin/ps -C gunicorn -o pcpu -o pmem -o cmd
which always outputs like this:
%CPU %MEM CMD
0.0 0.5 command
0.0 0.5 command
0.0 2.2 command
0.0 0.6 command
0.0 0.7 command
0.0 0.7 command
which is great, but the %CPU is always 0... even if I make my app do some CPU-heavy work, like pulling A LOT of info out of a PostgreSQL database.
Shouldn't the CPU% be going up when the gunicorn worker is told to do something like that?
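For what it's worth, ps's %CPU is not an instantaneous reading: it is the CPU time used divided by the time the process has been running, so a mostly idle worker that does a short burst of work will still round to 0.0. For charting utilization over time, a sampling tool fits better; a minimal sketch, assuming pidstat from the sysstat package is installed:
# five 1-second samples of CPU and memory usage for every process
# whose command name matches "gunicorn"
pidstat -u -r -C gunicorn 1 5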

Related

Messy output when redirecting the output of the "top" command to a file

Not sure whether there are similar questions already. Sometimes we need to execute "top" just once and redirect its output to a file, such as:
top -n 1 -o %CPU > top.log
But top.log ends up full of garbled escape sequences:
^[[?1h^[=^[[?25l^[[H^[[2J^[(B^[[mtop - 16:27:45 up 916 days, 17:43, 152 users, load average: 5.51, 5.39, 5.42^[(B^[[m^[[39;49m^[(B^[[m^[[39;49m^[[K
How to fix it?
When redirecting "top" command output to a file, we need to use the batch mode (-b) according to the manual:
-b : Batch-mode operation. Starts top in Batch mode, which could be useful for sending output from top to other programs or to a file. In this mode, top will not accept input and runs until the iterations limit you've set with the '-n' command-line option or until killed.
So we can fix the issue by:
top -b -n 1 -o %CPU > top.log
And top.log will be something like:
top - 16:35:07 up 916 days, 17:50, 152 users, load average: 4.68, 4.96, 5.24
Tasks: 2106 total, 4 running, 2065 sleeping, 8 stopped, 22 zombie
%Cpu(s): 9.7 us, 5.8 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
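Batch mode also supports a rolling capture rather than a single snapshot; a small variation (the delay and iteration counts here are illustrative):
# twelve snapshots, five seconds apart, sorted by CPU usage
top -b -d 5 -n 12 -o %CPU >> top.log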

NPM test getting mysteriously stuck when run in Jenkins

We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm commands to run tests/lint and build the apps for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand.
The script is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:
stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                allowMissing         : false,
                alwaysLinkToLastBuild: false,
                keepAll              : true,
                reportDir            : 'coverage',
                reportFiles          : 'index.html',
                reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 would complete (the smaller of the two), whilst app2 would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).
We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.
We're pretty stumped with how to investigate this further, particularly as we have the awkwardness of not easily being able to install whatever we like into the pod for further investigation (no easy root access)... but any suggested theories / next steps would be welcomed!
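One low-footprint diagnostic that needs no extra tooling in the pod is to ask ps which kernel function each sleeping worker is blocked in, and to read its state and resident memory from /proc; a minimal sketch, using one of the jest-worker PIDs from the listing above:
# wchan names the kernel function the process is sleeping in
ps -o pid,stat,wchan:32,cmd -p 6881
# current state and resident memory of the same process
grep -E 'State|VmRSS' /proc/6881/status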
** EDIT **
It seems idleMinutes wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(
Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.
Doing some more digging around, I happened to run kubectl top pods in time to catch one of the node pods going crazy: node-thk0r-5vpzk was using 3131Mi at that particular moment, and we'd set the limit on the pod to be 3Gi.
Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.
This also (potentially) explains why the problem became way more common after I introduced the idleMinutes behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to look into the memory usage a bit using --detectLeaks to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
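For context, the change amounted to something like this in the CI script (the worker count of 2 is illustrative; Jest's --maxWorkers flag caps the size of the jest-worker pool):
npm test -- --coverage --maxWorkers=2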
Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...

Killing subprocess from inside a Docker container kills the entire container

On my Windows machine, I started a Docker container from docker compose. My entrypoint is a Go file watcher that runs a task from a task manager on every file change. The executed task builds and runs the Go program.
But before I can build and run the program again after a file change, I have to kill the previously running version. The problem: every time I kill the app process, the container is also gone.
The goal is to kill only the svc1 process with PID 74 in this example. I tried pkill -9 svc1 and kill $(pgrep svc1). But every time the parent processes are killed too.
The commandline output from inside the container:
root@bf073c39e6a2:/app/cmd/svc1# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 2.5 0.0 104812 2940 ? Ssl 13:38 0:00 /go/bin/watcher
root 13 0.0 0.0 294316 7576 ? Sl 13:38 0:00 /go/bin/task de
root 74 0.0 0.0 219284 4908 ? Sl 13:38 0:00 /svc1
root 82 0.2 0.0 18184 3160 pts/0 Ss 13:38 0:00 /bin/bash
root 87 0.0 0.0 36632 2824 pts/0 R+ 13:38 0:00 ps -aux
root@bf073c39e6a2:/app/cmd/svc1# ps -afx
PID TTY STAT TIME COMMAND
82 pts/0 Ss 0:00 /bin/bash
88 pts/0 R+ 0:00 \_ ps -afx
1 ? Ssl 0:01 /go/bin/watcher -cmd /go/bin/task dev -startcmd
13 ? Sl 0:00 /go/bin/task dev
74 ? Sl 0:00 \_ /svc1
root@bf073c39e6a2:/app/cmd/svc1# pkill -9 svc1
root@bf073c39e6a2:/app/cmd/svc1#
Switching to the containerlog:
task: Failed to run task "dev": exit status 255
2019/08/16 14:20:21 exit status 1
"dev" is the name of the task in the taskmanger.
The Dockerfile:
FROM golang:stretch
RUN go get -u -v github.com/radovskyb/watcher/... \
&& go get -u -v github.com/go-task/task/cmd/task
WORKDIR /app
COPY ./Taskfile.yml ./Taskfile.yml
ENTRYPOINT ["/go/bin/watcher", "-cmd", "/go/bin/task dev", "-startcmd"]
I expect only the process with the target PID to be killed, not the parent process that spawned it.
You can use a process manager like supervisord and configure it to re-execute your script or command even after its process is killed, which will keep your container up and running.
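If installing supervisord into the image is awkward, a bare-bones alternative with a similar effect is to wrap the service in a shell restart loop, so killing /svc1 never propagates an exit status up to the watcher at PID 1; a minimal sketch (the wrapper script name is hypothetical):
#!/bin/sh
# hypothetical run-svc1.sh: restart svc1 whenever it is killed, so the
# watcher (PID 1) never sees the child's death as a task failure
while true; do
  /svc1
  echo "svc1 exited with status $?, restarting" >&2
  sleep 1
done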

Can't stop/restart Apache2 service

Trying to stop Apache2 service, but get PID error:
#service apache2 stop
[FAIL] Stopping web server: apache2 failed!
[warn] There are processes named 'apache2' running which do not match your pid file which are left untouched in the name of safety, Please review the situation by hand. ... (warning).
Trying to kill those processes:
#kill -9 $(ps aux | grep apache2 | awk '{print $2}')
but they just get respawned:
#ps aux | grep apache2
root 19279 0.0 0.0 4080 348 ? Ss 05:10 0:00 runsv apache2
root 19280 0.0 0.0 4316 648 ? S 05:10 0:00 /bin/sh /usr/sbin/apache2ctl -D FOREGROUND
root 19282 0.0 0.0 91344 5424 ? S 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
www-data 19284 0.0 0.0 380500 2812 ? Sl 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
www-data 19285 0.0 0.0 380500 2812 ? Sl 05:10 0:00 /usr/sbin/apache2 -D FOREGROUND
And though the processes are running, I can't connect to the server on port 80. /var/log/apache2/error.log.1 has no new messages when I do the kill -9.
Before I tried to restart everything worked perfectly.
Running on Debian: Linux adara 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
UPD:
also tried apache2ctl:
#/usr/sbin/apache2ctl -k stop
AH00526: Syntax error on line 76 of /etc/apache2/apache2.conf:
PidFile takes one argument, A file for logging the server process ID
Action '-k stop' failed.
The Apache error log may have more information.
but there is no pid file in /var/run/apache2.
I'm new to Linux; it looks like it has something to do with startup scripts, but I can't figure out what exactly.
Below is the command to find the process running on port 80:
lsof -i tcp:80
Kill the process by its PID. Restart the system once to check whether any startup script is using port 80 and preventing your service from starting.
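A compact variant that kills by port in one step (assuming lsof's -t flag, which prints bare PIDs):
# kill whatever is listening on port 80
kill -9 $(lsof -t -i tcp:80)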
For startup scripts you can check:
/etc/init.d/, /etc/rc.local, or crontab -e
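Also worth noting from the ps output in the question: the parent process is runsv apache2, which suggests the service is supervised by runit, and runit restarts anything killed out from under it. If that is the case, stopping it through the supervisor should stick; a sketch, assuming runit's sv tool is present:
# ask the runit supervisor to stop apache2 (and not restart it)
sv stop apache2
sv status apache2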
You can try the official Apache documentation for stop/restart operations.

In Linux, show a list of all processes and note whether each is running or suspended

I'm new to Linux.
How can I show a list of all processes that indicates, for each process, whether it's running or suspended?
I've tried
ps -ef|grep myusername
but it doesn't say if the processes are running or not.
also tried
ps ux
same thing, it doesn't say if the processes are running or not.
I'm looking for something like the list I get when I move a process to the background (where each job is marked "Running" or "Stopped"), but I don't know how to see it otherwise...
You can use "ps" to list processes, This (ps aux) will list all the processes. Given an example output of it below.
ps aux | more
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 189160 9376 ? Ss 15:51 0:04 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
root 2 0.0 0.0 0 0 ? S 15:51 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 15:51 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< 15:51 0:00 [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S 15:51 0:06 [rcu_sched]
root 8 0.0 0.0 0 0 ? S 15:51 0:00 [rcu_bh]
root 9 0.0 0.0 0 0 ? S 15:51 0:04 [rcuos/0]
By checking the STAT column you can identify each process's state. Below are some possible state codes:
R running or runnable (on run queue)
D uninterruptible sleep (usually IO)
S interruptible sleep (waiting for an event to complete)
Z defunct/zombie, terminated but not reaped by its parent
T stopped, either by a job control signal or because it is being traced
You can type "man ps" to get more info.
You can use htop to see the list of processes; there is a column for the process state.
What does a C process status mean in htop?
http://www.howtogeek.com/howto/ubuntu/using-htop-to-monitor-system-processes-on-linux/
ps -p PID -o stat= -o comm=
Enter the command above, where PID is the PID of the process; the stat= field prints its state code.
The following command may also be helpful to you.
Use the command: sudo lsof -i -n -P
This command lists the application name, PID, user, IP version, device ID, and the node with port name. It shows both TCP and UDP.
Variations:
To format it in a nice, readable way, use:
sudo lsof -i -n -P | more
To view only TCP connections:
sudo lsof -i -n -P | grep TCP | more
To view only UDP connections:
sudo lsof -i -n -P | grep UDP | more
