I tried code similar to this snippet:
ft.dfs(entityset=es,
       target_entity=...,
       n_jobs=-1)  # or n_jobs=40
But it doesn't seem to use more than one thread on a machine with 40 threads:
S CPU% MEM% TIME+ Command
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
S 0.0 0.7 0:00.00 python test.py
R 78.0 0.7 23:24.72 python test.py
As you can see, there aren't 40 processes; only a single one is running (varying between 78% and 100% on that thread). Does anyone know what is happening here? I left this running for 25 minutes before killing it and didn't see any change in the usage.
Thanks in advance!
It may be that Featuretools has not yet reached the parallelized part of the dfs function. Feature calculation is parallelized, but the feature exploration step (DeepFeatureSynthesis) currently is not. You could check whether this is the issue by setting features_only=True in the dfs call and seeing how long that takes to run. You could then use the calculate_feature_matrix function to compute the returned features.
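For illustration, a minimal sketch of that two-step split, assuming es is your existing EntitySet and "my_entity" is a placeholder for your target entity name (argument names follow the target_entity-style API used in the question):

import featuretools as ft

# Step 1: feature exploration only (DeepFeatureSynthesis).
# This step is not parallelized, so n_jobs has no effect here.
feature_defs = ft.dfs(entityset=es,
                      target_entity="my_entity",  # placeholder name
                      features_only=True)

# Step 2: feature calculation, which is the parallelized part.
feature_matrix = ft.calculate_feature_matrix(features=feature_defs,
                                             entityset=es,
                                             n_jobs=40,
                                             verbose=True)

If step 1 alone accounts for most of the runtime, the extra workers from n_jobs won't show up in top/htop until step 2 begins.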
Not sure whether there are already similar questions. Sometimes we need to run top once and redirect its output to a file, such as:
top -n 1 -o %CPU > top.log
But top.log ends up full of terminal escape sequences:
^[[?1h^[=^[[?25l^[[H^[[2J^[(B^[[mtop - 16:27:45 up 916 days, 17:43, 152 users, load average: 5.51, 5.39, 5.42^[(B^[[m^[[39;49m^[(B^[[m^[[39;49m^[[K
How to fix it?
When redirecting top's output to a file, we need to use batch mode (-b). From the manual:
-b : Batch-mode operation. Starts top in Batch mode, which could be useful for sending output from top to other programs or to a file. In this mode, top will not accept input and runs until the iterations limit you've set with the '-n' command-line option or until killed.
So we can fix the issue by:
top -b -n 1 -o %CPU > top.log
And top.log will be something like:
top - 16:35:07 up 916 days, 17:50, 152 users, load average: 4.68, 4.96, 5.24
Tasks: 2106 total, 4 running, 2065 sleeping, 8 stopped, 22 zombie
%Cpu(s): 9.7 us, 5.8 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
I'm really struggling with how to find processes by name in Linux. I'm sure it's probably something simple that I'm missing.
Are you looking for the ps command?
Here's an example:
nabil@LAPTOP:~$ ps xua | grep python
rootwsl 327 0.0 0.1 29568 17880 ? Ss Jan30 0:02 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
rootwsl 411 0.0 0.1 108116 20740 ? Ssl Jan30 0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
nabil 106387 0.0 0.0 3444 736 pts/1 S+ 23:26 0:00 grep --color=auto python
We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm commands to run tests/lint and build the apps for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand.
The script is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:
stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                allowMissing         : false,
                alwaysLinkToLastBuild: false,
                keepAll              : true,
                reportDir            : 'coverage',
                reportFiles          : 'index.html',
                reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 would complete (the smaller of the two), whilst app2 would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).
We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2#tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.
We're pretty stumped with how to investigate this further, particularly as we have the awkwardness of not easily being able to install whatever we like into the pod for further investigation (no easy root access)... but any suggested theories / next steps would be welcomed!
** EDIT **
It seems idleMinutes wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(
Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.
Doing some more digging around, I happened to run kubectl top pods in time to catch one of the node pods going crazy: node-thk0r-5vpzk was using 3131Mi at that particular moment, and we'd set the memory limit on the pod to 3Gi.
Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.
This also (potentially) explains why the problem became way more common after I introduced the idleMinutes behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to look into the memory usage a bit using --detectLeaks to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...
I ran top -H -p <pid> for a process, which gave me its few threads with their LWPs.
But when I sort the results with the smallest PID first, I notice that the TIME+ of the first thread is constant while the other threads' times are changing. Why is TIME+ different?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16989 root 20 0 106m 28m 2448 S 0.0 0.2 0:22.31 glusterfs
16990 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16992 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16993 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16997 root 20 0 106m 28m 2448 S 0.0 0.2 0:11.71 glusterfs
17010 root 20 0 106m 28m 2448 S 0.0 0.2 0:21.07 glusterfs
17061 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
Why is TIME+ different?
Because different threads are doing different percentages of the work. There could be a number of reasons for this¹, but the most likely is that the application (glusterfs) is not attempting to distribute work evenly across the worker threads.
It is not something to worry about. It doesn't matter which thread does the work if the work level (see the %CPU) is negligible.
¹ If someone had the time and inclination, they could look at the source code of glusterfs to try to understand its behavior. However, I don't think the effort is warranted.
Because the time column refers to the CPU time consumed by a process, when a process's time does not change it probably means that the process is "sleeping" or simply waiting for another process to finish, though there could be many more reasons.
http://linux.about.com/od/commands/l/blcmdl1_top.htm
TIME:
Total CPU time the task has used since it started. If cumulative mode
is on, this also includes the CPU time used by the process's children
which have died. You can set cumulative mode with the S command line
option or toggle it with the interactive command S. The header line
will then be changed to CTIME.
How can I get high-precision memory usage per process with "ps aux"?
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 3672 1984 ? Ss Dec11 0:07 /sbin/init
root 2 0.0 0.0 0 0 ? S Dec11 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Dec11 0:23 [ksoftirqd/0]
root 6 0.0 0.0 0 0 ? S Dec11 0:00 [migration/0]
...
I need more than one digit after the decimal point.
Maybe I can format the %MEM column?
Look into the proc filesystem: /proc/[pid]/status, /proc/[pid]/statm and /proc/[pid]/smaps.
For the fully detailed memory map, see /proc/[pid]/maps.
Read the proc(5) manual page for all the details.
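If you want finer granularity than ps's %MEM (which is rounded to one decimal place), here is a rough Python sketch that computes RSS as a percentage of total RAM from VmRSS in /proc/[pid]/status and MemTotal in /proc/meminfo. The field names come from proc(5); the function name and rounding are just illustrative:

import re

def mem_percent(pid, digits=3):
    """Return the RSS of `pid` as a percentage of total RAM, with extra precision."""
    def read_kb(path, field):
        # Lines look like "VmRSS:      1984 kB" or "MemTotal:  16331712 kB"
        with open(path) as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(re.search(r"\d+", line).group())
        # Kernel threads, for example, have no VmRSS line
        raise ValueError(f"{field} not found in {path}")

    rss_kb = read_kb(f"/proc/{pid}/status", "VmRSS")
    total_kb = read_kb("/proc/meminfo", "MemTotal")
    return round(100.0 * rss_kb / total_kb, digits)

print(mem_percent(1))  # e.g. 0.012 rather than the 0.0 shown by ps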
The ps command has that; type man ps for details.
When the manual page opens, type /memory to search for and highlight strings containing 'memory', then type n to jump to the next match.