I have a multi-threaded (three threads) application on Linux 3.4.0 with the RT7 (realtime) patch. The application needs realtime execution with ~20 ms tolerance. It runs in realtime for a while (1 min to 50 min), and then I find that while one of the threads is doing some processing, a context switch happens and control comes back to the thread about 80 to 500 ms later. I need to find out which process takes away the time slice. All my threads together consume ~5% CPU time. Is there any tool to see process execution history with timestamps?
Consider using SystemTap. It is a dynamic instrumentation engine inspired by DTrace; it dynamically patches the kernel, so it needs the kernel's debug information.
For example, your task can be accomplished with this script:
probe scheduler.cpu_on, scheduler.cpu_off {
    if (pid() == target()) {
        printf("%ld %s\n", gettimeofday_us(), pn());
    }
}
Use the -c option to attach this script to a command, or -x to attach it to a running PID:
root@lkdevel:~# stap -c 'dd if=/dev/zero of=/dev/null count=1' ./schedtrace.stp
...
1423701880670656 scheduler.cpu_on
1423701880673498 scheduler.cpu_off
1423701880674208 scheduler.cpu_on
1423701880689407 scheduler.cpu_off
1423701880689829 scheduler.cpu_on
...
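For the original question, you would presumably attach to the already-running realtime application by PID instead, along these lines (the PID 1234 here is just a placeholder):
root@lkdevel:~# stap -x 1234 ./schedtrace.stp
Long gaps between a cpu_off and the following cpu_on timestamp then show when your thread was scheduled out; correlating those timestamps with a system-wide view (for example perf sched or ftrace's sched_switch events) reveals which task ran in the meantime.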
When I write something like this in a Jenkinsfile:
node {
    parallel(
        phase1: { sh "./firstProgram" },
        phase2: { sh "./secondProgram" }
    )
}
does Jenkins run these as different processes or just as different threads?
To answer the question directly: the parallel branches themselves are "green threads" managed inside a single CPS VM thread, but each sh step starts a separate OS process on the agent. The workflow-cps plugin is what provides the parallel functionality. Below is an excerpt from the plugin's page which discusses how parallel works.
All program logic is run inside a “CPS VM thread”, which is just a Java thread pool that can run binary methods and figure out which continuation to do next. The parallel step uses “green threads” (also known as cooperative multitasking): it records logical thread (~ branch) names for various actions, but does not literally run them simultaneously. The program may seem to perform tasks concurrently, but only because most steps run asynchronously, while the VM thread is idle, and they may overlap in time. No Java thread is consumed except during the typically brief intervals when Groovy code is actually being run on the VM thread. The executor widget only displays an entry for the “flyweight” executor on the built-in node when the VM thread is busy; normally it is hidden.
The GC time is too long in my Spark Streaming program. In the GC log, I found that someone called System.gc() in the program. I do not call System.gc() in my code, so the caller must be one of the APIs I use.
I added -XX:+DisableExplicitGC to the JVM options and that fixed the problem. However, I want to know who calls System.gc().
I have tried a few approaches:
Using jstack. But the GC is not frequent, so it is hard to catch a thread dump of the thread that calls the method.
Adding a trigger in JProfiler that takes a thread dump when java.lang.System.gc() is invoked. But it doesn't seem to work.
How can I find out who calls System.gc() in a Spark Streaming program?
You will not catch System.gc with jstack, because during stop-the-world pauses the JVM does not accept connections from Dynamic Attach tools, including jstack, jmap, jcmd and the like.
It's possible to trace System.gc callers with async-profiler:
Start profiling beforehand:
$ profiler.sh start -e java.lang.System.gc <pid>
After System.gc has been called one or more times, stop profiling and print the stack traces:
$ profiler.sh stop -o traces <pid>
Example output:
--- Execution profile ---
Total samples : 6
Frame buffer usage : 0.0007%
--- 4 calls (66.67%), 4 samples
[ 0] java.lang.System.gc
[ 1] java.nio.Bits.reserveMemory
[ 2] java.nio.DirectByteBuffer.<init>
[ 3] java.nio.ByteBuffer.allocateDirect
[ 4] Allocate.main
--- 2 calls (33.33%), 2 samples
[ 0] java.lang.System.gc
[ 1] sun.misc.GC$Daemon.run
In the above example, System.gc is called 6 times from two places. Both are typical situations in which the JDK internally forces garbage collection.
The first one is from java.nio.Bits.reserveMemory. When there is not enough free memory to allocate a new direct ByteBuffer (because of the -XX:MaxDirectMemorySize limit), the JDK forces a full GC to reclaim unreachable direct ByteBuffers.
The second one is from the GC Daemon thread, which is run periodically by the Java RMI runtime. For example, if you use remote JMX, periodic GC is automatically enabled once per hour. This can be tuned with the -Dsun.rmi.dgc.client.gcInterval system property, as in the example below.
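For illustration, the interval values are given in milliseconds, so stretching the RMI-triggered collections from once per hour to once per day could look like this (the jar name is just a placeholder; the two properties are standard JDK system properties):
java -Dsun.rmi.dgc.client.gcInterval=86400000 -Dsun.rmi.dgc.server.gcInterval=86400000 -jar your-app.jar
In a Spark job you would typically pass such flags through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions rather than on a plain java command line.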
I have started to learn parallel programming, and to calculate performance I need an accurate measurement of the time my program takes.
I want to measure how long my C program runs under Linux, but I get divergent results from run to run.
I suspect this is related to other processes taking CPU time. These are the instructions I am using:
double start, end;
start = omp_get_wtime();
/* ... code being measured ... */
end = omp_get_wtime();
result = end - start;
For conducting accurate benchmarks, it is imperative that external influences are suppressed as much as possible. If your system has enough CPU cores, you can isolate some of them using kernel boot parameters and thus prevent any other process and/or kernel tasks from using those cores:
... isolcpus=3,4,5 nohz_full=3,4,5 rcu_nocbs=3,4,5 ...
Those parameters will almost completely isolate CPUs 3, 4, and 5 by preventing the OS scheduler from running processes on them by default (isolcpus), the kernel RCU system from running tasks on them (rcu_nocbs), and the periodic scheduler timer ticks from firing on them (nohz_full). Make sure that you do not isolate all CPUs!
You can now explicitly assign a process to those cores using taskset -c 3-5 ... or the mechanism built into the OpenMP runtime, e.g., export GOMP_CPU_AFFINITY="3,4,5" for GCC; see the combined example below. Note that, even if you do not use dedicated isolated CPUs, simply turning on thread pinning with export OMP_PROC_BIND=true or by setting GOMP_CPU_AFFINITY (KMP_AFFINITY for the Intel compiler) should decrease the run-time divergence.
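Putting the pieces together, a run on the isolated cores could look roughly like this (./your_benchmark is a placeholder, and the thread count is assumed to match the three isolated cores):
export OMP_NUM_THREADS=3
export GOMP_CPU_AFFINITY="3,4,5"   # pin the three OpenMP threads to the isolated cores
taskset -c 3-5 ./your_benchmark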
Why not just use clock?
clock_t start = clock();
/* do whatever you like here */
clock_t end = clock();
double total_time = (double)(end - start) / CLOCKS_PER_SEC;
or the function getrusage(), which reports the CPU time the process has actually consumed, e.g.:
struct rusage usage;                                   /* needs <sys/resource.h> */
getrusage(RUSAGE_SELF, &usage);
double cpu_secs = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
I am running into a situation where a Go program is taking up 15 GB of virtual memory and continues to grow. The problem only happens on our CentOS server. On my OS X development machine, I can't reproduce it.
Have I discovered a bug in go, or am I doing something incorrectly?
I have boiled the problem down to a simple demo, which I'll describe now. First build and run this go server:
package main

import (
    "net/http"
    "os/exec"
)

func main() {
    http.HandleFunc("/startapp", startAppHandler)
    http.ListenAndServe(":8081", nil)
}

func startCmd() {
    cmd := exec.Command("/tmp/sleepscript.sh")
    cmd.Start()
    cmd.Wait()
}

func startAppHandler(w http.ResponseWriter, r *http.Request) {
    startCmd()
    w.Write([]byte("Done"))
}
Make a file named /tmp/sleepscript.sh and chmod it to 755
#!/bin/bash
sleep 5
And then make several concurrent requests to /startapp. In a bash shell, you can do it this way:
for i in {1..300}; do (curl http://localhost:8081/startapp &); done
The VIRT memory should now be several gigabytes. If you re-run the above for loop, the VIRT memory will continue to grow by gigabytes every time.
Update 1: The problem is that I am hitting OOM issues on CentOS. (thanks @nos)
Update 2: Worked around the problem by using daemonize and synchronizing the calls to Cmd.Run(). Thanks @JimB for confirming that .Wait() running in its own thread is part of the POSIX API and that there isn't a way to avoid calling .Wait() without leaking resources.
Each request you make requires Go to spawn a new OS thread to Wait on the child process. Each thread will consume a 2MB stack, and a much larger chunk of VIRT memory (that's less relevant, since it's virtual, but you may still be hitting a ulimit setting). Threads are reused by the Go runtime, but they are currently never destroyed, since most programs that use a large number of threads will do so again.
If you make 300 simultaneous requests, and wait for them to complete before making any others, memory should stabilize. However if you continue to send more requests before the others have completed, you will exhaust some system resource: either memory, file descriptors, or threads.
The key point is that spawning a child process and calling wait isn't free, and if this were a real-world use case you would need to limit the number of times startCmd() can be called concurrently, for example with a semaphore as sketched below.
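A minimal sketch of one way to do that (not from the original program: the startCmdLimited name and the limit of 50 are made up for illustration; it reuses the /tmp/sleepscript.sh demo and the os/exec import from the server above) is a counting semaphore built from a buffered channel:
// maxConcurrent caps how many child processes may be running (and being waited on) at once.
const maxConcurrent = 50

// sem is a counting semaphore: each in-flight command holds one slot.
var sem = make(chan struct{}, maxConcurrent)

func startCmdLimited() error {
    sem <- struct{}{}        // acquire a slot; blocks while maxConcurrent commands are in flight
    defer func() { <-sem }() // release the slot once the child has been waited for

    cmd := exec.Command("/tmp/sleepscript.sh")
    if err := cmd.Start(); err != nil {
        return err
    }
    return cmd.Wait()
}
If the handler calls startCmdLimited instead of startCmd, at most 50 OS threads are parked in Wait at any moment, so the thread count (and with it the VIRT figure) stops growing without bound; excess requests simply queue on the channel.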
I was looking at the 'os' and 'process' module source and there does not appear to be a way to determine which core a node.js process is running on, before/during/after runtime.
I am looking for something like:
process.env.CORE_ID //not real
I just want to confirm that different node.js processes are running on different cores. It seems reasonable that, although the operating system ultimately chooses which core a node.js process is executed on, we should be able to read that data once the OS starts the process.
Processes are not attached to a specific core in any operating system (except, maybe, in some real-time-oriented ones).
Processors (and cores) are resources that can be assigned to any process whenever needed. One thread can execute on only a single core at a time, but cores are shared by all processes. The operating system is responsible for scheduling processes over the available cores. So, when a process is "paused" to let another process run (or continue running) on the same core, there is no reason to expect it to be resumed on the same core next time.
You can observe this by running htop on a (multi-core) machine with relatively low overall CPU activity while a single process has high CPU consumption: there is always one heavily occupied core, but which core it is changes regularly.
While trying to think about why you would want to know this kind of information, I think I understand the core of your confusion. Note that a process may switch cores multiple times per millisecond. The OS does a very good job of scheduling this, and I cannot imagine that performance is so much of an issue that you would need to improve on it (because in that case you would be writing your own OS layer, not Node). There will be absolutely NO observable delay between a process that is already running on a core and one that has to be started on that core, except possibly on embedded hardware with a custom OS.
Note that modern systems can even switch cores on and off. So imagine a webserver with 2 cores and 2 node processes serving requests. At night, when there is not much work, the OS switches core 2 off, and both processes run happily on core 1, serving the occasional request. Then, when it gets busier, the OS starts the second core, and most of the time both node processes will run at the same time, each on its own core. However, note that the OS also has plenty of other processes to run (for instance, updating the real-time clock). So it may very well be that at some point node process A runs on core 1 and node process B on core 2. Then core 1 is used to update the clock, while core 2 is still running B. Halfway through updating the clock, however, process B is stopped on core 2 and process A is started there. When the clock update is done, process B is started again on core 1. All of this happens within a millisecond or so.
So context switches happen millions of times a second on modern architectures, don't worry about them. Even if you find that at some point the 2 node processes run on different cores, there is no guarantee this is still true a microsecond later. And, again, probably the OS will do a much better job optimising this than you.
Now, one thing you could be interested in is whether, for some reason, the two processes ALWAYS run on the same core (e.g. because secretly they are not two processes but one, such as different requests hitting the same node server). This is something you only need to check once. Just load both node instances fully and check in top/a process explorer/etc. whether their combined CPU usage goes above 100%. If so, you can assume they are at least capable of running on different cores, and further assume that the OS will schedule them onto different cores whenever they would benefit from it.
A possible Linux way:
function getPSR(pid) {
    // Returns the core (psr column of ps) that the main thread of `pid` last ran on.
    var exec = require('child_process').execSync;
    var command = 'ps -A -o pid,psr -p ' + pid + ' | grep ' + pid + ' | grep -v grep | head -n 1 | awk \'{print $2}\'';
    var result = exec(command);
    return result.toString("utf-8").trim();
}
function getTIDs(pid) {
    // Lists all threads of `pid` together with the core each thread last ran on.
    var exec = require('child_process').execSync;
    var command = 'ps -mo tid,psr -p ' + pid + ' | grep -v grep | awk \'/[0-9]/ {print("{\\042id\\042 : ", $1, ",\\042psr\\042:", $2, " }," )}\'';
    var tids = '[ ' + exec(command) + '{"id": false} ]';
    return JSON.parse(tids);
}
function setPSR(pid, psr) {
    // Pins `pid` to core `psr` via taskset, then STOP/CONT cycles it so the scheduler migrates it right away.
    var exec = require('child_process').execSync;
    var command = 'taskset -pc ' + psr + ' ' + pid + '; kill -STOP ' + pid + '; kill -CONT ' + pid;
    var result = exec(command);
    return result.toString("utf-8").trim();
}
function setTIDsPSR(pid, psr) {
    // Pins every thread of `pid` to core `psr`, e.g. setTIDsPSR(process.pid, 2).
    var tids = getTIDs(pid);
    console.log(tids);
    for (var i in tids) {
        if (tids[i].id) {
            console.log(setPSR(tids[i].id, psr));
        }
    }
}
Try this https://www.npmjs.com/package/nodeaffinity
This works on Windows and Linux-based OSes, but not OS X.
var nc = require('nodeaffinity');
// Returns the CPUs/cores (affinity mask) on which the current node process is allowed to run.
// Returns -1 on failure.
console.log(nc.getAffinity());
// Sets the process CPU affinity; here 3 means 011 in binary, i.e. the process will be allowed to run on cpu0 and cpu1.
// Returns the same mask on success, or -1 on failure.
console.log(nc.setAffinity(3));