I'm trying to minimize startup time for a C++ program compiled with LLVM. For my specific application, startup time is critical.
One thought I've had is to start the program with a large heap already allocated, so that subsequent malloc calls don't have to grow the heap and make system calls. Do I need to write my own malloc to do this?
If startup time is so important, then make sure your app is launched long before it is actually needed, and that all kinds of initialisation are done at that point. When the app is really needed, you pay no startup cost: it is up and ready without delay.
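If you still want to experiment with the question's idea of pre-growing the heap rather than writing your own allocator, here is a minimal sketch of one approach, assuming glibc malloc; the mallopt() tunables are glibc-specific, and prewarm_heap and the 64 MiB figure are purely illustrative.

    /* Sketch: pre-grow the heap at startup so later malloc() calls are less
     * likely to need brk()/mmap() system calls. Assumes glibc malloc. */
    #include <malloc.h>
    #include <stdlib.h>
    #include <string.h>

    static void prewarm_heap(size_t bytes)      /* illustrative name */
    {
        mallopt(M_TRIM_THRESHOLD, -1);  /* keep freed memory, never return it to the OS */
        mallopt(M_MMAP_MAX, 0);         /* serve large requests from the heap, not mmap */

        void *block = malloc(bytes);
        if (block != NULL) {
            memset(block, 0, bytes);    /* touch the pages so they are actually mapped */
            free(block);                /* the memory stays cached inside malloc */
        }
    }

    int main(void)
    {
        prewarm_heap(64 * 1024 * 1024); /* e.g. 64 MiB; tune for your workload */
        /* ... rest of the program ... */
        return 0;
    }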
I have a monitor script that checks a specified process; if it crashes, the script relaunches it without waiting for the core dump to finish writing. Does this cause any problems? Will it affect the core dump file or the relaunched process?
Yes, you can. A process is a different thing from a program. Just as you can have several instances of the ls command running in parallel on Unix, nothing prevents you from relaunching the same program (as a different, new process) while the old one is still saving its core file. The only difference from a normal process writing a file is that a process writing a core does it in kernel mode. Nothing else.
The core dump is performed by the killed process itself, executing in kernel mode, as its final task before dying. As far as process state is concerned, the process is in the exiting state and nothing can affect it until the core dump is finished (it can only be interrupted by a write error on the dump file, or perhaps the state is interruptible).
The only problem you can have is that the next instance you launch, if it tries to write the same core file name, will have to wait for the previous dump to finish (I think the inode is locked on a per-write basis only, not for the whole file), and you can end up with a bunch of processes dying and writing to the same core file. That is not the case if each core goes to a new, different file (the old file is unlinked before the new one is created), but that depends on the implementation. A possible exploit would be a DoS attack that generates cores at a high pace, so that the writing of core files queues up many processes in an uninterruptible state. But I think this is difficult to achieve; most probably you would just get high load from many processes writing different core files that are erased right afterwards (as a consequence of the unlink system call made by the next core-generating task).
A core(5) dump is very bad, and you should fix its root cause. It is generally the result of some unexpected and unhandled signal(7) (perhaps some memory corruption giving a SIGSEGV, etc...; read also about undefined behavior and be very scared of UB).
if it crashes, the script relaunches it without waiting for the core dump to finish writing.
So your approach is flawed, except as a temporary measure. BTW, in many cases, the virtual address space of the faulty process is small enough for the core to be dumped in a small fraction of a second. In some cases, the dumping of the core might take many minutes (think of a big HPC process dealing with hundreds of gigabytes of data on a supercomputer).
It is rumored that, in the previous century, some huge core files took half an hour to be dumped on Cray supercomputers.
You really should fix your program to avoid dumping core.
We don't know anything about your buggy program that dumps core. But if it has some persistent state (e.g. in some database or some file) that you care about, your approach is very wrong: the core dump might happen in the code that produces that state, and if you then restart the same program, it could reuse that faulty state.
Does this incur bad things?
Yes in general. Perhaps not in your specific case (but we don't know what your program is doing).
So you had better understand why that core dump is happening. In general, you would compile your program with all warnings and debug info (so gcc -Wall -Wextra -g with GCC) and use gdb to analyze the core dump post-mortem (see this).
You really should not write programs which dump core (even if that happens to all of us; but it is a strong bug that should be fixed ASAP). And you should not accept core dumps as an acceptable behavior of your programs.
The core dumps are here to help the developer to fix some serious problem. Read also about Unix philosophy. It is socially unacceptable to consider as "normal" a core dump, which is definitely an abnormal program behavior.
(there are several ways to avoid core dumps; but that makes a different question; and you need to explain what kind of programs you are writing and monitoring, and why and how it is dumping core.)
Suppose I need to peek at a thread's state at regular intervals and record its state along the whole execution of a program. I wouldn't know how to start thinking about this. Any pointers (pun?)? I'm on Linux, using gcc, pthreads and C, and have access to all the usual Linux tools. Basically, I guess I'm asking how to build a simple profiler for threads that will tell me how long a thread has been in some state or other during the execution of the program.
I want to be able to create graphs like Threadscope does. The X axis is time, the Y axis is core/thread number and the "colors" are state: green means running, orange is garbage collection, and so on. Does this make more sense now?
For a Linux-specific solution, you might like to have a look at /proc/<pid>/stat and /proc/<pid>/task/<tid>/stat for process and thread statistics, respectively. Have a look at the proc(5) manual page for a full description of all the fields there (online at http://man7.org/linux/man-pages/man5/proc.5.html - search for /proc/[pid]/stat). In particular, at least the fields utime and stime are of interest to you. These are monotonically increasing times, so you need to remember the previously measured value to be able to compute the time spent in the process/thread during a given time slice, which gives you the data for your graphs. (This is how top(1) works.)
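A minimal parsing sketch, assuming the field layout documented in proc(5); read_cpu_ticks is just an illustrative name and error handling is kept short:

    /* Read utime and stime (fields 14 and 15, in clock ticks) for a process
     * from /proc/<pid>/stat; thread times come from /proc/<pid>/task/<tid>/stat
     * in exactly the same way. */
    #include <stdio.h>
    #include <string.h>

    static int read_cpu_ticks(const char *stat_path,
                              unsigned long *utime, unsigned long *stime)
    {
        char buf[4096];
        FILE *f = fopen(stat_path, "r");
        if (!f)
            return -1;
        size_t n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';

        /* The comm field (2) is in parentheses and may contain spaces,
         * so start parsing after the last ')'. */
        char *p = strrchr(buf, ')');
        if (!p)
            return -1;
        /* Skip fields 3..13, then read utime (14) and stime (15). */
        if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   utime, stime) != 2)
            return -1;
        return 0;
    }

    int main(void)
    {
        unsigned long utime, stime;
        if (read_cpu_ticks("/proc/self/stat", &utime, &stime) == 0)
            printf("utime=%lu stime=%lu (clock ticks; divide by sysconf(_SC_CLK_TCK))\n",
                   utime, stime);
        return 0;
    }

Sampling this periodically and subtracting the previously stored values gives the per-slice CPU time described above.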
However, for the profiler to distinguish different states makes the problem more complicated. How does the profiler know which state the profiled program is in? It seems to me the profiled program's threads need to signal this to the profiler in some way. You need some kind of tailored solution for this state sharing (unless you can run the different states in different threads and make the distinction that way, which I doubt).
If the state transitions are done in a single place (e.g. enter GC and leave GC in your example), then one way would be as follows:
The monitored threads would get the start and end times of the special states by using the POSIX function clock_gettime(): with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp) you get the process CPU time, and with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp) you get the thread CPU time (both monotonically increasing, again).
The thread could communicate these timings to the profiler program with some kind of IPC.
If the profiler application knows the thread times at which a state was entered and left, then, because it also knows the thread time values at the boundaries of the measuring slices, it can determine how much of the thread time was spent in the reported states within a reporting time slice (and of course a state that spans a boundary has its start time adjusted to the start of the next reporting slice).
The time the whole process has spent on a specific state can be calculated by summing up the thread times for that state.
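A rough sketch of the state-marking side, assuming hypothetical state_enter()/state_leave() helpers placed around the GC phase; report_state() here just prints, standing in for whatever IPC you choose (pipe, socket, shared memory):

    #include <stdio.h>
    #include <time.h>

    enum prof_state { STATE_RUNNING, STATE_GC };   /* whatever states you track */

    /* __thread is the GCC thread-local extension, so each thread has its own record. */
    static __thread struct {
        enum prof_state state;
        struct timespec enter;   /* thread CPU time when the state was entered */
    } current;

    /* Stand-in for the real IPC to the profiler process. */
    static void report_state(enum prof_state s, const struct timespec *enter,
                             const struct timespec *leave)
    {
        fprintf(stderr, "state %d: %ld.%09ld -> %ld.%09ld (thread CPU time)\n",
                (int)s, (long)enter->tv_sec, enter->tv_nsec,
                (long)leave->tv_sec, leave->tv_nsec);
    }

    void state_enter(enum prof_state s)
    {
        current.state = s;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &current.enter);
    }

    void state_leave(void)
    {
        struct timespec leave;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &leave);
        report_state(current.state, &current.enter, &leave);
    }

    int main(void)
    {
        state_enter(STATE_GC);
        /* ... work representing the GC phase ... */
        state_leave();
        return 0;
    }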
Note that through /proc/<pid>/stat or /proc/<pid>/task/<tid>/stat the measurement accuracy is not very good (clock ticks, often units of 10 ms), but I do not know of another way to get timing information from outside the process/thread. The function clock_gettime() gives very accurate times (nominally nanosecond resolution, but note that on at least some MIPS and ARM systems the accuracy is as bad as with the stat files under /proc, because the Linux kernel lacks an accurate timer-reading implementation for these fields). You would also need to do some experimentation to make sure these two timing sources really give the same results (by reading both values from the same threads). You can of course use these /proc/.../stat files inside the thread, but the accuracy just is not very good unless you spend a lot of time within a state.
Well, the direct match to the profiling info produced by the Haskell compiler and processed by Threadscope is, using C and GCC, the gprof utility (it is part of GNU binutils).
For it to work correctly with pthreads you need each thread to trigger some timer initialization function. This can be done without modifying your code with this pthreads wrapper library: http://sam.zoy.org/writings/programming/gprof.html . I haven't dealt with the problem recently, it may be that something has changed and the wrapper isn't needed anymore...
As for a GUI to interpret the profiling results, there is KProf (http://kprof.sourceforge.net). Unfortunately, AFAIK it doesn't produce thread duration graphs; for that you'll have to build your own solution from the textual info produced by gprof.
If you are not picky about using the "standard" solution offered with GCC, you may want to try this: http://code.google.com/p/gperftools/?redir=1 (I didn't try it personally, but have heard good opinions of it).
Good luck!
Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs.
This and other Intel Linux development tools are available free for non-commercial use.
In the video Using the Timeline in Intel VTune Amplifier XE showing a timeline of a multi-threaded application, at 9:20 the presenter mentions
"...with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."
I think it will be rather difficult to build a simple profiler, simply because there are many different factors you have to consider, and system profiling is an inherently complex task, made all the more so when you are profiling a multithreaded application. The best advice I can think of is to look at something that already exists, for example OProfile.
One advantage of OProfile is that it is open source, so the source code is available. But beyond this, I suspect that asking how to build a profiling application might be beyond the scope of what someone can answer in an SO question, which might be why this question hasn't gotten many responses. Hopefully looking at an existing example will help you get started, and then, if you have more focused questions, you may get more detailed responses.
Fixed:
Well, this seems a bit silly. It turns out top was not displaying correctly and the programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine and this whole question is moot.
Thanks (and sorry for the silly question).
Original Q:
I am running a simulation on a computer running Ubuntu Server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all its information in memory. To run these simulations, I SSH in, nohup the program, and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done entirely in memory. The program is not killed, as it still has a PID after it stalls. I am using OpenMP, but have increased the max number of processes, and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume from your tags that this is an OpenMP program, though you never actually state this. Is ARPACK thread-safe?
It sounds like you are hitting a deadlock (more common in MPI programs than in OpenMP, but definitely possible). The first thing to do is to compile with debugging flags on; then, the next time you hit this problem, attach a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
Generally on UNIX you don't need to rebuild with debug flags on to get a meaningful stack trace. You wouldn't get file/line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program *pid* (which works well only when the program has been compiled with debugging enabled, i.e. with -g), or to use strace on it with strace -p *pid*. The strace command is a utility (technically, a specialized debugger built above the ptrace system call interface) which shows you all the system calls made by a program or process.
There is also a variant, called ltrace, that intercepts calls to functions in dynamic libraries.
To get a feeling for it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
I need to use Valgrind to detect any memory access violations made in a server application. The server creates many threads. I suspect that there is a race condition that causes the server to crash every hour or so. We used Valgrind to analyze its memory usage, but the server process slowed down dramatically. The server became so slow that it was hardly usable and no race conditions were likely to occur.
Is there any way to run Valgrind in parallel with our application so we don't lose that much performance?
You can't do that. Valgrind doesn't actually execute your code natively - instead it runs it inside a simulator. That's why it's so slow. So, there's no way to make it run faster, and still get the benefit of Valgrind.
Your best bet is to set the ulimit so that your program generates a core file when it crashes. Then you can try to work out what the problem was by examining the core.
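A sketch of doing the equivalent of "ulimit -c unlimited" from inside the program itself, in case the launching environment is hard to control; note that setrlimit() cannot raise the limit above the hard limit without privileges.

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        /* Raise the core-file soft limit to the hard limit, so a crash
         * actually leaves a core file to inspect. */
        if (getrlimit(RLIMIT_CORE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;
            if (setrlimit(RLIMIT_CORE, &rl) != 0)
                perror("setrlimit(RLIMIT_CORE)");
        }
        /* ... start the server here ... */
        return 0;
    }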
It's worth noting that Valgrind, while supporting multi-threaded programs, will not actually run the program's threads in parallel even if you have multiple cores available. It also interleaves threads at a finer grain than the native OS scheduler. These two facts combined might make a program with race conditions or other concurrency anomalies behave differently.
You may want to try Helgrind, a tool primarily aimed at detecting violations of locking discipline, and DRD, a tool primarily aimed at detecting data races.
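For reference, the kind of unsynchronized access these tools flag looks like this deliberately racy sketch (the file name in the build comment is illustrative):

    /* Two threads increment a shared counter with no mutex - a data race.
     * Build: gcc -g -pthread race.c
     * Run:   valgrind --tool=helgrind ./a.out   (or --tool=drd) */
    #include <pthread.h>
    #include <stdio.h>

    static long counter;                 /* shared, unprotected */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter++;                   /* racy read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }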
This is not directly answering your question, but if you suspect a synchronization error, have you tried using the Valgrind tool Helgrind?
Valgrind works by hooking into your malloc calls (among other instrumentation), so you can expect your program to run slower under Valgrind. So I would say that you cannot run your program faster under Valgrind and still get the benefit of analysing memory errors.
Right now I am loading a file, then using gettimeofday and tracking the CPU time with tv_usec.
My results vary: I get values in the 250s to 280s, but sometimes in the 300s or 500s. I tried usleep and sleep(0) and sleep(1) with no success; the time still varies vastly. I thought sleep(1) (seconds on Linux, not the Windows Sleep in ms) would have solved it. How can I keep track of time in a more consistent way for testing? Maybe I should wait until I have a much larger test data set and more complex code before starting measurements?
The currently recommended interface for high-resolution time on Linux (and POSIX in general) is clock_gettime. See the man page.
clock_gettime(CLOCK_REALTIME, &tp);           // for wall-clock time
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp); // for CPU time
// (tp is a struct timespec)
But read the man page. Note that you need to link with -lrt, because POSIX says so, I guess. Maybe to avoid symbol conflicts in -lc, for old programs that defined their own clock_gettime? But dynamic libs use weak symbols...
The best sleep function is nanosleep. It doesn't mess around with signals the way usleep can; it is defined to just sleep and have no other side effects. And it tells you if you woke up early (e.g. from a signal), so you don't necessarily have to call another time function.
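A small sketch combining the two, assuming you want CPU time around the code under test and a signal-safe sleep (link with -lrt on older glibc):

    #include <errno.h>
    #include <stdio.h>
    #include <time.h>

    static double elapsed_sec(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
        /* ... code under test, e.g. loading the file ... */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
        printf("CPU time: %.9f s\n", elapsed_sec(&start, &end));

        /* Sleep 10 ms, restarting if a signal wakes us early. */
        struct timespec req = { 0, 10 * 1000 * 1000 }, rem;
        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;

        return 0;
    }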
Anyway, you're going to have a hard time testing one rep of something that short that involves a system call. There's a huge amount of opportunity for variation: e.g. the scheduler may decide that some other work needs doing (unlikely if your process just started, since you won't have used up your timeslice yet), and CPU cache (L2 and TLB) misses are another easy source of noise.
If you have a multi-core machine and a single-threaded benchmark for the code you're optimizing, you can give it realtime priority, pinned to one of your cores. Make sure you choose a core that isn't handling interrupts, or your keyboard (and everything else) will be locked out until it's done. Use taskset (to pin to one CPU) and chrt (to set realtime priority).
See this mail I sent to gmp-devel with this trick:
http://gmplib.org/list-archives/gmp-devel/2008-March/000789.html
Oh yeah, for the most precise timing you can use rdtsc yourself (on x86/amd64). If you don't have any other syscalls in what you're benchmarking, it's not a bad idea. Grab a benchmarking framework to put your function into; GMP has a pretty decent one, though it's maybe not set up well for benchmarking functions that aren't in GMP and called mpn_whatever. I don't remember, and it's worth a look.
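If you do go the rdtsc route, GCC and Clang expose it as __rdtsc() in x86intrin.h; treat the numbers as rough, since the counter is not serializing and is affected by frequency scaling and core migration (hence the pinning advice above):

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc(), x86/amd64 only */

    int main(void)
    {
        unsigned long long start = __rdtsc();
        /* ... code under test ... */
        unsigned long long end = __rdtsc();
        printf("approx. %llu cycles\n", end - start);
        return 0;
    }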
Are you trying to measure how long it takes to load a file? Usually if you're performance testing some bit of code that is already pretty fast (sub-second), then you will want to repeat the same code a number of times (say a thousand or a million), time the whole lot, then divide the total time by the number of iterations.
Having said that, I'm not quite sure what you're using sleep() for. Can you post an example of what you intend to do?
I would recommend putting that code in a for loop and running it over 1000 or 10000 iterations. There are problems with this if you're only doing a few instructions, but it should help.
Larger data sets also help of course.
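A minimal sketch of the repeat-and-divide approach; work_under_test() is a placeholder for whatever you are measuring:

    #include <stdio.h>
    #include <time.h>

    static volatile long sink;
    static void work_under_test(void) { sink++; }   /* replace with the real code */

    int main(void)
    {
        enum { ITERATIONS = 10000 };
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERATIONS; i++)
            work_under_test();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double total = (end.tv_sec - start.tv_sec)
                     + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%.3f us per iteration\n", total / ITERATIONS * 1e6);
        return 0;
    }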
sleep is going to deschedule your thread from the CPU. It does not count time accurately or with any precision.