I have to use Cygwin on Windows, but although startup is fast, the execution of commands is very, very slow (as many users complain). When run on a Linux partition, the same commands take at most one tenth of the time. Is there some way to make it faster? I followed the steps found here (second answer), but they didn't help.
What is slow in Cygwin is the execution of the fork call. Because Windows does not provide the primitives needed to mimic a POSIX fork easily, the Cygwin DLL uses a lot of tricks to emulate one, and that makes the whole operation slow.
Running a single program that does not fork is as fast as a normal Windows program.
Suppose I have a multi-core laptop.
I write some code in Python and run it;
then, while my Python code is running, I open MATLAB and run some other code.
What is going on underneath? Will these two processes automatically run in parallel on multiple cores?
Or does the computer wait for one to finish and then process the other?
Thank you !
P.S. The two programs I am referring to can be considered the simplest in nature, e.g. calculate 1+2+3.....+10000000
The answer is... it depends!
Your operating system is constantly switching which processes are running. There are tons of processes always running in the background - refreshing the screen, posting sound to the speakers, checking for updates, polling the mouse, etc. - and those processes can only actually execute if they get some amount of processor time. If you have many cores, the OS will use some sort of heuristics to figure out which processes should get some time on the cores. You have the illusion that everything is running at the same time because (1) in some sense, things are running at the same time because you have multiple cores, and (2) the switching happens so fast that you can't notice it happen.
The reason I'm bringing this up is that if you run both Python and MATLAB at the same time, while in principle they could easily run at the same time, it's not guaranteed that that happens because you may have a ton of other things going on as well. It might be that both Python and MATLAB run for a bit concurrently, then both temporarily get paused to allow some program that's playing music to load the next sound clip to be played into memory, then one pauses while the OS pages in some memory from disk and another takes over, etc.
Can you assume that they'll run in parallel? Sure! Most reasonable OSes will figure that out and do it correctly. Can you assume that they are the only things running in parallel and nothing else is? Not necessarily.
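As a rough way to see this for yourself, here is a minimal sketch that starts two independent processes at once and lets the OS scheduler place them. The helper name run_children and the summing workload are made up for illustration, and both children are Python rather than Python plus MATLAB. If the wall-clock time for two children is close to the time for one, they most likely ran on separate cores.

```python
import subprocess
import sys
import time

# Hypothetical workload: each child process sums the integers 1..n.
CHILD_CODE = "import sys; n = int(sys.argv[1]); print(sum(range(1, n + 1)))"


def run_children(n, count=2):
    """Start `count` independent Python processes at once and collect their sums."""
    start = time.perf_counter()
    # Popen returns immediately, so all children are runnable at the same time
    # and the OS scheduler decides which cores they get.
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(n)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]
    results = [int(child.communicate()[0]) for child in children]
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_children(10_000_000)
    print(results, f"{elapsed:.2f}s wall-clock for both")
```

Comparing run_children(n, count=1) with run_children(n, count=2) on an otherwise idle machine gives a feel for how much the two processes overlap; the numbers will vary with whatever else is running.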
I am porting a Linux application to Windows, and I have found that many changes are needed in the multithreading part.
What is the Windows equivalent of the Linux type "pthread_t"?
What is the Windows equivalent of the Linux type "pthread_attr_t"?
Can you please give me some tips for porting?
Thanks...
The equivalent to pthread_t would be (as is so often the case) a HANDLE on Windows - which is what CreateThread returns.
There is no direct equivalent of pthread_attr_t. Instead, the attributes of a thread, such as the stack size and whether the thread is initially suspended, are passed to CreateThread as arguments.
In the cases I saw so far, writing a small wrapper around pthreads so that you can have an alternative implementation for Windows was surprisingly simple. The most irritating thing for me was that on Windows, a Mutex is not the same thing as on Linux: on Windows, it's a handle which can be accessed from multiple processes. The thing which the pthread library calls mutex is called "critical section" on Windows.
That being said, if you find yourself writing more than just a few dozen lines of wrapper code, you might want to have a look at the C++11 thread library or at the thread support in Boost to avoid reinventing the wheel (and possibly getting it wrong).
Here is your tip - "pthread is POSIX".
MinGW has pthreads,
Cygwin has pthreads, and so on.
My advice is to stick with MinGW and try not to change anything.
Fixed:
Well this seems a bit silly. Turns out top was not displaying correctly and programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine and this whole question was moot.
Thanks (and sorry for the silly question).
Original Q:
I am running a simulation on a computer running Ubuntu server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all information in memory. In order to run these simulations, I SSH and nohup the program and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done completely in memory. The program is not killed, as it still has a PID after it stalls. I am using OpenMP, but have increased the max number of processes, and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume this is an OpenMP program from your tags, though you never actually state this. Is ARPACK threadsafe?
It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but it's definitely possible). The first thing to do is to compile with debugging flags on, then the next time you find this problem, attach with a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
Generally on UNIX you don't need to rebuild with debug flags on to get a meaningful stack trace. You wouldn't get file/line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program *pid* (which works well only when the program has been compiled with debugging enabled with -g), or to use strace on it with strace -p *pid*. The strace command is a utility (technically, a specialized debugger built on top of the ptrace system call interface) which shows you all the system calls made by a program or a process.
There is also a variant called ltrace, which intercepts calls to functions in dynamic libraries.
To get a feeling of it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
Regards.
Basile Starynkevitch
Now here's something interesting. When I have more than one thread in Tcl invoking package require Expect, I get a seg fault.
e.g.
package require Thread
package require Expect
set t [thread::create]
thread::send $t {package require Expect}
puts "blarg! Damned thing crashes before I get here"
This is not a good time. Any thoughts?
Expect and threads don't go together too well. It's the complexity you get from fork() plus threads that can bite a lot there and lead to deadlocks and all kinds of ugliness. It is usually not a good idea to combine the two.
If you really need Expect and the added concurrency, a multi-process approach with one multithreaded driver program and one single-threaded Expect process might work better. If you use tcllib's comm package, the APIs for sending commands are not that different either (you mostly miss the tsv and tpool stuff if you use comm).
But it certainly shouldn't segfault. Which Expect/Thread/Tcl core combination did you use (e.g. ActiveState's ActiveTcl bundle, or some self-compiled build on an unusual platform)?
It's all from the latest Debian packages, Ubuntu 9.04, 64 bit.
One alternative is to organize the code such that one thread is dedicated to handling all expect calls...which isn't the most elegant, generic solution but it might have to do.
The C code of the Expect library (loaded with package require Expect) is not thread-safe (it probably uses global variables or something similar). I tried hard to work around this limitation because I wanted a load-balancing algorithm based on the Thread library that would drive some Expect code launching builds on a pool of slave machines. Unless you are very good at C and want to improve Expect, I would suggest launching Expect interpreters (each in its own OS process) whenever you need to use it from your Thread-enabled program. Of course, I don't know the problem you are trying to solve, and this would only work if the Expect jobs are unrelated to each other. Good luck anyway.
What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process?
My problem is that the forking process is ~500 MB in size, and a simple benchmark achieves only about 50 forks/s from it (cf. ~1600 forks/s from a minimally sized process), which is too slow for the intended application.
Some googling turns up vfork as having been invented as the solution to this problem... but also warnings not to use it. Modern Linux seems to have acquired the related clone and posix_spawn calls; are these likely to help? What's the modern replacement for vfork?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(3) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(3), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h>, and then simply call posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK).
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
Benchmarking the various spawns over 1000 runs at 100M RSS:
                          user     system      total        real
fspawn (fork/exec):   0.100000  15.460000  40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000   0.010000   0.540000 (  0.970577)
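For comparison, the same call is reachable from Python, which may be handy for quick experiments. This is only a sketch: spawn_and_wait is a made-up helper name, and Python's os.posix_spawn does not expose the POSIX_SPAWN_USEVFORK flag, so whether a vfork-style path is taken depends on the C library underneath (as far as I know, newer glibc releases use a vfork-like clone regardless of the flag).

```python
import os
import sys


def spawn_and_wait(argv):
    """Run argv[0] with posix_spawn and return the child's exit status."""
    # os.posix_spawn wraps the C library's posix_spawn(); the path must be
    # absolute or relative, because no PATH search is done here.
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "print('hello from child')"])
    print("child exited with", code)
```

This needs Python 3.9 or later on a Unix-like system (os.posix_spawn arrived in 3.8, os.waitstatus_to_exitcode in 3.9).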
Outcome: I was going to go down the early-spawned helper subprocess route suggested by other answers here, but then I came across this, about using huge page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks per second doesn't really show how big the overhead will be. You should measure the time consumed by the forks themselves, and (this is general advice) only by the forks you actually perform, rather than benchmarking maximum throughput.
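A minimal sketch of such a measurement, assuming a Unix-like system: time_forks is a made-up helper, the 4096-byte page size is an assumption, and the timing includes the child's exit and the parent's wait, not the fork() call alone. The ballast buffer stands in for the memory a large process would have mapped.

```python
import os
import time


def time_forks(count=20, ballast_mb=0):
    """Return average seconds per fork+exit+wait with `ballast_mb` MiB touched."""
    # bytearray(n) is zero-filled; writing one byte per (assumed 4096-byte) page
    # makes sure the pages are really mapped, so the kernel has page-table
    # entries to copy when the process forks.
    ballast = bytearray(ballast_mb * 1024 * 1024)
    for offset in range(0, len(ballast), 4096):
        ballast[offset] = 1

    start = time.perf_counter()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: leave immediately, skipping cleanup handlers
        os.waitpid(pid, 0)  # parent: reap the child before the next iteration
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    for size in (0, 100):
        print(f"{size:4d} MiB ballast: {time_forks(ballast_mb=size) * 1e3:.3f} ms per fork")
```

Running it with a few ballast sizes close to the real process size shows whether fork time actually grows enough to matter for the application.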
But if you do find that forking a large process is slow, you can spawn a small ancillary process, pipe the master process's output to its input, and have it receive the commands to exec. The small process will then fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly those without an MMU on board), processes are spawned via a syscall whose interface is posix_spawn or a similar function. Quoting the informative section of the POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think you will benefit from this function on a desktop system if your goal is to minimize time consumption.
If you know the number of subprocesses ahead of time, it might be reasonable to pre-fork your application on startup and then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program, it might be reasonable to fork a subprocess or two ahead of time for quick turnaround later. Neither of these options directly solves the problem, but if either approach suits your app, it might allow you to side-step the issue.
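The helper-process idea can be sketched roughly as follows. Everything here is illustrative rather than a recommended design: SpawnHelper, HELPER_CODE and the one-JSON-line-per-command protocol are invented for this example. The point is that the helper is started once, ideally before the main process has grown, and every later launch is forked from the small helper instead of the large parent.

```python
import json
import subprocess
import sys

# Source of the small helper: it reads one JSON-encoded argv per line,
# runs that command, and writes the exit code back on a line of its own.
HELPER_CODE = r"""
import json, subprocess, sys
for line in sys.stdin:
    argv = json.loads(line)
    # Keep the command away from the control pipes: no stdin, stdout to stderr.
    code = subprocess.call(argv, stdin=subprocess.DEVNULL, stdout=sys.stderr)
    print(code, flush=True)
"""


class SpawnHelper:
    """Start a small child early; ask it to launch commands later."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", HELPER_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, argv):
        """Send one command to the helper and return its exit code."""
        self.proc.stdin.write(json.dumps(argv) + "\n")
        self.proc.stdin.flush()
        return int(self.proc.stdout.readline())

    def close(self):
        """Closing stdin ends the helper's read loop; then reap it."""
        self.proc.stdin.close()
        self.proc.wait()


if __name__ == "__main__":
    helper = SpawnHelper()
    print("exit code:", helper.run([sys.executable, "-c", "print('ran via helper')"]))
    helper.close()
```

This version runs commands one at a time and only reports exit codes; passing output back or running several commands concurrently would need a richer protocol.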
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result of this, any change to any non-stack variable made by the child is visible by the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
The most important: This thread-like child process can call exec().
In a nutshell, by calling clone in the following way, we create a child process which is very similar to a thread but still can call exec():
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.