I am trying to get some code running that is "embarrassingly parallel", so I have just started looking into parallel processing. I am trying to use parLapply on a Linux machine (it works perfectly fine on my Windows machine, whereas mclapply would restrict the code to Linux), but I am encountering some problems.
This is what my code looks like:
cl <- makeCluster(detectCores(), type="FORK") # fork -> psock when I use Win
clusterExport(cl, some.list.of.things)
out <- parLapply(cl, some.list, some.fun) # parLapply needs the data as well as the function
stopCluster(cl)
At first, I noticed that the parallel implementation was actually much slower than the sequential one, the reason being that on my Linux machine, each child process inherits the CPU affinity of the parent. At least I think I can draw this conclusion from the observation that in the system monitor, all my R session processes got only about 8% or so CPU time, and only one core was used. See this really helpful thread here.
I ended up using the code of that last thread, namely:
system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
I should mention that I am not in any way familiar with Linux basics. It is my university's server, run by other people, and I have no idea what the above code actually means and does, apart from changing "1" to "ff" (whatever "ff" stands for). Anyway, after executing the above code, I can see that 3 out of 8 of my child processes receive almost full CPU time, which is a big improvement.
Having said that, there are 8 cores (determined by detectCores()), and 8 child processes (as seen in the systems monitor), but "only" 3 child processes are working.
Given that I am completely new to parallel processing, I was wondering if you could give me some guidance on how to get all 8 cores used. I don't even know what I should be looking for to fix this situation. Any pointers to what I should change or what might be the problem would be highly appreciated!
Related
Suppose I have a multi-core laptop.
I write some code in Python and run it; then, while my Python code is running, I open MATLAB and run some other code.
What is going on underneath? Will these two processes be run in parallel on multiple cores automatically?
Or does the computer wait for one to finish and then process the other?
Thank you !
P.S. The two programs I am referring to can be considered the simplest in nature, e.g. calculate 1+2+3.....+10000000
The answer is... it depends!
Your operating system is constantly switching which processes are running. There are tons of processes always running in the background - refreshing the screen, posting sound to the speakers, checking for updates, polling the mouse, etc. - and those processes can only actually execute if they get some amount of processor time. If you have many cores, the OS will use some sort of heuristics to figure out which processes should get some time on the cores. You have the illusion that everything is running at the same time because (1) in some sense, things are running at the same time because you have multiple cores, and (2) the switching happens so fast that you can't notice it happen.
The reason I'm bringing this up is that if you run both Python and MATLAB at the same time, while in principle they could easily run at the same time, it's not guaranteed that that happens because you may have a ton of other things going on as well. It might be that both Python and MATLAB run for a bit concurrently, then both temporarily get paused to allow some program that's playing music to load the next sound clip to be played into memory, then one pauses while the OS pages in some memory from disk and another takes over, etc.
Can you assume that they'll run in parallel? Sure! Most reasonable OSes will figure that out and do it correctly. Can you assume that they are exclusively running in parallel and nothing else is? Not necessarily.
Let me begin by saying I do not have in-depth knowledge of Perl, so please pardon me if there is something obvious that I have missed :)
In the system (running in a Windows environment) that I am looking at, we have a Perl process which has to download ~5000-6000 files. Since each file can be downloaded independently, we fork a separate thread for each file. The thread is supposed to download the file and die. While running the process, I noticed that its memory climbs to ~1.7 GB, and it then dies because it hits the per-process memory limit.
On searching and asking a few people, I came across this concept of circular referencing due to which the garbage collector will not free up memory. I searched a bit and found the Devel-Cycle package which can find out if there are any cycles in the object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I learned that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process keeps running fine until it reaches the memory limit). I guess there might be more DB handles from the parent in the object that are not used by the child but are still referenced.
Questions that I have:
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow up in memory (in other words is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHandle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (pseudo-processes created by Perl's fork call on Windows) are spawned at the same time. The process spawns at most n threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another one. At the moment n is set to 10; however, I changed n to 1 (so only one extra thread runs at a time), and I still hit the memory limit.
edit: Turns out, this does not solve the OP's problem. It might still be helpful for a future reader.
We do not really know a lot about your situation, and your program sounds too complex to me to just fork it 6000 times. But I will still attempt to answer; please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note that Windows has no fork() system call. Since you specifically say that you "fork", I assume that you actually use that Perl built-in. On Windows, it will emulate fork() as best it can, but what that basically means is that all the forked processes you see are in fact just threads within the original process, pretending to be processes. To do this, each one copies the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. Especially the following part seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork so many pseudo-processes, you need a lot of memory, as you also have to copy the interpreter state that many times. And depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7 GB you mentioned is not far from the 2 GB memory limit that some Windows versions impose on a single process.
My wild guess would be that you in fact just hit that limit by spawning all those threads, each with its own copy of the interpreter state and everything.
You will probably be a lot better off using a threading library instead of asking Perl to emulate individual processes for you. Needless to say (I hope), you do not really gain any advantage from having 6000 threads over, let's say, 16. If you try to have all of them do something at the same time, you will most likely experience slowdowns, depending on how the threading is handled.
In addition to the comments already provided, I want to emphasize DeVadder's point regarding the behavior of fork on Windows, and that Perl threading is likely a better solution. But are you sure that the DBD module is safe to use from multiple processes / forks / threads, etc. without setting some extra parameters?
I had a similar error when using the DBI module to access a SQLite DB in multi-processed code using the threads module. It was solved by setting the 'use_immediate_transaction' option for the database handle provided by DBI to 1. If you aren't familiar with how Perl threads work, they aren't threads, they create a copy of the interpreter and everything you have in memory at the time of their creation, but even if I made the database handle separately in each "thread" I would get 'database locked' and various other errors. Without some of these extra options DBD may not function correctly in a multiprocessed environment.
Also, why make 6000 forks? Use Thread::Queue and the threads module, make a worker pool of a few workers (one per core?), and recycle the workers. You are incurring a lot of overhead on every fork for no gain.
I have been asked to write test cases that demonstrate in practice the performance of a semaphore versus a read-write semaphore in the case of many readers and few writers, and vice versa.
I have implemented the semaphore (in kernel space, as we were actually asked to), but I cannot work out how to write the use cases and carry out a practical, categorical evaluation of them.
Why don't you just write your two versions of the code (semaphore / R/W semaphore) to start? The use cases will depend on the actual feature being tested. Is it a device driver? Is it IO related at all? Is it networking related? It's hard to come up with use cases without knowing this.
Generally what I would do for something like an IO benchmark would be running multiple simulations over an increasing memory footprint for a set of runs. Another set of runs might be over an increasing process load. Another may be over different block sizes. I would compare each one of those against something like aggregate bandwidth and see how performance (aggregate bandwidth in this case) changed over those tests.
Again, your use cases might be completely different if you are testing something like a USB driver.
Using your custom semaphores, write the following two C programs and compile them:
reader.c
writer.c
As a simple rudimentary test, write a shell script test.sh and add your commands to load the test binaries as follows.
#!/bin/sh
./reader &
./reader &
./reader &
./reader &
./writer &
Launching the above shell script as ./test.sh will launch 4 readers and 1 writer. Customise this to your test scenario.
Ensure that your programs are operating properly i.e. verify data is being exchanged properly first before trying to profile the performance.
Once you are sure that IPC is working as expected, profile the cpu usage. Prior to launching test.sh, run the top command in another terminal. Observe the cpu usage patterns for varying number of readers/writers during the run-time of test script.
You can also launch the individual binaries (standalone or in the test script) with:
time <binary>
to print the total run time and the time spent in the kernel (the sys figure), and with:
perf record <binary>
and after completion, run perf annotate main
To obtain the relative amount of time spent in various sections of the code.
I’ve begun studying Erlang and find the BEAM runtime environment fascinating. It’s commonly stated that in Erlang, processes belong to the language rather than the OS (meaning the runtime, meaning BEAM in this case). These are the lightweight "green processes" that Erlang is becoming famous for. It’s further stated (on page 5 of this paper) that BEAM uses one (1) OS thread per CPU core for scheduling and another OS thread for I/O. So I wonder: where do the CPU cycles needed to actually execute the Erlang code come from?
Further, if I’m running on a dual-core machine I would expect -- based on what I’ve read so far -- to see three (3) threads running under the BEAM process: two schedulers (one for each core) and one I/O thread. But I see 10. Sometimes 11. Sometimes it starts at 13 and, like high-quality amplifiers, goes to 11.
I’m confused. Any insight will be appreciated.
Following #user425720's advice, I asked my question on the erlang-questions LISTSERV. It's also available as a Google Group. Kresten Krab Thorup of Trifork answered me almost at once. My thanks go out to Kresten. Here is his answer. (Parentheticals and emphasis are mine.)
Here is AFAIK, the basic scenario:
Erlang code will be run in as many "green threads" as there are processes; the process limit is controlled by the +P (command line) flag.

The green threads are mapped onto S threads, where S is the number of cores/CPUs. The fact that these threads are also called schedulers can seem somewhat confusing, but from the VM's point of view they are. From the developer's point of view, they are the threads that run your Erlang code. The number S can be controlled with the +S option to the erl command line.

In addition hereto, there are a number of so-called "Async Threads". That's a thread pool which is used by I/O processes called linked-in drivers, to react to select / poll etc. The number of async threads is dynamic, but limited by the +A flag.

So, the 11 threads you see on a dual-core may be 2 schedulers and 9 async threads, for instance.
Read more about the flags here.
Erlang processes are not 'green' in the way that threads are green in Java. Erlang processes are structures which do not share memory, and they are maintained by the Erlang VM.
It may sound strange, but that paper could be 'old' (even though it's from 2007). It all changed around the R13 release, when we got brand-new handling of the run queues (with dynamic balancing and other goodies). Here is a presentation by Ulf Wiger about it: http://ulf.wiger.net/weblog/2009/01/23/erlang-programming-for-multicore/
To sum up, processes are completely transparent, and you may adjust the number of run queues and schedulers, but the OS-level realization is not exposed. I do not want to speculate about why you see something like 11 threads.
EDIT: I was a bit wrong about the OS side:
+S Schedulers:SchedulersOnline
Sets the amount of scheduler threads to create and scheduler threads to set online when SMP support has been enabled.
Valid range for both values is 1-1024. If the Erlang runtime system is able to determine the amount of logical processors configured and logical processors available, Schedulers will default to logical processors configured, and SchedulersOnline will default to logical processors available; otherwise, the default values will be 1. Schedulers may be omitted if :SchedulersOnline is not, and vice versa. The amount of schedulers online can be changed at run time via erlang:system_flag(schedulers_online, SchedulersOnline).
...
This flag will be ignored if the emulator doesn't have SMP support enabled (see the -smp flag).
from here: http://www.erlang.org/doc/man/erl.html
EDIT2: Interesting discussion on erlang-question mailing list on pros and cons of many VMs vs many schedulers. Unfortunately it is also from 2008 and may not be valid with huge improvements in new OTP releases. http://www.erlang.org/cgi-bin/ezmlm-cgi?4:mss:38165:200809:nbihpkepgjcfnffkoobf
What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process ?
My problem is that the forking process is ~500 MB big, and a simple benchmark achieves only about 50 forks/s from that process (cf. ~1600 forks/s from a minimally sized process), which is too slow for the intended application.
Some googling turns up vfork as having been invented as the solution to this problem... but also warnings not to use it. Modern Linux seems to have acquired the related clone and posix_spawn calls; are these likely to help? What's the modern replacement for vfork?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK)
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS
user system total real
fspawn (fork/exec): 0.100000 15.460000 40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000 0.010000 0.540000 ( 0.970577)
Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this regarding using huge-page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks, and (as generic advice) only by the forks you actually perform, rather than benchmarking maximum fork throughput.
But if you really find that forking a large process is slow, you may spawn a small ancillary process, pipe the master process to its input, and have it receive commands to exec. The small process will fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly those without an MMU), processes are spawned via a syscall whose interface is posix_spawn or a similar function. Quoting the informative section of the POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.
If you know the number of subprocesses ahead of time, it might be reasonable to pre-fork your application on startup and then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program, it might be reasonable to fork a subprocess or two ahead of time for quick turnaround later. Neither option directly solves the problem, but if either approach suits your app, it might let you side-step the issue.
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:
- The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result, any change to any non-stack variable made by the child is visible to the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.
- The child starts from an entry function which is called right after the child is created. This is like threads, and unlike fork().
- The child has a separate stack space, which is similar to threads and fork(), but entirely different from vfork().
- Most important: this thread-like child process can call exec().
In a nutshell, by calling clone in the way shown above, we create a child process which is very similar to a thread but still can call exec():
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.