Faster forking of large processes on Linux?

Faster forking of large processes on Linux? - linux

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process ?
My problem is that the process forking is ~500MByte big, and a simple benchmarking test achieves only about 50 forks/s from the process (c.f ~1600 forks/s from a minimally sized process) which is too slow for the intended application.
Some googling turns up vfork as having being invented as the solution to this problem... but also warnings about not to use it. Modern Linux seems to have acquired related clone and posix_spawn calls; are these likely to help ? What's the modern replacement for vfork ?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).

On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK)
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS
user system total real
fspawn (fork/exec): 0.100000 15.460000 40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000 0.010000 0.540000 ( 0.970577)

Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this re using huge page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.

Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks, and (which is a generic advice) consumed only by the forks you actually perform, not by benchmarking maximum performance.
But if you really figure out that forking a large process is a slow, you may spawn a small ancillary process, pipe master process to its input, and receive commands to exec from it. The small process will fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly, in those without MMU on board), processes are spawned via a syscall, interface to which is posix_spawn or a similar function. Quoting the informative section of POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.

If you know the number of subprocess ahead of time, it might be reasonable to pre-fork your application on startup then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program it might be reasonable to fork ahead of time a subprocess or two for quick turnaround at a later time. Neither of these options would directly solve the problem but if either approach is suitable to your app, it might allow you to side-step the issue.

I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a
child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is
created. As a result of this, any change to any non-stack variable
made by the child is visible by the parent process. This is similar to
threads, and therefore completely different from fork(), and also very
dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
The most important: This thread-like child process can call exec().
In a nutshell, by calling clone in the following way, we create a
child process which is very similar to a thread but still can call
exec():
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.

Related

exec without replacing current process, posix_spawn kernel implementation

Although the kernel marks pages (and page tables) as copy on write to make the fork syscall work efficiently, the creation and tear-down of page tables and related structures is still an expensive task.
Thus I wonder why the linux community has never managed to implement posix_spawn as a real kernel syscall that just spawns a new process, eliminating the need to call fork beforehand.
Instead, posix_spawn is just a poor glibc wrapper around fork and exec.
The performance gains would be significantly for workloads that have to spawn thousands of new processes every second. The latency for launching new processes would be improved as well.

That's basically what posix_spawn is for. It is also a more flexible API. The real bug is that the Linux exec man page still doesn't include a cross-reference for it.

Fork with copy-on-write is very expensive. To illustrate this, you might want to read the implementation of classic vfork semantics in NetBSD. The mail provides some hard numbers for a real world use case, building software. COW for very large programs is also an easily measurable penalty. A friend of mine wrote his own spawn daemon for his Java application, because forking+exec from a 8GB+ JVM took way too long.
The main problem with vfork in the modern world is that it can interact badly with multi-threading. I.e. consider that the post-vfork code has to reference a function that hasn't been resolved by the dynamic linker yet. The dynamic linker now has to lock itself. This can result in dead locks with the original program for example.

GC not able to collect back memory using fork-emulation on Windows

Let me begin by saying I do not have in depth knowledge of Perl so please pardon me if there is something obvious that I have missed :)
In the system (running in Windows environment) that I am looking at, we have a perl process which has to download ~5000-6000 files. Since each file can be independently downloaded, we forked separate threads for each file. The thread is supposed to download the file and die. On running the process, I noticed that the memory of the process goes up to ~1.7 GB and then dies due to the memory limit of each process.
On searching and asking a few people, I came across this concept of circular referencing due to which the garbage collector will not free up memory. I searched a bit and found the Devel-Cycle package which can find out if there are any cycles in the object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I got to know that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process still continues to run fine till it reaches the memory limit). I guess there might be more db handles from the parent in the object that are not used by the child but are still referenced.
Questons that I have -
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow up in memory (in other words is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHabndle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (due to the perl fork call in windows) are spawned at the same time. It spawns a max of n number of threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another thread. At this moment n is set to 10, however I had changed n to 1 (so only one extra thread is running at one time), and I still hit the memory limit.

edit: Turns out, this does not solve the Ops problem. Still might be helpful for a future reader.
We do not really know a lot about your situation and your program sounds quite complex to just fork it 6000 times to me. But i will still attempt to answer, please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note, that Windows has no fork() system call. And as you specifically note that you "fork", i just assume that you actually use that Perl command. On windows, this will try to emulate fork() as best as it can but what that basically means is, that all the forked processes you see, are in fact just threads within the original process, just pretending to be processes to you. To do this, they copy the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. Especially the following part seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork so many pseudo processes, you need a lot of memory as you also have to copy the interpreter state as often. And depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7GB you mentioned, is not far away from the 2GB that some Windows versions impose on you as memory limit for a single process.
My wild guess would be, that you in fact just hit that limit by spawning all those many many threads, each with its own copy of the interpreter state and everything.
You will probably be off a lot better using some threading library instead of asking Perl to emulate individual processes for you. Needless to mention (i hope) that you do not really gain any advantage by having 6000 threads over lets say 16. If you try to have all of them do something at the same time, you will in fact most likely experience slowdowns, depending on how the threading is handled.

In addition to the comments already provided, I want to emphasize the point DeVadder made regarding the behavior of fork in Windows and that Perl threading is likely a better solution but are you sure that the DBD module is safe to be used by multiple processes / forks / threads, etc without setting some extra parameters?
I had a similar error when using the DBI module to access a SQLite DB in multi-processed code using the threads module. It was solved by setting the 'use_immediate_transaction' option for the database handle provided by DBI to 1. If you aren't familiar with how Perl threads work, they aren't threads, they create a copy of the interpreter and everything you have in memory at the time of their creation, but even if I made the database handle separately in each "thread" I would get 'database locked' and various other errors. Without some of these extra options DBD may not function correctly in a multiprocessed environment.
Also, why make 6000 forks, use thread::queue and the threads module, make a worker pool of a few workers (one per core?) and recycle the workers. You are doing alot of overhead every fork for no gain.

When is clone() and fork better than pthreads?

I am beginner in this area.
I have studied fork(), vfork(), clone() and pthreads.
I have noticed that pthread_create() will create a thread, which is less overhead than creating a new process with fork(). Additionally the thread will share file descriptors, memory, etc with parent process.
But when is fork() and clone() better than pthreads? Can you please explain it to me by giving real world example?
Thanks in Advance.

clone(2) is a Linux specific syscall mostly used to implement threads (in particular, it is used for pthread_create). With various arguments, clone can also have a fork(2)-like behavior. Very few people directly use clone, using the pthread library is more portable. You probably need to directly call clone(2) syscall only if you are implementing your own thread library - a competitor to Posix-threads - and this is very tricky (in particular because locking may require using futex(2) syscall in machine-tuned assembly-coded routines, see futex(7)). You don't want to directly use clone or futex because the pthreads are much simpler to use.
(The other pthread functions require some book-keeping to be done internally in libpthread.so after a clone during a pthread_create)
As Jonathon answered, processes have their own address space and file descriptor set. And a process can execute a new executable program with the execve syscall which basically initialize the address space, the stack and registers for starting a new program (but the file descriptors may be kept, unless using close-on-exec flag, e.g. thru O_CLOEXEC for open).
On Unix-like systems, all processes (except the very first process, usuallyinit, of pid 1) are created by fork (or variants like vfork; you could, but don't want to, use clone in such way as it behaves like fork).
(technically, on Linux, there are some few weird exceptions which you can ignore, notably kernel processes or threads and some rare kernel-initiated starting of processes like /sbin/hotplug ....)
The fork and execve syscalls are central to Unix process creation (with waitpid and related syscalls).
A multi-threaded process has several threads (usually created by pthread_create) all sharing the same address space and file descriptors. You use threads when you want to work in parallel on the same data within the same address space, but then you should care about synchronization and locking. Read a pthread tutorial for more.
I suggest you to read a good Unix programming book like Advanced Unix Programming and/or the (freely available) Advanced Linux Programming

The strength and weakness of fork (and company) is that they create a new process that's a clone of the existing process.
This is a weakness because, as you pointed out, creating a new process has a fair amount of overhead. It also means communication between the processes has to be done via some "approved" channel (pipes, sockets, files, shared-memory region, etc.)
This is a strength because it provides (much) greater isolation between the parent and the child. If, for example, a child process crashes, you can kill it and start another fairly easily. By contrast, if a child thread dies, killing it is problematic at best -- it's impossible to be certain what resources that thread held exclusively, so you can't clean up after it. Likewise, since all the threads in a process share a common address space, one thread that ran into a problem could overwrite data being used by all the other threads, so just killing that one thread wouldn't necessarily be enough to clean up the mess.
In other words, using threads is a little bit of a gamble. As long as your code is all clean, you can gain some efficiency by using multiple threads in a single process. Using multiple processes adds a bit of overhead, but can make your code quite a bit more robust, because it limits the damage a single problem can cause, and makes it much easy to shut down and replace a process if it does run into a major problem.
As far as concrete examples go, Apache might be a pretty good one. It will use multiple threads per process, but to limit the damage in case of problems (among other things), it limits the number of threads per process, and can/will spawn several separate processes running concurrently as well. On a decent server you might have, for example, 8 processes with 8 threads each. The large number of threads helps it service a large number of clients in a mostly I/O bound task, and breaking it up into processes means if a problem does arise, it doesn't suddenly become completely un-responsive, and can shut down and restart a process without losing a lot.

These are totally different things. fork() creates a new process. pthread_create() creates a new thread, which runs under the context of the same process.
Thread share the same virtual address space, memory (for good or for bad), set of open file descriptors, among other things.
Processes are (essentially) totally separate from each other and cannot modify each other.
You should read this question:
What is the difference between a process and a thread?
As for an example, if I am your shell (eg. bash), when you enter a command like ls, I am going to fork() a new process, and then exec() the ls executable. (And then I wait() on the child process, but that's getting out of scope.) This happens in an entire different address space, and if ls blows up, I don't care, because I am still executing in my own process.
On the other hand, say I am a math program, and I have been asked to multiply two 100x100 matrices. We know that matrix multiplication is an Embarrassingly Parallel problem. So, I have the matrices in memory. I spawn of N threads, who each operate on the same source matrices, putting their results in the appropriate location in the result matrix. Remember, these operate in the context of the same process, so I need to make sure they are not stamping on each other's data. If N is 8 and I have an eight-core CPU, I can effectively calculate each part of the matrix simultaneously.

Process creation mechanism on unix using fork() (and family) is very efficient.
Morever , most unix system doesnot support kernel level threads i.e thread is not entity recognized by kernel. Hence thread on such system cannot get benefit of CPU scheduling at kernel level. pthread library does that which is not kerenl rather some process itself.
Also on such system pthreads are implemented using vfork() and as light weight process only.
So using threading has no point except portability on such system.
As per my understanding Sun-solaris and windows has kernel level thread and linux family doesn't support kernel threads.
with processes pipes and unix doamin sockets are very efficient IPC without synchronization issues.
I hope it clears why and when thread should be used practically.

Linux system call for creating process and thread

I read in a paper that the underlying system call to create processes and threads is actually the same, and thus the cost of creating processes over threads is not that great.
First, I wanna know what is the system call that creates
processes/threads (possibly a sample code or a link?)
Second, is
the author correct to assume that creating processes instead of
threads is inexpensive?
EDIT:
Quoting article:
Replacing pthreads with processes is surprisingly inexpensive,
especially on Linux where both pthreads and processes are invoked
using the same underlying system call.

Processes are usually created with fork, threads (lightweight processes) are usually created with clone nowadays. However, anecdotically, there exist 1:N thread models, too, which don't do either.
Both fork and clone map to the same kernel function do_fork internally. This function can create a lightweight process that shares the address space with the old one, or a separate process (and many other options), depending on what flags you feed to it. The clone syscall is more or less a direct forwarding of that kernel function (and used by the higher level threading libraries) whereas fork wraps do_fork into the functionality of the 50 year old traditional Unix function.
The important difference is that fork guarantees that a complete, separate copy of the address space is made. This, as Basil points out correctly, is done with copy-on-write nowadays and therefore is not nearly as expensive as one would think.
When you create a thread, it just reuses the original address space and the same memory.
However, one should not assume that creating processes is generally "lightweight" on unix-like systems because of copy-on-write. It is somewhat less heavy than for example under Windows, but it's nowhere near free.
One reason is that although the actual pages are not copied, the new process still needs a copy of the page table. This can be several kilobytes to megabytes of memory for processes that use larger amounts of memory.
Another reason is that although copy-on-write is invisible and a clever optimization, it is not free, and it cannot do magic. When data is modified by either process, which inevitably happens, the affected pages fault.
Redis is a good example where you can see that fork is everything but lightweight (it uses fork to do background saves).

The underlying system call to create threads is clone(2) (it is Linux specific). BTW, the list of Linux system calls is on syscalls(2), and you could use the strace(1) command to understand the syscalls done by some process or command. Processes are usually created with fork(2) (or vfork(2), which is not much useful these days). However, you could (and some C standard libraries might do that) create them with some particular form of clone. I guess that the kernel is sharing some code to implement clone, fork etc... (since some functionalities, e.g. management of the virtual address space, are common).
Indeed, process creation (and also thread creation) is generally quite fast on most Unix systems (because they use copy-on-write machinery for the virtual memory), typically a small fraction of a millisecond. But you could have pathological cases (e.g. thrashing) which makes that much longer.
Since most C standard library implementations are free software on Linux, you could study the source code of the one on your system (often GNU glibc, but sometimes musl-libc or something else).

Threads vs Processes in Linux [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
The community reviewed whether to reopen this question last year and left it closed:
Original close reason(s) were not resolved
Improve this question
I've recently heard a few people say that in Linux, it is almost always better to use processes instead of threads, since Linux is very efficient in handling processes, and because there are so many problems (such as locking) associated with threads. However, I am suspicious, because it seems like threads could give a pretty big performance gain in some situations.
So my question is, when faced with a situation that threads and processes could both handle pretty well, should I use processes or threads? For example, if I were writing a web server, should I use processes or threads (or a combination)?

Linux uses a 1-1 threading model, with (to the kernel) no distinction between processes and threads -- everything is simply a runnable task. *
On Linux, the system call clone clones a task, with a configurable level of sharing, among which are:
CLONE_FILES: share the same file descriptor table (instead of creating a copy)
CLONE_PARENT: don't set up a parent-child relationship between the new task and the old (otherwise, child's getppid() = parent's getpid())
CLONE_VM: share the same memory space (instead of creating a COW copy)
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing). **
forking costs a tiny bit more than pthread_createing because of copying tables and creating COW mappings for memory, but the Linux kernel developers have tried (and succeeded) at minimizing those costs.
Switching between tasks, if they share the same memory space and various tables, will be a tiny bit cheaper than if they aren't shared, because the data may already be loaded in cache. However, switching tasks is still very fast even if nothing is shared -- this is something else that Linux kernel developers try to ensure (and succeed at ensuring).
In fact, if you are on a multi-processor system, not sharing may actually be beneficial to performance: if each task is running on a different processor, synchronizing shared memory is expensive.
* Simplified. CLONE_THREAD causes signals delivery to be shared (which needs CLONE_SIGHAND, which shares the signal handler table).
** Simplified. There exist both SYS_fork and SYS_clone syscalls, but in the kernel, the sys_fork and sys_clone are both very thin wrappers around the same do_fork function, which itself is a thin wrapper around copy_process. Yes, the terms process, thread, and task are used rather interchangeably in the Linux kernel...

Linux (and indeed Unix) gives you a third option.
Option 1 - processes
Create a standalone executable which handles some part (or all parts) of your application, and invoke it separately for each process, e.g. the program runs copies of itself to delegate tasks to.
Option 2 - threads
Create a standalone executable which starts up with a single thread and create additional threads to do some tasks
Option 3 - fork
Only available under Linux/Unix, this is a bit different. A forked process really is its own process with its own address space - there is nothing that the child can do (normally) to affect its parent's or siblings address space (unlike a thread) - so you get added robustness.
However, the memory pages are not copied, they are copy-on-write, so less memory is usually used than you might imagine.
Consider a web server program which consists of two steps:
Read configuration and runtime data
Serve page requests
If you used threads, step 1 would be done once, and step 2 done in multiple threads. If you used "traditional" processes, steps 1 and 2 would need to be repeated for each process, and the memory to store the configuration and runtime data duplicated. If you used fork(), then you can do step 1 once, and then fork(), leaving the runtime data and configuration in memory, untouched, not copied.
So there are really three choices.

That depends on a lot of factors. Processes are more heavy-weight than threads, and have a higher startup and shutdown cost. Interprocess communication (IPC) is also harder and slower than interthread communication.
Conversely, processes are safer and more secure than threads, because each process runs in its own virtual address space. If one process crashes or has a buffer overrun, it does not affect any other process at all, whereas if a thread crashes, it takes down all of the other threads in the process, and if a thread has a buffer overrun, it opens up a security hole in all of the threads.
So, if your application's modules can run mostly independently with little communication, you should probably use processes if you can afford the startup and shutdown costs. The performance hit of IPC will be minimal, and you'll be slightly safer against bugs and security holes. If you need every bit of performance you can get or have a lot of shared data (such as complex data structures), go with threads.

Others have discussed the considerations.
Perhaps the important difference is that in Windows processes are heavy and expensive compared to threads, and in Linux the difference is much smaller, so the equation balances at a different point.

Once upon a time there was Unix and in this good old Unix there was lots of overhead for processes, so what some clever people did was to create threads, which would share the same address space with the parent process and they only needed a reduced context switch, which would make the context switch more efficient.
In a contemporary Linux (2.6.x) there is not much difference in performance between a context switch of a process compared to a thread (only the MMU stuff is additional for the thread).
There is the issue with the shared address space, which means that a faulty pointer in a thread can corrupt memory of the parent process or another thread within the same address space.
A process is protected by the MMU, so a faulty pointer will just cause a signal 11 and no corruption.
I would in general use processes (not much context switch overhead in Linux, but memory protection due to MMU), but pthreads if I would need a real-time scheduler class, which is a different cup of tea all together.
Why do you think threads are have such a big performance gain on Linux? Do you have any data for this, or is it just a myth?

I think everyone has done a great job responding to your question. I'm just adding more information about thread versus process in Linux to clarify and summarize some of the previous responses in context of kernel. So, my response is in regarding to kernel specific code in Linux. According to Linux Kernel documentation, there is no clear distinction between thread versus process except thread uses shared virtual address space unlike process. Also note, the Linux Kernel uses the term "task" to refer to process and thread in general.
"There are no internal structures implementing processes or threads, instead there is a struct task_struct that describe an abstract scheduling unit called task"
Also according to Linus Torvalds, you should NOT think about process versus thread at all and because it's too limiting and the only difference is COE or Context of Execution in terms of "separate the address space from the parent " or shared address space. In fact he uses a web server example to make his point here (which highly recommend reading).
Full credit to linux kernel documentation

If you want to create a pure a process as possible, you would use clone() and set all the clone flags. (Or save yourself the typing effort and call fork())
If you want to create a pure a thread as possible, you would use clone() and clear all the clone flags (Or save yourself the typing effort and call pthread_create())
There are 28 flags that dictate the level of resource sharing. This means that there are over 268 million flavours of tasks that you can create, depending on what you want to share.
This is what we mean when we say that Linux does not distinguish between a process and a thread, but rather alludes to any flow of control within a program as a task. The rationale for not distinguishing between the two is, well, not uniquely defining over 268 million flavours!
Therefore, making the "perfect decision" of whether to use a process or thread is really about deciding which of the 28 resources to clone.

How tightly coupled are your tasks?
If they can live independently of each other, then use processes. If they rely on each other, then use threads. That way you can kill and restart a bad process without interfering with the operation of the other tasks.

To complicate matters further, there is such a thing as thread-local storage, and Unix shared memory.
Thread-local storage allows each thread to have a separate instance of global objects. The only time I've used it was when constructing an emulation environment on linux/windows, for application code that ran in an RTOS. In the RTOS each task was a process with it's own address space, in the emulation environment, each task was a thread (with a shared address space). By using TLS for things like singletons, we were able to have a separate instance for each thread, just like under the 'real' RTOS environment.
Shared memory can (obviously) give you the performance benefits of having multiple processes access the same memory, but at the cost/risk of having to synchronize the processes properly. One way to do that is have one process create a data structure in shared memory, and then send a handle to that structure via traditional inter-process communication (like a named pipe).

In my recent work with LINUX is one thing to be aware of is libraries. If you are using threads make sure any libraries you may use across threads are thread-safe. This burned me a couple of times. Notably libxml2 is not thread-safe out of the box. It can be compiled with thread safe but that is not what you get with aptitude install.

I'd have to agree with what you've been hearing. When we benchmark our cluster (xhpl and such), we always get significantly better performance with processes over threads. </anecdote>

The decision between thread/process depends a little bit on what you will be using it to.
One of the benefits with a process is that it has a PID and can be killed without also terminating the parent.
For a real world example of a web server, apache 1.3 used to only support multiple processes, but in in 2.0 they added an abstraction so that you can swtch between either. Comments seems to agree that processes are more robust but threads can give a little bit better performance (except for windows where performance for processes sucks and you only want to use threads).

For most cases i would prefer processes over threads.
threads can be useful when you have a relatively smaller task (process overhead >> time taken by each divided task unit) and there is a need of memory sharing between them. Think a large array.
Also (offtopic), note that if your CPU utilization is 100 percent or close to it, there is going to be no benefit out of multithreading or processing. (in fact it will worsen)

Threads -- > Threads shares a memory space,it is an abstraction of the CPU,it is lightweight.
Processes --> Processes have their own memory space,it is an abstraction of a computer.
To parallelise task you need to abstract a CPU.
However the advantages of using a process over a thread is security,stability while a thread uses lesser memory than process and offers lesser latency.
An example in terms of web would be chrome and firefox.
In case of Chrome each tab is a new process hence memory usage of chrome is higher than firefox ,while the security and stability provided is better than firefox.
The security here provided by chrome is better,since each tab is a new process different tab cannot snoop into the memory space of a given process.

Multi-threading is for masochists. :)
If you are concerned about an environment where you are constantly creating threads/forks, perhaps like a web server handling requests, you can pre-fork processes, hundreds if necessary. Since they are Copy on Write and use the same memory until a write occurs, it's very fast. They can all block, listening on the same socket and the first one to accept an incoming TCP connection gets to run with it. With g++ you can also assign functions and variables to be closely placed in memory (hot segments) to ensure when you do write to memory, and cause an entire page to be copied at least subsequent write activity will occur on the same page. You really have to use a profiler to verify that kind of stuff but if you are concerned about performance, you should be doing that anyway.
Development time of threaded apps is 3x to 10x times longer due to the subtle interaction on shared objects, threading "gotchas" you didn't think of, and very hard to debug because you cannot reproduce thread interaction problems at will. You may have to do all sort of performance killing checks like having invariants in all your classes that are checked before and after every function and you halt the process and load the debugger if something isn't right. Most often it's embarrassing crashes that occur during production and you have to pore through a core dump trying to figure out which threads did what. Frankly, it's not worth the headache when forking processes is just as fast and implicitly thread safe unless you explicitly share something. At least with explicit sharing you know exactly where to look if a threading style problem occurs.
If performance is that important, add another computer and load balance. For the developer cost of debugging a multi-threaded app, even one written by an experienced multi-threader, you could probably buy 4 40 core Intel motherboards with 64gigs of memory each.
That being said, there are asymmetric cases where parallel processing isn't appropriate, like, you want a foreground thread to accept user input and show button presses immediately, without waiting for some clunky back end GUI to keep up. Sexy use of threads where multiprocessing isn't geometrically appropriate. Many things like that just variables or pointers. They aren't "handles" that can be shared in a fork. You have to use threads. Even if you did fork, you'd be sharing the same resource and subject to threading style issues.

If you need to share resources, you really should use threads.
Also consider the fact that context switches between threads are much less expensive than context switches between processes.
I see no reason to explicitly go with separate processes unless you have a good reason to do so (security, proven performance tests, etc...)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string