Sandboxing threads without separate processes

In the interest of ease of programming (local function calls instead of IPC) and performance (e.g. avoiding copies of large buffers), I'd like to have a Java VM call native code using JNI instead of through interprocess communication. There would be lots of worker threads, each doing computer vision on some image and sending back a list of detected features.
I've found a few other posts about this topic:
How to implement a native code sandbox?
Linux: Is it possible to sandbox shared library code
but in all cases, the agreed upon solution is to use multiple processes.
But I would like to explore the feasibility of partially sandboxing threads. Clearly, this goes against common sense, but I think that if your client processes aren't malicious, if you can recover from faults, and if, in the worst case, you are willing to tolerate a whole-system crash once in a blue moon, it might work.
There are some hints that this is possible, such as from jmajnert in #2. You would have to capture segfaults and other crashes, and terminate and restart the crashed thread. But I also want to reset the heap of the thread. That means each thread should have a private heap, but I don't know of any common malloc implementation that lets you create multiple heaps (AIX seems to).
Then I would want to close all files opened by the thread when it gets restarted.
Also, if Java objects get compromised by the native code, would it be practical to provide some fault tolerance like recreating them?

Because of the complexity of hopping between threading models in native code and the JVM, the idea itself is really not even feasible.
To be feasible, you'd need to be within a single machine/threading model.
Let's assume you're in POSIX/ANSI C.
You'd need to write a custom allocator that allocates from pools. Each time you launched a thread, you'd allocate a new pool and set that pool as a thread-local variable that all your custom_malloc() calls would allocate from. This way, when your thread died, you could crush all of its memory along with it.
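A minimal sketch of what that could look like, assuming a simple bump allocator (individual blocks are never freed; everything goes at once when the pool is crushed). Names like pool_t, custom_malloc and pool_crush are made up for illustration:

```c
#include <pthread.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    char  *base;   /* start of the pool        */
    size_t size;   /* total pool size in bytes */
    size_t used;   /* bytes handed out so far  */
} pool_t;

/* created once at program start: pthread_key_create(&pool_key, NULL); */
static pthread_key_t pool_key;

void pool_attach(size_t size)      /* call once at thread start */
{
    pool_t *p = malloc(sizeof *p); /* error checks omitted in this sketch */
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    pthread_setspecific(pool_key, p);
}

void *custom_malloc(size_t n)
{
    pool_t *p = pthread_getspecific(pool_key);
    size_t rounded = (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (p->used + rounded > p->size)
        return NULL;                           /* pool exhausted */
    void *mem = p->base + p->used;
    p->used += rounded;
    return mem;
}

void pool_crush(void)              /* on thread death: free everything at once */
{
    pool_t *p = pthread_getspecific(pool_key);
    free(p->base);
    free(p);
}
```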
Next, you'll need to set up some niftiness with setjmp/longjmp and signal handling to catch all those segfaults etc., exit the thread, crush its memory, and restart it.
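A sketch of that recovery path, reusing the hypothetical pool allocator above. Note that siglongjmp-ing out of a SIGSEGV handler is formally undefined behavior, which is part of why this whole approach goes against common sense; this also glosses over re-arming the handler and unblocking the signal:

```c
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

/* per-thread jump buffer; the handler itself is process-wide */
static __thread sigjmp_buf recover_point;

static void on_segv(int sig)
{
    siglongjmp(recover_point, 1);   /* jump back into this thread's frame */
}

void *worker(void *arg)
{
    signal(SIGSEGV, on_segv);

    if (sigsetjmp(recover_point, 1) != 0) {
        /* we land here after a segfault: crush the pool, close fds, report */
        pool_crush();
        fprintf(stderr, "worker crashed; restarting\n");
        return NULL;                /* let a supervisor respawn the thread */
    }

    /* ... run the untrusted native code, allocating via custom_malloc() ... */
    return NULL;
}
```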
If you have objects from the "parent process" that you don't want to get corrupted, you'd have to create some custom mutexes with rollback functions that could be triggered when a thread's signal handler fires to destroy the thread.
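One hypothetical shape for such a rollback mutex; rb_mutex_t and its fields are invented for illustration, and a real version would need care around who sets and clears the holder:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    void (*rollback)(void *obj);   /* undoes a partial update of obj */
    void *obj;                     /* the shared object being protected */
    pthread_t holder;              /* who currently holds the lock */
} rb_mutex_t;

void rb_mutex_lock(rb_mutex_t *m)
{
    pthread_mutex_lock(&m->lock);
    m->holder = pthread_self();
}

/* called from the crashed thread's cleanup path (e.g. after siglongjmp):
 * if this thread died mid-update, undo its changes and release the lock */
void rb_mutex_recover(rb_mutex_t *m)
{
    if (pthread_equal(m->holder, pthread_self())) {
        m->rollback(m->obj);
        pthread_mutex_unlock(&m->lock);
    }
}
```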

Related

Why are threads implemented in kernel space slow?

When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude - maybe more - faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
Source: Modern Operating Systems (Andrew S. Tanenbaum | Herbert Bos)
The above argument is made in favor of user-level threads. The user-level implementation is depicted as the kernel managing all the processes, where each process can have its own run-time (provided by a library package) that manages all the threads within it.
Of course, merely calling a function in the run-time rather than trapping to the kernel means a few fewer instructions to execute, but why is the difference so huge?
For example, if threads are implemented in kernel space, every time a thread has to be created the program is required to make a system call. Yes. But the call only involves adding an entry to the thread table with certain attributes (which is also the case with user-space threads). When a thread switch has to happen, the kernel can simply do what the run-time (in user space) would do. The only real difference I can see here is that the kernel is involved in all this. How can the performance difference be so significant?
Threads implemented as a library package in user space perform significantly better. Why?
They're not.
The fact is that most task switches are caused by threads blocking (having to wait for IO from disk or network, or from user, or for time to pass, or for some kind of semaphore/mutex shared with a different process, or some kind of pipe/message/packet from a different process) or caused by threads unblocking (because whatever they were waiting for happened); and most reasons to block and unblock involve the kernel in some way (e.g. device drivers, networking stack, ...); so doing task switches in kernel when you're already in the kernel is faster (because it avoids the overhead of switching to user-space and back for no sane reason).
Where user-space task switching "works" is when kernel isn't involved at all. This mostly only happens when someone failed to do threads properly (e.g. they've got thousands of threads and coarse-grained locking and are constantly switching between threads due to lock contention, instead of something sensible like a "worker thread pool"). It also only works when all threads are the same priority - you don't want a situation where very important threads belonging to one process don't get CPU time because very unimportant threads belonging to a different process are hogging the CPU (but that's exactly what happens with user-space threading because one process has no idea about threads belonging to a different process).
Mostly; user-space threading is a silly broken mess. It's not faster or "significantly better"; it's worse.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude - maybe more - faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
This is talking about a situation where the CPU itself does the actual task switch (and either the kernel or a user-space library tells the CPU when to do a task switch to what). This has some relatively interesting history behind it...
In the 1980s Intel designed a CPU ("iAPX" - see https://en.wikipedia.org/wiki/Intel_iAPX_432 ) for "secure object oriented programming"; where each object has its own isolated memory segments and its own privilege level, and can transfer control directly to other objects. The general idea being that you'd have a single-tasking system consisting of global objects using cooperating flow control. This failed for multiple reasons, partly because all the protection checks ruined performance, and partly because the majority of software at the time was designed for "multi-process preemptive time sharing, with procedural programming".
When Intel designed protected mode (80286, 80386) they still had hopes for a "single-tasking system consisting of global objects using cooperative flow control". They included hardware task/object switching, a local descriptor table (so each task/object can have its own isolated segments), call gates (so tasks/objects can transfer control to each other directly), and modified a few control flow instructions (call far and jmp far) to support the new control flow. Of course this failed for the same reason iAPX failed; and (as far as I know) nobody has ever used these things for the "global objects using cooperative flow control" they were originally designed for. Some people (e.g. very early Linux) did try to use the hardware task switching for more traditional "multi-process preemptive time sharing, with procedural programming" systems; but found that it was slow, because the hardware task switch did too many protection checks that software task switching could avoid, saved/reloaded too much state that software task switching could avoid, and didn't do any of the other work needed for a task switch (e.g. keeping statistics of CPU time used, saving/restoring debug registers, etc.).
Now... Andrew S. Tanenbaum is a micro-kernel advocate. His ideal system consists of isolated pieces in user-space (processes, services, drivers, ...) communicating via synchronous messaging. In practice (ignoring superficial differences in terminology) this "isolated pieces in user-space communicating via synchronous messaging" is almost entirely identical to Intel's twice-failed "global objects using cooperative flow control".
Mostly; in theory (if you ignore all the practical problems, like CPU not saving all of the state, and wanting to do extra work on task switches like tracking statistics), for a specific type of OS that Andrew S. Tanenbaum prefers (micro-kernel with synchronous message passing, without any thread priorities), it's plausible that the paragraph quoted above is more than just wishful thinking.
I think answering this takes a fair amount of OS and parallel/distributed computing knowledge (and I am not sure about the answer, but I will try my best).
So if you think about it, the library package can perform better than an implementation inside the kernel itself. With the library approach, the switch code runs to completion in one go, whereas inside the kernel other interrupts can arrive in the middle. Also, going to the kernel for every thread operation is hard on it, since each operation means another trap. I hope this gives a better view.
It's not correct to say that user-space threads are better than kernel-space threads, since each one has its own pros and cons.
In terms of user-space threads: as the application is responsible for managing them, they are easier to implement, and they have little reliance on the OS. However, you are not able to use the advantages of multiprocessing.
In contrast, kernel-space threads are handled by the OS, so you need to implement them according to the OS you use, which makes it a more complicated task. However, you have more control over your threads.

Why do we need semaphores on a single CPU?

I have read that we use semaphores inside the Linux kernel, and I have read that semaphores have advantages even on a single CPU (where only one process/thread can run at a time). Can anyone please give me an example of a problem that a semaphore solves (inside the kernel)?
In my view, there can be a problem only if we have more than one CPU, because two processes may call system calls that use the same data structure and probably cause problems.
Thank you for your help!
You don't really need more than one CPU for concurrency. The multiple CPUs are really "an implementation detail," a piece of hardware quirkiness that you can abstract away from. Concurrency is a logical property of programs. You can have concurrency without multiple CPUs, and use multiple CPUs without "real concurrency".
Consider a web server. It has to be "concurrent," in the sense that it must serve multiple clients at once, hold information about multiple connections at once, and process multiple requests at once. You can have it literally do this, by having multiple CPUs all working at the same time. Yet the program only has to appear to do multiple things at once. It could just as well be running on one CPU and context-switching to fairly service all the work put to it. The fact that a web server does multiple things at once is part of its interface: the I/O for the connections is interleaved, if a request has exclusively locked a resource, another request won't start trying to manipulate that same resource, etc. Writing a web server without concurrency produces a program that is wrong.
Semaphores help you with concurrency, by letting you control the way processes access resources. You asked, if you had one process running, how another could run at the same time with only a single core. Well, as I said, concurrency doesn't need multiple cores. The first process can be paused, and the second one started while the first one is still unfinished. This is just an implementation detail; logically, to the program writer, the two processes are running simultaneously, whether there are multiple cores or not. If the program was written without semaphores (or had broken concurrency in some other way), it would be wrong, even on a single core. Physically, this will be because context switching can abruptly pause one computation and start another at any time, and, without semaphores, the newly live thread won't know what resources it can and cannot access. Logically, this will be because the processes are running simultaneously, once you abstract yourself away from the implementation, and, in general, processes running simultaneously can walk over each other if not properly synchronized.
For an example applicable to an OS kernel, consider that every process is logically running concurrently with every other process. A kernel provides the implementation that makes this concurrency work. A resource that two processes may want simultaneously is a hard drive. A semaphore might be used in the kernel to track whether a given drive is currently busy with a read or write. A process trying to read or write to the same disk will ask the kernel to do so, and the kernel can check the semaphore to see that the disk is still busy and force the offending process to wait. Now, an operating system does count as low level code, so in some places, yes, you might want to omit some otherwise vital concurrency safeguards when running on a single CPU, because your job is to handle such implementation details, but higher level parts may still use them.
In contrast, consider a number-crunching program. Let's say it's processing each element of a huge array of data into an equal-sized array of modified data (a functional map operation). It can use multiple CPUs to do this more quickly, but it can also work one CPU. The observable behavior of the program is the same, and you never get any idea that it's doing multiple things at once from its behavior. Numbers go in, numbers come out, who cares what happens in the middle? Writing such a program without the ability to do multiple things at once does not produce a logically incorrect program, just a slow one. Such a program probably does not need semaphores when running on a single CPU, because it didn't need concurrency in the first place.
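A minimal illustration in C of the single-CPU case: the increment below is a load-add-store sequence that a context switch can split at any point, so without the semaphore two threads lose updates even on one core (compile with -pthread):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static long counter = 0;
static sem_t sem;               /* binary semaphore guarding counter */

static void *bump(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        sem_wait(&sem);         /* remove this pair and the total is often short */
        counter++;              /* load, add, store: can be preempted mid-way */
        sem_post(&sem);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    sem_init(&sem, 0, 1);       /* initial value 1 => mutual exclusion */
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);   /* prints 2000000 with the semaphore in place */
    return 0;
}
```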

Necessity of gracefully ending a thread

If I am building a multithreaded application, all its threads would automatically get killed when I abort the application.
If I want a thread to have a lifetime equal to that of the main thread, do I really need to gracefully end the thread, or let the application abort take care of killing it?
Edit: As threading rules depend on the OS, I'd like to hear opinions for the following too:
Android
Linux
iOS
It depends on what the thread is doing.
When a thread is killed, its execution stops at any point in the code, meaning some operations may not be finished, like
writing a file
sending network messages
But the OS will
close all handles the application owns
release any locks
free all memory
close any open file
etc...
So, as long as you can make sure that all your files etc. are in a consistent state, you don't have to worry about the system resources.
I know this is true for Windows, and I would be very surprised if it were different on other OSes. The time when an application that didn't release all its resources could affect the entire system is long gone, fortunately.
No. With most non-trivial OS, you do not need to explicitly/gracefully terminate app-lifetime threads unless there is a specific and overriding need to do so.
Just one reason is that you cannot always actually do it with user code. User-level code cannot stop a thread that is running on a different core from the thread requesting the stop. The OS can, and does.
Your Linux/Windows OS is very good indeed at stopping threads in any state on any core and releasing resources like thread stacks, heaps, OS object handles/fds etc. at process termination. It's had millions of hours of testing on systems world-wide, something that your own user code is very unlikely to ever experience. If you can do so, you should let the OS do what it's good at.
In other posts, several cases have been made where user-level termination of a thread may be unavoidable. Inter-process comms is one area, as are DB connections/transactions. If you are forced into it by your requirements, then fine, go for it but, otherwise, don't try - it's a waste of time and effort writing/testing/debugging thread-stop code to do what the OS can do effectively on its own.
Beware of premature stoptimization.

"Multi-process" vs. "single-process multi-threading" for software modules communicating via messaging

We need to build a software framework (or middleware) that will enable messaging between different software components (or modules) running on a single machine. This framework will provide such features:
Communication between modules are through 'messaging'.
Each module will have its own message queue and message handler thread that will synchronously handle each incoming message.
With the above requirements, which of the following approaches is the correct one (and why)?
Implementing modules as processes, and messaging through shared memory
Implementing modules as threads in a single process, and messaging by pushing message objects to the destination module's message queue.
Of course, there are some apparent pros & cons:
In Option-2, if one module causes a segmentation fault, the process (and thus the whole application) will crash. And one module can access/mutate another module's memory directly, which can lead to difficult-to-debug runtime errors.
But with Option-1, you need to handle the states where a module you need to communicate with has just crashed. If there are N modules in the software, there can be 2^N alive/crashed states of the system that affect the algorithms running on the modules.
Again in Option-1, sender cannot assume that the receiver has received the message, because it might have crashed at that moment. (But the system can alert all the modules that a particular module has crashed; that way, sender can conclude that the receiver will not be able to handle the message, even though it has successfully received it)
I am in favor of Option-2, but I am not sure whether my arguments are solid enough or not. What are your opinions?
EDIT: Upon requests for clarification, here are more specification details:
This is an embedded application that is going to run on Linux OS.
Unfortunately, I cannot tell you about the project itself, but I can say that there are multiple components of the project, each component will be developed by its own team (of 3-4 people), and it has been decided that the communication between these components/modules is through some kind of messaging framework.
C/C++ will be used as programming language.
What the 'Module Interface API' will automatically provide to the developers of a module: (1) a message/event handler thread loop, (2) a synchronous message queue, (3) a function-pointer member variable where you can set your message handler function.
Here is what I could come up with:
Multi-process(1) vs. Single-process, multi-threaded(2):
Impact of segmentation faults: In (2), if one module causes a segmentation fault, the whole application crashes. In (1), modules have different memory regions, and thus only the module that caused the segmentation fault will crash.
Message delivery guarantee: In (2), you can assume that message delivery is guaranteed. In (1), the receiving module may crash before receiving the message or while handling it.
Sharing memory between modules: In (2), the whole memory is shared by all modules, so you can directly send message objects. In (1), you need to use 'Shared Memory' between modules.
Messaging implementation: In (2), you can send message objects between modules; in (1), you need to use a network socket, a Unix socket, a pipe, or message objects stored in Shared Memory. For the sake of efficiency, storing message objects in Shared Memory seems to be the best choice.
Pointer usage between modules: In (2), you can use pointers in your message objects. The ownership of heap objects (accessed by pointers in the messages) can be transferred to the receiving module. In (1), you need to manually manage the memory (with custom malloc/free functions) in the 'Shared Memory' region.
Module management: In (2), you are managing just one process. In (1), you need to manage a pool of processes each representing one module.
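To make Option-2 concrete, here is a sketch of the kind of per-module queue described above, in C with pthreads; msg_t, mq_push and mq_pop are invented names:

```c
#include <pthread.h>

typedef struct msg { struct msg *next; int type; void *payload; } msg_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    msg_t *head, *tail;
} msg_queue_t;

void mq_push(msg_queue_t *q, msg_t *m)     /* called by the sending module */
{
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

msg_t *mq_pop(msg_queue_t *q)              /* called by the module's handler thread */
{
    pthread_mutex_lock(&q->lock);
    while (!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    msg_t *m = q->head;
    q->head = m->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return m;
}
```

The module's handler thread would then just loop, popping one message at a time and passing it to the registered handler function.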
Sounds like you're implementing Communicating Sequential Processes. Excellent!
Tackling threads vs processes first, I would stick to threads; the context switch times are faster (especially on Windows where process context switches are quite slow).
Second, shared memory vs a message queue; if you're doing full synchronous message passing it'll make no difference to performance. The shared memory approach involves a shared buffer that gets copied to by the sender and copied from by the reader. That's the same amount of work as is required for a message queue. So for simplicity's sake I would stick with the message queue.
In fact, you might like to consider using a pipe instead of a message queue. You have to write code to make the pipe synchronous (they're normally asynchronous, which would be the Actor Model; message queues can often be set to zero length, which does what you want for it to be synchronous and properly CSP), but then you could just as easily use a socket instead. Your program can then become multi-machine distributed should the need arise, but you've not had to change the architecture at all. Also, named pipes between processes are an equivalent option, so on platforms where process context switch times are good (e.g. Linux) the whole thread vs process question goes away. So working a bit harder to use a pipe gives you very significant scalability options.
Regarding crashing; if you go the multiprocess route and you want to be able to gracefully handle the failure of a process you're going to have to do a bit of work. Essentially you will need a thread at each end of the messaging channel simply to monitor the responsiveness of the other end (perhaps by bouncing a keep-awake message back and forth between themselves). These threads need to feed status info into their corresponding main thread to tell it when the other end has failed to send a keep-awake on schedule. The main thread can then act accordingly. When I did this I had the monitor thread automatically reconnect as and when it could (e.g. the remote process has come back to life), and tell the main thread that too. This means that bits of my system can come and go and the rest of it just copes nicely.
Finally, your actual application processes will end up as a loop, with something like select() at the top to wait for message inputs from all the different channels (and monitor threads) that it is expecting to hear from.
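A sketch of that loop, assuming each channel is represented by a readable file descriptor (e.g. one end of a pipe); the dispatch step is left as a placeholder:

```c
#include <sys/select.h>
#include <unistd.h>

void event_loop(int fds[], int nfds)
{
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        int maxfd = -1;
        for (int i = 0; i < nfds; i++) {
            FD_SET(fds[i], &readable);
            if (fds[i] > maxfd) maxfd = fds[i];
        }
        /* block until any channel has data */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            continue;                       /* interrupted by a signal; retry */
        for (int i = 0; i < nfds; i++) {
            if (FD_ISSET(fds[i], &readable)) {
                char buf[4096];
                ssize_t n = read(fds[i], buf, sizeof buf);
                if (n > 0) { /* dispatch the message to its handler */ }
            }
        }
    }
}
```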
By the way, this sort of thing is frustratingly hard to implement in Windows. There's just no proper equivalent of select() anywhere in any Microsoft language. There is a select() for sockets, but you can't use it on pipes, etc. like you can in Unix. The Cygwin guys had real problems implementing their version of select(). I think they ended up with a polling thread per file descriptor; massively inefficient.
Good luck!
Your question lacks a description of how the "modules" are implemented and what they do, and possibly a description of the environment in which you are planning to implement all of this.
For example:
If the modules themselves have some requirements which makes them hard to implement as threads (e.g. they use non-thread-safe 3rd party libraries, have global variables, etc.), your message delivery system will also not be implementable with threads.
If you are using an environment such as Python which does not handle thread parallelism very well (because of its global interpreter lock), and running on Linux, you will not gain any performance benefits with threads over processes.
There are more things to consider. If you are just passing data between modules, who says your system needs to use either multiple threads or multiple processes? There are other architectures which do the same thing without either of them, such as event-driven with callbacks (a message receiver can register a callback with your system, which is invoked when a message generator generates a message). This approach will be absolutely the fastest in any case where parallelism isn't important and where receiving code can be invoked in the execution context of the caller.
tl;dr: you have only scratched the surface with your question :)

Threads vs Processes in Linux [closed]

I've recently heard a few people say that in Linux, it is almost always better to use processes instead of threads, since Linux is very efficient in handling processes, and because there are so many problems (such as locking) associated with threads. However, I am suspicious, because it seems like threads could give a pretty big performance gain in some situations.
So my question is, when faced with a situation that threads and processes could both handle pretty well, should I use processes or threads? For example, if I were writing a web server, should I use processes or threads (or a combination)?
Linux uses a 1-1 threading model, with (to the kernel) no distinction between processes and threads -- everything is simply a runnable task. *
On Linux, the system call clone clones a task, with a configurable level of sharing, among which are:
CLONE_FILES: share the same file descriptor table (instead of creating a copy)
CLONE_PARENT: don't set up a parent-child relationship between the new task and the old (otherwise, child's getppid() = parent's getpid())
CLONE_VM: share the same memory space (instead of creating a COW copy)
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing). **
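A small illustration of that sharing dial, using glibc's clone() wrapper; the particular flag combination here is just an example, and you can add or remove flags to move between "thread-like" and "process-like":

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_value;              /* visible to the child iff CLONE_VM is set */

static int task_main(void *arg)
{
    shared_value = 42;                /* with CLONE_VM this writes the parent's memory;
                                         without it, only a copy-on-write copy changes */
    return 0;
}

int main(void)
{
    char *stack = malloc(64 * 1024);  /* child stack; grows downward on x86 */
    pid_t pid = clone(task_main, stack + 64 * 1024,
                      CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);
    /* shared_value is 42 here because CLONE_VM made the address space common */
    free(stack);
    return shared_value == 42 ? 0 : 1;
}
```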
forking costs a tiny bit more than pthread_create-ing because of copying tables and creating COW mappings for memory, but the Linux kernel developers have tried (and succeeded) at minimizing those costs.
Switching between tasks, if they share the same memory space and various tables, will be a tiny bit cheaper than if they aren't shared, because the data may already be loaded in cache. However, switching tasks is still very fast even if nothing is shared -- this is something else that Linux kernel developers try to ensure (and succeed at ensuring).
In fact, if you are on a multi-processor system, not sharing may actually be beneficial to performance: if each task is running on a different processor, synchronizing shared memory is expensive.
* Simplified. CLONE_THREAD causes signal delivery to be shared (which requires CLONE_SIGHAND, which shares the signal handler table).
** Simplified. There exist both SYS_fork and SYS_clone syscalls, but in the kernel, the sys_fork and sys_clone are both very thin wrappers around the same do_fork function, which itself is a thin wrapper around copy_process. Yes, the terms process, thread, and task are used rather interchangeably in the Linux kernel...
Linux (and indeed Unix) gives you a third option.
Option 1 - processes
Create a standalone executable which handles some part (or all parts) of your application, and invoke it separately for each process, e.g. the program runs copies of itself to delegate tasks to.
Option 2 - threads
Create a standalone executable which starts up with a single thread and create additional threads to do some tasks
Option 3 - fork
Only available under Linux/Unix, this is a bit different. A forked process really is its own process with its own address space - there is nothing that the child can do (normally) to affect its parent's or siblings' address space (unlike a thread) - so you get added robustness.
However, the memory pages are not copied, they are copy-on-write, so less memory is usually used than you might imagine.
Consider a web server program which consists of two steps:
Read configuration and runtime data
Serve page requests
If you used threads, step 1 would be done once, and step 2 done in multiple threads. If you used "traditional" processes, steps 1 and 2 would need to be repeated for each process, and the memory to store the configuration and runtime data duplicated. If you used fork(), then you can do step 1 once, and then fork(), leaving the runtime data and configuration in memory, untouched, not copied.
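A sketch of that fork()-after-setup pattern; load_config and serve_requests are hypothetical placeholders for steps 1 and 2:

```c
#include <unistd.h>

extern void load_config(void);        /* hypothetical step-1 setup      */
extern void serve_requests(int fd);   /* hypothetical step-2 serve loop */

int run_server(int listen_fd, int nworkers)
{
    load_config();                    /* done once, before forking */
    for (int i = 0; i < nworkers; i++) {
        if (fork() == 0) {            /* child: config pages stay shared (COW)
                                         until somebody writes to them */
            serve_requests(listen_fd);
            _exit(0);
        }
    }
    /* parent: wait on children, respawn on crash, etc. */
    return 0;
}
```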
So there are really three choices.
That depends on a lot of factors. Processes are more heavy-weight than threads, and have a higher startup and shutdown cost. Interprocess communication (IPC) is also harder and slower than interthread communication.
Conversely, processes are safer and more secure than threads, because each process runs in its own virtual address space. If one process crashes or has a buffer overrun, it does not affect any other process at all, whereas if a thread crashes, it takes down all of the other threads in the process, and if a thread has a buffer overrun, it opens up a security hole in all of the threads.
So, if your application's modules can run mostly independently with little communication, you should probably use processes if you can afford the startup and shutdown costs. The performance hit of IPC will be minimal, and you'll be slightly safer against bugs and security holes. If you need every bit of performance you can get or have a lot of shared data (such as complex data structures), go with threads.
Others have discussed the considerations.
Perhaps the important difference is that in Windows processes are heavy and expensive compared to threads, and in Linux the difference is much smaller, so the equation balances at a different point.
Once upon a time there was Unix, and in this good old Unix there was lots of overhead for processes. So what some clever people did was to create threads, which share the same address space as the parent process and need only a reduced context switch, which makes the context switch more efficient.
In a contemporary Linux (2.6.x) there is not much difference in performance between a context switch of a process compared to a thread (only the MMU work is additional for the process switch).
There is the issue with the shared address space, which means that a faulty pointer in a thread can corrupt memory of the parent process or another thread within the same address space.
A process is protected by the MMU, so a faulty pointer will just cause a signal 11 (SIGSEGV) and no corruption.
I would in general use processes (not much context-switch overhead in Linux, but memory protection thanks to the MMU), but pthreads if I needed a real-time scheduler class, which is a different cup of tea altogether.
Why do you think threads have such a big performance gain on Linux? Do you have any data for this, or is it just a myth?
I think everyone has done a great job responding to your question. I'm just adding more information about thread versus process in Linux to clarify and summarize some of the previous responses in the context of the kernel. So, my response is regarding kernel-specific code in Linux. According to the Linux kernel documentation, there is no clear distinction between a thread and a process, except that a thread uses a shared virtual address space, unlike a process. Also note that the Linux kernel uses the term "task" to refer to processes and threads in general.
"There are no internal structures implementing processes or threads, instead there is a struct task_struct that describe an abstract scheduling unit called task"
Also, according to Linus Torvalds, you should NOT think about process versus thread at all, because it's too limiting: the only difference is the COE, or Context of Execution, in terms of whether the address space is separate from the parent's or shared with it. In fact, he uses a web server example to make his point here (which I highly recommend reading).
Full credit to linux kernel documentation
If you want to create as pure a process as possible, you would use clone() and clear all the sharing flags. (Or save yourself the typing effort and call fork().)
If you want to create as pure a thread as possible, you would use clone() and set all the sharing flags. (Or save yourself the typing effort and call pthread_create().)
There are 28 flags that dictate the level of resource sharing. This means that there are over 268 million flavours of tasks that you can create, depending on what you want to share.
This is what we mean when we say that Linux does not distinguish between a process and a thread, but rather refers to any flow of control within a program as a task. The rationale for not distinguishing between the two is, well, not uniquely defining over 268 million flavours!
Therefore, making the "perfect decision" of whether to use a process or thread is really about deciding which of the 28 resources to clone.
How tightly coupled are your tasks?
If they can live independently of each other, then use processes. If they rely on each other, then use threads. That way you can kill and restart a bad process without interfering with the operation of the other tasks.
To complicate matters further, there is such a thing as thread-local storage, and Unix shared memory.
Thread-local storage allows each thread to have a separate instance of global objects. The only time I've used it was when constructing an emulation environment on Linux/Windows, for application code that ran in an RTOS. In the RTOS each task was a process with its own address space; in the emulation environment, each task was a thread (with a shared address space). By using TLS for things like singletons, we were able to have a separate instance for each thread, just like under the 'real' RTOS environment.
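A minimal example of the idea, using GCC/Clang's __thread keyword (C11 spells it _Thread_local); each thread gets its own copy of the "global":

```c
#include <pthread.h>
#include <stdio.h>

static __thread int task_id;          /* one copy per thread */

static void *task(void *arg)
{
    task_id = (int)(long)arg;         /* writes don't disturb other threads */
    printf("task_id = %d\n", task_id);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, (void *)1L);
    pthread_create(&t2, NULL, task, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```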
Shared memory can (obviously) give you the performance benefits of having multiple processes access the same memory, but at the cost/risk of having to synchronize the processes properly. One way to do that is have one process create a data structure in shared memory, and then send a handle to that structure via traditional inter-process communication (like a named pipe).
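For example, a sketch using POSIX shm_open/mmap; the name "/demo_shm" is arbitrary, and real code would add synchronization (e.g. a PTHREAD_PROCESS_SHARED mutex or a semaphore) plus error reporting:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* each process calls this and gets a pointer to the same int
 * (older glibc needs linking with -lrt for shm_open) */
int *map_shared_counter(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, sizeof(int)) < 0) { close(fd); return NULL; }
    int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```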
In my recent work with Linux, one thing to be aware of is libraries. If you are using threads, make sure any libraries you use across threads are thread-safe. This burned me a couple of times. Notably, libxml2 is not thread-safe out of the box. It can be compiled to be thread-safe, but that is not what you get with aptitude install.
I'd have to agree with what you've been hearing. When we benchmark our cluster (xhpl and such), we always get significantly better performance with processes over threads. </anecdote>
The decision between thread and process depends a little bit on what you will be using it for.
One of the benefits with a process is that it has a PID and can be killed without also terminating the parent.
For a real-world example of a web server: Apache 1.3 used to only support multiple processes, but in 2.0 they added an abstraction so that you can switch between the two. Comments seem to agree that processes are more robust, but threads can give a little better performance (except on Windows, where performance for processes sucks and you only want to use threads).
For most cases I would prefer processes over threads.
Threads can be useful when you have a relatively small task (process overhead >> time taken by each divided task unit) and there is a need for memory sharing between them. Think of a large array.
Also (off-topic), note that if your CPU utilization is at or near 100 percent, there is going to be no benefit from multithreading or multiprocessing (in fact, it will make things worse).
Threads --> threads share a memory space; a thread is an abstraction of the CPU; it is lightweight.
Processes --> processes have their own memory space; a process is an abstraction of a computer.
To parallelise a task you need to abstract a CPU.
However, the advantages of using a process over a thread are security and stability, while a thread uses less memory than a process and offers lower latency.
An example in terms of the web would be Chrome and Firefox.
In the case of Chrome, each tab is a new process, hence the memory usage of Chrome is higher than Firefox's, while the security and stability provided are better than Firefox's.
The security provided by Chrome is better: since each tab is a new process, different tabs cannot snoop into the memory space of a given process.
Multi-threading is for masochists. :)
If you are concerned about an environment where you are constantly creating threads/forks, perhaps like a web server handling requests, you can pre-fork processes, hundreds if necessary. Since they are copy-on-write and use the same memory until a write occurs, it's very fast. They can all block, listening on the same socket, and the first one to accept an incoming TCP connection gets to run with it. With g++ you can also assign functions and variables to be closely placed in memory (hot segments) to ensure that when you do write to memory and cause an entire page to be copied, at least the subsequent write activity will occur on the same page. You really have to use a profiler to verify that kind of stuff, but if you are concerned about performance, you should be doing that anyway.
Development time of threaded apps is 3x to 10x longer due to the subtle interactions on shared objects, the threading "gotchas" you didn't think of, and problems that are very hard to debug because you cannot reproduce thread-interaction bugs at will. You may have to do all sorts of performance-killing checks, like having invariants in all your classes that are checked before and after every function, and halting the process and loading the debugger if something isn't right. Most often it's embarrassing crashes that occur during production, and you have to pore through a core dump trying to figure out which threads did what. Frankly, it's not worth the headache when forking processes is just as fast and implicitly thread-safe unless you explicitly share something. At least with explicit sharing you know exactly where to look if a threading-style problem occurs.
If performance is that important, add another computer and load balance. For the developer cost of debugging a multi-threaded app, even one written by an experienced multi-threader, you could probably buy four 40-core Intel motherboards with 64 GB of memory each.
That being said, there are asymmetric cases where parallel processing isn't appropriate, like wanting a foreground thread to accept user input and show button presses immediately, without waiting for some clunky back-end GUI to keep up - a sexy use of threads where multiprocessing isn't geometrically appropriate. Many things like that are just variables or pointers; they aren't "handles" that can be shared in a fork, so you have to use threads. Even if you did fork, you'd be sharing the same resource and be subject to threading-style issues.
If you need to share resources, you really should use threads.
Also consider the fact that context switches between threads are much less expensive than context switches between processes.
I see no reason to explicitly go with separate processes unless you have a good reason to do so (security, proven performance tests, etc...)
