How can I use threads / multithreading with mod_perl?

The code below works fine on mod_cgi but not on mod_perl. The code just crashes upon creation of the first thread. Any thoughts on how to implement threads on mod_perl? Is it really possible? What am I missing?
use strict;
use warnings;
use threads;

print "Content-type: text/html\n\n";

sub testthread {
    my $value = shift;
    print "<br>test - $value";
}

my @threads = ();
push(@threads, threads->new(\&testthread, 1));
push(@threads, threads->new(\&testthread, 2));

foreach (@threads) {
    $_->join;
}
exit;
There is no error message given. My web page / script just stops and gives this error to the browser:
The www.abcd.com page isn’t working
www.abcd.com didn’t send any data.
What's weird is if I execute the script via command line it works fine.

So threads really do not work with mod_perl?
I have not investigated whether threads work or do not work with mod_perl2. Keep in mind that the point of mod_perl is to run a set of persistent Perl interpreters within the Apache process.
Also keep in mind that Apache can kill CGI scripts that are taking too long to respond. Now add to that mix a set of threads managed by a CGI script that can die at any time, all inside some persistent Perl interpreter within Apache, and I hope it is easy to see why this is asking for trouble rather than a great idea. This would never occur to me as a solution to any problem I have encountered in CGI programming.
What you can and cannot do will depend greatly on the MPM you choose and on how the httpd and perl you are using were compiled, etc.
You need to explain why you think you need threads. What is the problem you are trying to solve, and how will using threads within a CGI script run under mod_perl help you solve it?
See mod_perl documentation:
Threads Support
In order to adapt to the Apache 2.0 threads architecture (for threaded MPMs), mod_perl 2.0 needs to use thread-safe Perl interpreters, also known as "ithreads" (Interpreter Threads). This mechanism can be enabled at compile time and ensures that each Perl interpreter uses its private PerlInterpreter structure for storing its symbol tables, stacks and other Perl runtime mechanisms. When this separation is engaged any number of threads in the same process can safely perform concurrent callbacks into Perl. This of course requires each thread to have its own PerlInterpreter object, or at least that each instance is only accessed by one thread at any given time.
The first mod_perl generation has only a single PerlInterpreter, which is constructed by the parent process, then inherited across the forks to child processes. mod_perl 2.0 has a configurable number of PerlInterpreters and two classes of interpreters, parent and clone. A parent is like that in mod_perl 1.0, where the main interpreter created at startup time compiles any pre-loaded Perl code. A clone is created from the parent using the Perl API perl_clone() function. At request time, parent interpreters are only used for making more clones, as the clones are the interpreters which actually handle requests. Care is taken by Perl to copy only mutable data, which means that no runtime locking is required and read-only data such as the syntax tree is shared from the parent, which should reduce the overall mod_perl memory footprint.
Rather than create a PerlInterpreter per-thread by default, mod_perl creates a pool of interpreters. The pool mechanism helps cut down memory usage a great deal. As already mentioned, the syntax tree is shared between all cloned interpreters. If your server is serving more than mod_perl requests, having a smaller number of PerlInterpreters than the number of threads will clearly cut down on memory usage.
Finally and perhaps the biggest win is memory re-use: as calls are made into Perl subroutines, memory allocations are made for variables when they are used for the first time. Subsequent use of variables may allocate more memory, e.g. if a scalar variable needs to hold a longer string than it did before, or an array has new elements added. As an optimization, Perl hangs onto these allocations, even though their values "go out of scope".
mod_perl 2.0 has a much better control over which PerlInterpreters are used for incoming requests. The interpreters are stored in two linked lists, one for available interpreters and another for busy ones. When needed to handle a request, one interpreter is taken from the head of the available list and put back into the head of the same list when done. This means if for example you have 10 interpreters configured to be cloned at startup time, but no more than 5 are ever used concurrently, those 5 continue to reuse Perl's allocations, while the other 5 remain much smaller, but ready to go if the need arises.
Various attributes of the pools are configurable using threads mode specific directives.
The interpreters pool mechanism has been abstracted into an API known as "tipool", Thread Item Pool. This pool can be used to manage any data structure, in which you wish to have a smaller number than the number of configured threads. For example a replacement for Apache::DBI based on the tipool will allow to reuse database connections between multiple threads of the same process.
Thread-environment Issues
While mod_perl itself is thread-safe, you may have issues with the thread-safety of your code. For more information refer to Threads Coding Issues Under mod_perl.
Another issue is that "global" variables are only global to the interpreter in which they are created. It's possible to share variables between several threads running in the same process. For more information see: Shared Variables.
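For example, here is a minimal sketch of sharing a counter and a status hash across threads with threads::shared (the variable names are just illustrative):

use strict;
use warnings;
use threads;
use threads::shared;

# Variables must be marked :shared; otherwise every thread works on its own copy.
my $counter :shared = 0;
my %status  :shared;

for (1 .. 4) {
    threads->create(sub {
        {
            lock($counter);              # serialize access to the shared scalar
            $counter++;
        }
        $status{ threads->tid() } = 'done';
    });
}

$_->join for threads->list();
print "counter = $counter\n";            # prints 4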

We run mod_perl with the Apache event Multi-Processing Module (MPM) and Apache::DBI with MySQL 5.7 to handle high surges of users logging in and starting quizzes/exams.
We have Apache 2.4 configured to start with 50 children, and each child runs 100 threads, for a total of 5,000 threads available for requests. Because mod_perl also runs multi-threaded and is embedded in Apache (at startup), memory consumption is minimal compared to other languages.
We set Apache to always have 3,500 threads available and to max out at 200 children (20,000 threads). Children recycle after handling 10,000 requests (in case of any memory leakage). When Apache recycles its logs every night, Apache gets restarted along with mod_perl and all database connections and starts with the 5,000 threads available again.
With mod_perl we start with 50 interpreters to match the Apache children, and mod_perl recycles its interpreters as well.
In the virtual host section we have:
PerlInterpStart 50
PerlInterpMaxSpares 50
PerlInterpMax 100
Real simplicity, very high scalability and stability, and low-cost hardware. And it runs well on Windows too. If your stuff doesn't run multi-threaded, it's limited to each request requiring a child (a 1:1 ratio) instead of a thread (a 1:100 ratio). Each child requires an instance of your program to run (other than Java?), so memory consumption is very high versus using mod_perl. We know people say it's old, but it runs well, it's stable, it's supported on probably more operating systems than any other language, and it just works.

Related

GC not able to collect back memory using fork-emulation on Windows

Let me begin by saying I do not have in depth knowledge of Perl so please pardon me if there is something obvious that I have missed :)
In the system (running in Windows environment) that I am looking at, we have a perl process which has to download ~5000-6000 files. Since each file can be independently downloaded, we forked separate threads for each file. The thread is supposed to download the file and die. On running the process, I noticed that the memory of the process goes up to ~1.7 GB and then dies due to the memory limit of each process.
On searching and asking a few people, I came across this concept of circular referencing due to which the garbage collector will not free up memory. I searched a bit and found the Devel-Cycle package which can find out if there are any cycles in the object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I got to know that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process still continues to run fine till it reaches the memory limit). I guess there might be more db handles from the parent in the object that are not used by the child but are still referenced.
Questions that I have -
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow up in memory (in other words is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHandle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (due to the Perl fork call on Windows) are spawned at the same time. It spawns a maximum of n threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another thread. At the moment n is set to 10; however, I had changed n to 1 (so only one extra thread runs at a time), and I still hit the memory limit.
Edit: Turns out this does not solve the OP's problem. It might still be helpful for a future reader.
We do not really know a lot about your situation, and your program sounds quite complex to me to just fork 6000 times. But I will still attempt to answer; please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note that Windows has no fork() system call. And as you specifically note that you "fork", I just assume that you actually use that Perl command. On Windows, Perl will try to emulate fork() as best as it can, but what that basically means is that all the forked processes you see are in fact just threads within the original process, just pretending to be processes to you. To do this, they copy the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. Especially the following part seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork so many pseudo-processes, you need a lot of memory, as the interpreter state has to be copied that many times. And depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7GB you mentioned is not far away from the 2GB that some Windows versions impose on you as the memory limit for a single process.
My wild guess would be that you in fact just hit that limit by spawning all those many threads, each with its own copy of the interpreter state and everything.
You will probably be a lot better off using some threading library instead of asking Perl to emulate individual processes for you. Needless to mention (I hope) that you do not really gain any advantage by having 6000 threads over, let's say, 16. If you try to have all of them do something at the same time, you will in fact most likely experience slowdowns, depending on how the threading is handled.
In addition to the comments already provided, I want to emphasize the point DeVadder made regarding the behavior of fork in Windows, and that Perl threading is likely a better solution. But are you sure that the DBD module is safe to be used by multiple processes / forks / threads, etc. without setting some extra parameters?
I had a similar error when using the DBI module to access a SQLite DB in multi-processed code using the threads module. It was solved by setting the 'use_immediate_transaction' option for the database handle provided by DBI to 1. If you aren't familiar with how Perl threads work: they aren't lightweight threads; they create a copy of the interpreter and everything you have in memory at the time of their creation. Yet even when I created the database handle separately in each "thread", I would get 'database locked' and various other errors. Without some of these extra options, DBD may not function correctly in a multiprocessed environment.
Also, why make 6000 forks? Use Thread::Queue and the threads module, make a worker pool of a few workers (one per core?), and recycle the workers. You are paying a lot of overhead for every fork for no gain.
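To make that concrete, here is a rough sketch of such a worker pool with Thread::Queue, assuming the job is "download a URL to a file" as in the question; the job list is a placeholder, and the commented-out DBI connect stands in for whatever per-worker database handle you need (handles must be created inside the thread, never shared):

use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my @downloads = ( [ 'http://example.com/a.dat' => 'a.dat' ] );   # placeholder job list
my $workers   = 10;                                              # roughly one per core, not one per file

my $q = Thread::Queue->new();

my @pool = map {
    threads->create(sub {
        my $ua = LWP::UserAgent->new;
        # my $dbh = DBI->connect($dsn, $user, $pass);   # if needed: connect per thread ($dsn etc. are placeholders)
        while (defined(my $job = $q->dequeue())) {
            my ($url, $file) = @$job;
            my $res = $ua->get($url, ':content_file' => $file);
            warn "$url: " . $res->status_line unless $res->is_success;
        }
    });
} 1 .. $workers;

$q->enqueue($_) for @downloads;
$q->enqueue(undef) for 1 .. $workers;    # one stop marker per worker
$_->join for @pool;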

Is it bad practice to just kick off new threads for blocking operations (Perl)

If doing CPU-intensive tasks, I believe it is optimal to have one thread per core. If you have a 4-core CPU you can run 4 instances of a CPU-intensive subroutine without any penalty. For example, I once experimentally ran four instances of a CPU-intensive algorithm on a four-core CPU. Up to four instances, the time per instance did not increase. With a fifth instance, all instances took longer.
What is the case for blocking operations? Let's say I have a list of 1,000 URLs. I have been doing the following:
(Please don't mind any syntax errors, I just mocked this up)
use threads;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my @threads;
foreach my $url (@urlList) {
    push @threads, async {
        my $response = $ua->get($url);
        return $response->content;
    };
}

foreach my $thread (@threads) {
    my $response = $thread->join;
    do_stuff($response);
}
I am essentially kicking off as many threads as there are URLs in the URL list. If there are a million URLs then a million threads will be kicked off. Is this optimal, if not what is an optimal number of threads? Is using threads a good practice for ANY blocking I/O operation that can wait (reading a file, database queries, etc)?
Related Bonus Question
Out of curiosity, do Perl threads work the same as Python's threads and its GIL? With Python, to get the benefit of multithreading and utilize all cores for CPU-intensive tasks, you have to use multiprocessing.
No, but the conclusion is the same. Perl doesn't have a big lock protecting the interpreter across threads; instead it has a duplicate interpreter for each different thread. Since a variable belongs to an interpreter (and only one interpreter), no data is shared by default between threads. When variables are explicitly shared they're placed in a shared interpreter which serializes all accesses to shared variables on behalf of the other threads. In addition to the memory issues mentioned by others here, there are also some serious performance issues with threads in Perl, as well as limitations on the kind of data that can be shared and what you can do with it (see perlthrtut for more info).
The upshot is, if you need to parallelize a lot of IO and you can make it non-blocking, you'll get a lot more performance out of an event loop model than threads. If you need to parallelize stuff that can't be made non-blocking, you'll probably have a lot more luck with multi-process than with perl threads (and once you're familiar with that kind of code, it's also easier to debug).
It's also possible to combine the two models (for example, a mostly-single-process evented app that passes off certain expensive work to child processes using POE::Wheel::Run or AnyEvent::Run, or a multi-process app that has an evented parent managing non-evented children, or a Node Cluster type setup where you have a number of preforked evented webservers, with a parent that just accepts and passes FDs to its children).
There's no silver bullets, though, at least not yet.
From here:
http://perldoc.perl.org/threads.html
Memory consumption
On most systems, frequent and continual creation and destruction of threads can lead to ever-increasing growth in the memory footprint of the Perl interpreter. While it is simple to just launch threads and then ->join() or ->detach() them, for long-lived applications, it is better to maintain a pool of threads, and to reuse them for the work needed, using queues to notify threads of pending work. The CPAN distribution of this module contains a simple example (examples/pool_reuse.pl) illustrating the creation, use and monitoring of a pool of reusable threads.
Let's look at your code. I see three problems with it:
Easy one first: You use ->content instead of ->decoded_content(charset => 'none').
->content returns the raw HTML response body which is useless without information in the headers to decode it (e.g. it might be gzipped). It works sometimes.
->decoded_content(charset => 'none') gives you the actual response. It works always.
You process responses in the order requests were made. That means you could be blocked while responses are waiting to be serviced.
The simplest solution is to place the responses in a Thread::Queue::Any object.
use Thread::Queue::Any qw( );

my $q = Thread::Queue::Any->new();

my $requests = 0;
for my $url (@urls) {
    ++$requests;
    async {
        ...
        $q->enqueue($response);
    };
}

while ($requests && (my $response = $q->dequeue())) {
    --$requests;
    $_->join for threads->list(threads::joinable);
    ...
}
$_->join for threads->list();
You create a lot of threads that are only used once.
There is a significant amount of overhead to that approach. A common multithreading practice is to create a pool of persistent worker threads. These workers perform whatever job needs to be done, then move on to the next job rather than exiting. Jobs are submitted to the pool rather than to a specific thread, so that each job can be started as soon as possible. In addition to removing thread-creation overhead, this allows the number of threads running at a time to be controlled. This is great for CPU-bound tasks.
However, your needs are different since you're using threads to do asynchronous IO. The CPU overhead of thread creation doesn't impact you as much (though it may impose a startup lag). Memory is fairly cheap, but you're still using far more than you need. Threads are really not ideal for this task.
There are much better systems for doing asynchronous IO, but they are not necessarily easily available from Perl. In your specific case, though, you're much better to avoid threads and go with Net::Curl::Multi. Follow the example in the Synopsis, and you'll get a very fast engine capable of making parallel web requests with very little overhead.
At my former employer we switched to Net::Curl::Multi without problems for a high-load, mission-critical web site, and we love it.
It's easy to create a wrapper that creates HTTP::Response objects if you want to limit changes to surrounding code. (This was the case for us.) Note that it helps to keep the documentation of the underlying library (libcurl) handy: the Perl code is a thin layer over that library, the libcurl documentation is very good, and it documents all the options you can provide.
You might simply want to consider a non-blocking user agent. I like Mojo::UserAgent which is part of the Mojolicious suite. You might want to look at an example that I mocked up for a non-blocking crawler for another question.
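For completeness, a hedged sketch of the non-blocking style with Mojo::UserAgent (the URL list is a placeholder, do_stuff() stands in for whatever processing you do with each response, and error handling is omitted):

use Mojo::UserAgent;
use Mojo::IOLoop;

sub do_stuff { my ($body) = @_; print length($body), " bytes\n" }

my @urls = ('http://example.com/', 'http://example.org/');   # placeholder list
my $ua   = Mojo::UserAgent->new;

my $pending = @urls;
for my $url (@urls) {
    $ua->get($url => sub {                       # passing a callback makes the request non-blocking
        my ($ua, $tx) = @_;
        do_stuff($tx->res->body);
        Mojo::IOLoop->stop unless --$pending;    # stop the loop once the last response arrives
    });
}
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

All the requests are in flight at once within a single thread, which sidesteps the per-thread interpreter copies entirely.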

Use cases for ithreads (interpreter threads) in Perl and rationale for using or not using them?

If you want to learn how to use Perl interpreter threads, there's good documentation in perlthrtut (threads tutorial) and the threads pragma manpage. It's definitely good enough to write some simple scripts.
However, I have found little guidance on the web on why and what to sensibly use Perl's interpreter threads for. In fact, there's not much talk about them, and if people talk about them it's quite often to discourage people from using them.
These threads, available when perl -V:useithreads reports useithreads='define' and unleashed by use threads, are also called ithreads, and perhaps more appropriately so, as they are very different from the threads offered by the Linux or Windows operating systems or the Java VM. Nothing is shared by default; instead a lot of data is copied, not just the thread stack, which significantly increases the process size. (To see the effect, load some modules in a test script, then create threads in a loop, pausing for key presses each time around, and watch memory rise in task manager or top.)
[...] every time you start a thread all data structures are copied to
the new thread. And when I say all, I mean all. This e.g. includes
package stashes, global variables, lexicals in scope. Everything!
-- Things you need to know before programming Perl ithreads (Perlmonks 2003)
When researching the subject of Perl ithreads, you'll see people discouraging you from using them ("extremely bad idea", "fundamentally flawed", or "never use ithreads for anything").
The Perl thread tutorial highlights that "Perl Threads Are Different", but it doesn't much bother to explain how they are different and what that means for the user.
A useful but very brief explanation of what ithreads really are is from the Coro manpage under the heading WINDOWS PROCESS EMULATION. The author of that module (Coro - the only real threads in perl) also discourages using Perl interpreter threads.
Somewhere I read that compiling perl with threads enabled will result in a significantly slower interpreter.
There's a Perlmonks page from 2003 (Things you need to know before programming Perl ithreads), in which the author asks: "Now you may wonder why Perl ithreads didn't use fork()? Wouldn't that have made a lot more sense?" This seems to have been written by the author of the forks pragma. Not sure the info given on that page still holds true in 2012 for newer Perls.
Here are some guidelines for usage of threads in Perl I have distilled from my readings (maybe erroneously so):
Consider using non-blocking IO instead of threads, like with HTTP::Async, or AnyEvent::Socket, or Coro::Socket (see the sketch after this list).
Consider using Perl interpreter threads on Windows only, not on UNIX, because on UNIX forks are more efficient both in memory and execution speed.
Create threads at the beginning of the program, not when memory consumption is already considerable - see "ideal way to reduce these costs" in perlthrtut.
Minimize communication between threads because it is slow (all answers on that page).
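To illustrate the first guideline, here is a small sketch with HTTP::Async, as far as I understand its interface (the URL list is a placeholder):

use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

my @urls  = ('http://example.com/', 'http://example.org/');   # placeholder list
my $async = HTTP::Async->new( slots => 20 );                   # up to 20 requests in flight at once

$async->add( HTTP::Request->new( GET => $_ ) ) for @urls;

# wait_for_next_response blocks until one request completes and returns undef when all are done
while ( my $response = $async->wait_for_next_response ) {
    print $response->request->uri, ' => ', $response->code, "\n";
}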
So far my research. Now, thanks for any more light you can shed on this issue of threads in Perl. What are some sensible use cases for ithreads in Perl? What is the rationale for using or not using them?
The short answer is that they're quite heavy (you can't launch 100+ of them cheaply), and they exhibit unexpected behaviours (somewhat mitigated by recent CPAN modules).
You can safely use Perl ithreads by treating them as independent Actors.
Create a Thread::Queue::Any for "work".
Launch multiple ithreads, each with its own "result" Queue, passing them the "work" Queue plus their own "result" Queue by closure.
Load (require) all the remaining code your application requires (not before threads!)
Add work for the threads into the Queue as required.
In "worker" ithreads:
Bring in any common code (for any kind of job)
Blocking-dequeue a piece of work from the Queue
Demand-load any other dependencies required for this piece of work.
Do the work.
Pass the result back to the main thread via the "result" queue.
Back to 2.
If some "worker" threads start to get a little beefy, and you need to cap the number of "worker" threads and launch new ones in their place, create a "launcher" thread first, whose job it is to launch "worker" threads and hook them up to the main thread.
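Here is a compressed sketch of that recipe. It uses the core Thread::Queue instead of Thread::Queue::Any, a single shared "result" queue rather than one per worker, and Digest::MD5 as a stand-in for a demand-loaded dependency; the work items are placeholders:

use strict;
use warnings;
use threads;
use Thread::Queue;

# Create queues and launch workers *before* loading heavy modules, so the clones stay small.
my $work    = Thread::Queue->new();
my $results = Thread::Queue->new();

my @workers = map {
    threads->create(sub {
        while (defined(my $job = $work->dequeue())) {    # blocking dequeue
            require Digest::MD5;                         # demand-load what this job needs
            $results->enqueue("$job => " . Digest::MD5::md5_hex($job));
        }
    });
} 1 .. 4;

# Only now load the rest of the application and feed the queue.
my @jobs = ('alpha', 'beta', 'gamma');                   # placeholder work items
$work->enqueue($_) for @jobs;
$work->enqueue(undef) for @workers;                      # one stop marker per worker

print $results->dequeue(), "\n" for @jobs;
$_->join for @workers;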
What are the main problems with Perl ithreads?
They're a little inconvenient for "shared" data, as you need to do the sharing explicitly (not a big issue).
You need to look out for the behaviour of objects with DESTROY methods as they go out of scope in some thread (if they're still required in another!)
The big one: data/variables that aren't explicitly shared are CLONED into new threads. This is a performance hit and probably not at all what you intended. The workaround is to launch ithreads from a pretty much "pristine" condition (not many modules loaded).
IIRC, there are modules in the Threads:: namespace that help with making dependencies explicit and/or cleaning up cloned data for new threads.
Also, IIRC, there's a slightly different model using ithreads called "apartment" threads, implemented by Thread::Apartment, which has a different usage pattern and another set of trade-offs.
The upshot:
Don't use them unless you know what you're doing :-)
Fork may be more efficient on Unix, but the IPC story is much simpler for ithreads. (This may have been mitigated by CPAN modules since I last looked :-)
They're still better than Python's threads.
There might, one day, be something much better in Perl 6.
I have used perl's "threads" on several occasions. They're most useful for launching some process and continuing on with something else. I don't have a lot of experience in the theory of how they work under the hood, but I do have a lot of practical coding experience with them.
For example, I have a server thread that listens for incoming network connections and spits out a status response when someone asks for it. I create that thread, then move on and create another thread that monitors the system, checking five items, sleeping a few seconds, and looping again. It might take 3-4 seconds to collect the monitor data, then it gets shoved into a shared variable, and the server thread can read that when needed and immediately return the last known result to whomever asks. The monitor thread, when it finds that an item is in a bad state, kicks off a separate thread to repair that item. Then it moves on, checking the other items while the bad one is repaired, and kicking off other threads for other bad items or joining finished repair threads. The main program all the while is looping every few seconds, making sure that the monitor and server threads aren't joinable/still running. All of this could be written as a bunch of separate programs utilizing some other form of IPC, but perl's threads make it simple.
Another place where I've used them is in a fractal generator. I would split up portions of the image using some algorithm and then launch as many threads as I have CPUs to do the work. They'd each stuff their results into a single GD object, which didn't cause problems because they were each working on different portions of the array, and then when done I'd write out the GD image. It was my introduction to using perl threads, and was a good introduction, but then I rewrote it in C and it was two orders of magnitude faster :-). Then I rewrote my perl threaded version to use Inline::C, and it was only 20% slower than the pure C version. Still, in most cases where you'd want to use threads due to being CPU intensive, you'd probably want to just choose another language.
As mentioned by others, fork and threads really overlap for a lot of purposes. Coro, however, doesn't really allow for multi-CPU use or parallel processing like fork and threads do; you'll only ever see your process using 100% of one core. I'm over-simplifying this, but I think the easiest way to describe Coro is that it's a scheduler for your subroutines. If you have a subroutine that blocks, you can hop to another and do something else while you wait. For example, say you have an app that calculates results and writes them to a file. One block might calculate results and push them into a channel. When it runs out of work, another block starts writing them to disk. While that block is waiting on disk, the other block can start calculating results again if it gets more work. Admittedly I haven't done much with Coro; it sounds like a good way to speed some things up, but I'm a bit put off by not being able to do two things at once.
My own personal preference if I want to do multiprocessing is to use fork if I'm doing lots of small or short things, threads for a handful of large or long-lived things.

What are the benefits of multi-processing when we already have multi-threading?

I'm confused whether using multiple processes for a web application will improve the performance. Apache's mod_wsgi provides an option to set the number of processes to be started for the daemon process group. I used fastcgi with lighttpd before and it also had an option to configure the max number of processes for each fastcgi application.
While I don't know how multi-processing is better, I do know something bad about it compared to the single-process multi-threading model. For example, logging is harder to implement in a multi-processing scenario (link), especially when you also want log rotation. And since memory can't be shared, if you cache something in memory (the most straightforward way), you end up with multiple duplicate copies.
Do multiple processes better utilize multi-core computing power, or do they yield higher throughput? Or are they just there for some single threaded applications?
In the case of Python, or more specifically CPython as used by mod_wsgi, the issue is the Python GIL. Although you may have multiple threads in Python, the global interpreter lock effectively means that only one thread can be running Python code at a time. This means it cannot make use of multiple processors/cores properly on a system. Using multiple processes however does allow you to use all those processors/cores.
That said, for mod_wsgi it isn't all Python code but has a lot of C code and with Apache also being C code. During execution of the C code, the GIL is unlocked by that thread meaning that a thread running in C code can run in parallel to a thread running in Python code. Still not the best one can achieve, but can still make partial use of all the processors/core on your system.
For more details on this in relation to mod_wsgi read:
http://blog.dscpl.com.au/2007/09/parallel-python-discussion-and-modwsgi.html
http://blog.dscpl.com.au/2007/07/web-hosting-landscape-and-modwsgi.html
Multi-processing is less efficient than multi-threading, but it's more resilient to failure. Each process gets its own independent memory space and may be terminated and restarted (recycled) independent of other processes.

Threads vs Processes in Linux [closed]

I've recently heard a few people say that in Linux, it is almost always better to use processes instead of threads, since Linux is very efficient in handling processes, and because there are so many problems (such as locking) associated with threads. However, I am suspicious, because it seems like threads could give a pretty big performance gain in some situations.
So my question is, when faced with a situation that threads and processes could both handle pretty well, should I use processes or threads? For example, if I were writing a web server, should I use processes or threads (or a combination)?
Linux uses a 1-1 threading model, with (to the kernel) no distinction between processes and threads -- everything is simply a runnable task. *
On Linux, the system call clone clones a task, with a configurable level of sharing, among which are:
CLONE_FILES: share the same file descriptor table (instead of creating a copy)
CLONE_PARENT: don't set up a parent-child relationship between the new task and the old (otherwise, child's getppid() = parent's getpid())
CLONE_VM: share the same memory space (instead of creating a COW copy)
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing). **
forking costs a tiny bit more than pthread_create-ing because of copying tables and creating COW mappings for memory, but the Linux kernel developers have tried (and succeeded) to minimize those costs.
Switching between tasks, if they share the same memory space and various tables, will be a tiny bit cheaper than if they aren't shared, because the data may already be loaded in cache. However, switching tasks is still very fast even if nothing is shared -- this is something else that Linux kernel developers try to ensure (and succeed at ensuring).
In fact, if you are on a multi-processor system, not sharing may actually be beneficial to performance: if each task is running on a different processor, synchronizing shared memory is expensive.
* Simplified. CLONE_THREAD causes signal delivery to be shared (which needs CLONE_SIGHAND, which shares the signal handler table).
** Simplified. There exist both SYS_fork and SYS_clone syscalls, but in the kernel, the sys_fork and sys_clone are both very thin wrappers around the same do_fork function, which itself is a thin wrapper around copy_process. Yes, the terms process, thread, and task are used rather interchangeably in the Linux kernel...
Linux (and indeed Unix) gives you a third option.
Option 1 - processes
Create a standalone executable which handles some part (or all parts) of your application, and invoke it separately for each process, e.g. the program runs copies of itself to delegate tasks to.
Option 2 - threads
Create a standalone executable which starts up with a single thread and create additional threads to do some tasks
Option 3 - fork
Only available under Linux/Unix, this is a bit different. A forked process really is its own process with its own address space - there is nothing that the child can do (normally) to affect its parent's or siblings' address space (unlike a thread) - so you get added robustness.
However, the memory pages are not copied, they are copy-on-write, so less memory is usually used than you might imagine.
Consider a web server program which consists of two steps:
Read configuration and runtime data
Serve page requests
If you used threads, step 1 would be done once, and step 2 done in multiple threads. If you used "traditional" processes, steps 1 and 2 would need to be repeated for each process, and the memory to store the configuration and runtime data duplicated. If you used fork(), then you can do step 1 once, and then fork(), leaving the runtime data and configuration in memory, untouched, not copied.
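In Perl terms, a hedged sketch of that fork() pattern might look like this (the configuration and the serve_requests() routine are stand-ins for steps 1 and 2):

use strict;
use warnings;

# Step 1: read configuration and runtime data once, in the parent.
my %config = ( workers => 4, docroot => '/srv/www' );    # placeholder config

# Step 2: fork the workers. Each child shares the parent's pages copy-on-write,
# so %config is not physically duplicated unless someone writes to it.
my @kids;
for (1 .. $config{workers}) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                 # child
        serve_requests(\%config);    # hypothetical request-serving loop
        exit 0;
    }
    push @kids, $pid;                # parent keeps the pid
}

waitpid($_, 0) for @kids;

sub serve_requests {
    my ($cfg) = @_;
    # accept connections and serve pages using $cfg ...
}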
So there are really three choices.
That depends on a lot of factors. Processes are more heavy-weight than threads, and have a higher startup and shutdown cost. Interprocess communication (IPC) is also harder and slower than interthread communication.
Conversely, processes are safer and more secure than threads, because each process runs in its own virtual address space. If one process crashes or has a buffer overrun, it does not affect any other process at all, whereas if a thread crashes, it takes down all of the other threads in the process, and if a thread has a buffer overrun, it opens up a security hole in all of the threads.
So, if your application's modules can run mostly independently with little communication, you should probably use processes if you can afford the startup and shutdown costs. The performance hit of IPC will be minimal, and you'll be slightly safer against bugs and security holes. If you need every bit of performance you can get or have a lot of shared data (such as complex data structures), go with threads.
Others have discussed the considerations.
Perhaps the important difference is that in Windows processes are heavy and expensive compared to threads, and in Linux the difference is much smaller, so the equation balances at a different point.
Once upon a time there was Unix and in this good old Unix there was lots of overhead for processes, so what some clever people did was to create threads, which would share the same address space with the parent process and they only needed a reduced context switch, which would make the context switch more efficient.
In a contemporary Linux (2.6.x) there is not much difference in performance between a context switch of a process compared to a thread (only the MMU stuff is additional for the thread).
There is the issue with the shared address space, which means that a faulty pointer in a thread can corrupt memory of the parent process or another thread within the same address space.
A process is protected by the MMU, so a faulty pointer will just cause a signal 11 and no corruption.
I would in general use processes (not much context switch overhead in Linux, but memory protection due to MMU), but pthreads if I would need a real-time scheduler class, which is a different cup of tea all together.
Why do you think threads have such a big performance gain on Linux? Do you have any data for this, or is it just a myth?
I think everyone has done a great job responding to your question. I'm just adding more information about thread versus process in Linux to clarify and summarize some of the previous responses in the context of the kernel. So, my response concerns kernel-specific code in Linux. According to the Linux kernel documentation, there is no clear distinction between thread and process except that a thread uses a shared virtual address space, unlike a process. Also note, the Linux kernel uses the term "task" to refer to process and thread in general.
"There are no internal structures implementing processes or threads, instead there is a struct task_struct that describe an abstract scheduling unit called task"
Also, according to Linus Torvalds, you should NOT think about process versus thread at all, because it's too limiting and the only difference is the COE, or Context of Execution, in terms of "separate the address space from the parent" or shared address space. In fact he uses a web server example to make his point here (which I highly recommend reading).
Full credit to the Linux kernel documentation
If you want to create as pure a process as possible, you would use clone() and set all the clone flags. (Or save yourself the typing effort and call fork().)
If you want to create as pure a thread as possible, you would use clone() and clear all the clone flags. (Or save yourself the typing effort and call pthread_create().)
There are 28 flags that dictate the level of resource sharing. This means that there are over 268 million flavours of tasks that you can create, depending on what you want to share.
This is what we mean when we say that Linux does not distinguish between a process and a thread, but rather refers to any flow of control within a program as a task. The rationale for not distinguishing between the two is, well, that you cannot uniquely define over 268 million flavours!
Therefore, making the "perfect decision" of whether to use a process or thread is really about deciding which of the 28 resources to clone.
How tightly coupled are your tasks?
If they can live independently of each other, then use processes. If they rely on each other, then use threads. That way you can kill and restart a bad process without interfering with the operation of the other tasks.
To complicate matters further, there is such a thing as thread-local storage, and Unix shared memory.
Thread-local storage allows each thread to have a separate instance of global objects. The only time I've used it was when constructing an emulation environment on Linux/Windows for application code that ran in an RTOS. In the RTOS each task was a process with its own address space; in the emulation environment, each task was a thread (with a shared address space). By using TLS for things like singletons, we were able to have a separate instance for each thread, just like under the 'real' RTOS environment.
Shared memory can (obviously) give you the performance benefits of having multiple processes access the same memory, but at the cost/risk of having to synchronize the processes properly. One way to do that is have one process create a data structure in shared memory, and then send a handle to that structure via traditional inter-process communication (like a named pipe).
In my recent work with Linux, one thing to be aware of is libraries. If you are using threads, make sure any libraries you may use across threads are thread-safe. This burned me a couple of times. Notably, libxml2 is not thread-safe out of the box. It can be compiled to be thread-safe, but that is not what you get with aptitude install.
I'd have to agree with what you've been hearing. When we benchmark our cluster (xhpl and such), we always get significantly better performance with processes over threads. </anecdote>
The decision between thread/process depends a little bit on what you will be using it for.
One of the benefits with a process is that it has a PID and can be killed without also terminating the parent.
For a real-world example of a web server, Apache 1.3 used to only support multiple processes, but in 2.0 they added an abstraction so that you can switch between either. Comments seem to agree that processes are more robust but threads can give a little bit better performance (except for Windows, where performance for processes sucks and you only want to use threads).
For most cases I would prefer processes over threads.
Threads can be useful when you have relatively small tasks (process overhead >> time taken by each divided task unit) and there is a need for memory sharing between them. Think of a large array.
Also (off-topic), note that if your CPU utilization is at or close to 100 percent, there is going to be no benefit from multithreading or multiprocessing (in fact, it will make things worse).
Threads --> Threads share a memory space; they are an abstraction of the CPU; they are lightweight.
Processes --> Processes have their own memory space; they are an abstraction of a computer.
To parallelise a task you need to abstract a CPU.
However, the advantages of using a process over a thread are security and stability, while a thread uses less memory than a process and offers lower latency.
An example in terms of the web would be Chrome and Firefox.
In the case of Chrome each tab is a new process, hence Chrome's memory usage is higher than Firefox's, while the security and stability provided are better than Firefox's.
The security provided by Chrome is better: since each tab is a new process, one tab cannot snoop into the memory space of another.
Multi-threading is for masochists. :)
If you are concerned about an environment where you are constantly creating threads/forks, perhaps like a web server handling requests, you can pre-fork processes, hundreds if necessary. Since they are copy-on-write and use the same memory until a write occurs, it's very fast. They can all block, listening on the same socket, and the first one to accept an incoming TCP connection gets to run with it. With g++ you can also assign functions and variables to be placed close together in memory (hot segments) to ensure that when you do write to memory, and cause an entire page to be copied, at least the subsequent write activity will occur on the same page. You really have to use a profiler to verify that kind of stuff, but if you are concerned about performance, you should be doing that anyway.
Development time of threaded apps is 3x to 10x longer due to the subtle interactions on shared objects, threading "gotchas" you didn't think of, and the fact that they are very hard to debug because you cannot reproduce thread-interaction problems at will. You may have to do all sorts of performance-killing checks, like having invariants in all your classes that are checked before and after every function, and you halt the process and load the debugger if something isn't right. Most often it's embarrassing crashes that occur during production, and you have to pore through a core dump trying to figure out which threads did what. Frankly, it's not worth the headache when forking processes is just as fast and implicitly thread-safe unless you explicitly share something. At least with explicit sharing you know exactly where to look if a threading-style problem occurs.
If performance is that important, add another computer and load balance. For the developer cost of debugging a multi-threaded app, even one written by an experienced multi-threader, you could probably buy four 40-core Intel motherboards with 64 GB of memory each.
That being said, there are asymmetric cases where parallel processing isn't appropriate, like when you want a foreground thread to accept user input and show button presses immediately, without waiting for some clunky back-end GUI to keep up. That is a sexy use of threads where multiprocessing isn't geometrically appropriate. Many things like that are just variables or pointers. They aren't "handles" that can be shared in a fork. You have to use threads. Even if you did fork, you'd be sharing the same resource and subject to threading-style issues.
If you need to share resources, you really should use threads.
Also consider the fact that context switches between threads are much less expensive than context switches between processes.
I see no reason to explicitly go with separate processes unless you have a good reason to do so (security, proven performance tests, etc...)

Resources