What is the exact difference between thread and process in Linux? - linux

What is the difference between a thread and a process? I know you think this is a duplicate, but seriously, nowhere has anybody given one exact and to-the-point answer. I am really tired of hearing that threads are lightweight processes, and even books do not clearly explain why.
So I went through different Stack Overflow answers and put together a final answer. I feel it is incomplete. Please contribute to this answer, as it will really help the thousands of people who are still struggling to understand this puzzle.
My research has reached this far so far:
Linux calls the clone API for creation of both a thread and a process; the only difference is that there are 28 flags which are configured differently for a thread than for a process.
https://stackoverflow.com/a/64942352

I'm not qualified to speak about the implementation of Linux threads. (I've heard that it has been evolving over the past decade or so.) But, the purpose of threads and processes has remained pretty stable:
A thread is an execution path through your program's code. That's pretty much all there is to say about that.
A process is a container for all of the resources needed by an instance of your program. The process has a virtual address space, it has open file handles, it may have sockets, it may have memory mappings, it may own IPC objects, and I'm not sure what all else. And a process has some number of threads. (Always at least one thread, sometimes more.)
The threads in your program do all the work. The process is the place in which they do the work.
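The clone-flags point in the question can be made concrete with a small sketch. This is not the exact set of flags glibc's pthread_create passes (it uses more, including CLONE_THREAD and CLONE_SETTLS); it only illustrates that the same clone() call produces either a new process or something thread-like depending on the flags. The stack size and child function are illustrative.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg) {
        (void)arg;
        printf("child running, pid=%ld\n", (long)getpid());
        return 0;
    }

    int main(void) {
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);       /* child stack; grows downward */

        /* Process-like: the child gets its own copies of memory, files, etc.
         * (roughly what fork() does). */
        int pid = clone(child_fn, stack + stack_size, SIGCHLD, NULL);

        /* Thread-like: share the address space, file table, filesystem info
         * and signal handlers with the parent, roughly what pthread_create
         * asks for:
         *
         *   clone(child_fn, stack + stack_size,
         *         CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
         *         NULL);
         */

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }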

Related

Why do we need both threading and asynchronous calls

My understanding is that threads exist as a way of doing several things in parallel within the same address space, each with its own stack. Asynchronous programming is basically a way of using fewer threads. I don't understand why it's undesirable to just have blocking calls and a separate thread for each blocked command.
For example, suppose I want to scrape a large part of the web. A presumably uncontroversial implementation might be to have a large number of asynchronous loops. Each loop would ask for a webpage, wait to avoid overloading the remote server, then ask for another webpage from the same website until done. The loops would then be executed on a much smaller number of threads (which is fine because they mostly wait). So to restate the question, I don't see why it's any cheaper to e.g. maintain a threadpool within the language runtime than it would be to just have one (mostly blocked) OS thread per loop and let the operating system deal with the complexity of scheduling the work? After all, if piling two different schedulers on top of each other is so good, it can still be implemented that way in the OS :-)
It seems obvious the answer is something like "threads are expensive". However, a thread just needs to keep track of where it has got to before it was interrupted the last time. This is pretty much exactly what an asynchronous command needs to know before it blocks (perhaps representing what happens next by storing a callback). I suppose one difference is that a blocking asynchronous command does so at a well defined point whereas a thread can be interrupted anywhere. If there really is a substantial difference in terms of the cost of keeping the state, where does it come from? I doubt it's the cost of the stack since that wastes at most a 4KB page, so it's a non-issue even for 1000s of blocked threads.
Many thanks, and sorry if the question is very basic. It might be I just cannot figure out the right incantation to type into Google.
Threads consume memory, as their state has to be preserved even when they aren't doing anything. If you make an asynchronous call on a single thread, it consumes essentially no resources (aside from the bookkeeping that records the call was made) until it needs to be resolved, because nothing has to be held for it while it isn't actively being processed.
If the architecture of your application is written in a way that the resources it needs scale linearly (or worse) with the number of users / traffic you receive, then that is going to be a problem. You can watch this talk about node.js if you want to watch someone talk at length about this.
https://www.youtube.com/watch?v=ztspvPYybIY
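To make the per-thread cost discussed above concrete, here is a minimal POSIX C sketch of the thread-per-task approach with an explicitly shrunken stack. By default glibc reserves on the order of 8 MiB of virtual address space per thread stack, and each blocked thread also keeps a kernel task and scheduler bookkeeping alive, which a suspended asynchronous callback does not. The task count, stack size, and fetch_one placeholder are illustrative.

    /* thread-per-task sketch: one OS thread per (mostly blocked) unit of work */
    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS 1000

    static void *fetch_one(void *arg) {
        (void)arg;              /* stand-in for: request page, wait, repeat */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 64 * 1024);  /* shrink the default stack */

        pthread_t tids[NTASKS];
        for (int i = 0; i < NTASKS; i++)
            pthread_create(&tids[i], &attr, fetch_one, NULL);
        for (int i = 0; i < NTASKS; i++)
            pthread_join(tids[i], NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }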

GC not able to collect back memory using fork-emulation on Windows

Let me begin by saying I do not have in-depth knowledge of Perl, so please pardon me if there is something obvious that I have missed :)
In the system (running in a Windows environment) that I am looking at, we have a Perl process which has to download ~5000-6000 files. Since each file can be downloaded independently, we fork a separate thread for each file. The thread is supposed to download the file and die. On running the process, I noticed that its memory use goes up to ~1.7 GB and it then dies due to the per-process memory limit.
On searching and asking a few people, I came across the concept of circular references, because of which the garbage collector will not free up memory. I searched a bit and found the Devel::Cycle package, which can find out if there are any cycles in an object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I got to know that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process still continues to run fine till it reaches the memory limit). I guess there might be more db handles from the parent in the object that are not used by the child but are still referenced.
Questions that I have:
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow up in memory (in other words is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHandle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (due to the Perl fork call on Windows) are spawned at the same time. The process spawns at most n threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another one. At the moment n is set to 10; however, I changed n to 1 (so only one extra thread runs at a time), and I still hit the memory limit.
edit: Turns out this does not solve the OP's problem. It might still be helpful for a future reader.
We do not really know a lot about your situation, and your program sounds too complex to me to simply fork it 6000 times. But I will still attempt to answer; please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note that Windows has no fork() system call. Since you specifically say that you "fork", I assume that you actually use that Perl command. On Windows, it will try to emulate fork() as best as it can, but what that basically means is that all the forked processes you see are in fact just threads within the original process, merely pretending to be processes. To do this, they copy the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. Especially the following part seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork so many pseudo-processes, you need a lot of memory, because the interpreter state has to be copied just as many times. Depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7 GB you mentioned is not far from the 2 GB that some Windows versions impose as the memory limit for a single process.
My wild guess would be that you in fact just hit that limit by spawning all those threads, each with its own copy of the interpreter state and everything.
You will probably be a lot better off using a threading library instead of asking Perl to emulate individual processes for you. Needless to mention (I hope), you do not really gain any advantage by having 6000 threads over, let's say, 16. If you try to have all of them do something at the same time, you will most likely experience slowdowns, depending on how the threading is handled.
In addition to the comments already provided, I want to emphasize the point DeVadder made regarding the behavior of fork on Windows, and that Perl threading is likely a better solution. But are you sure that the DBD module is safe to use from multiple processes / forks / threads without setting some extra parameters?
I had a similar error when using the DBI module to access an SQLite DB in multi-processed code using the threads module. It was solved by setting the 'use_immediate_transaction' option for the database handle provided by DBI to 1. If you aren't familiar with how Perl threads work: they aren't real threads, they create a copy of the interpreter and everything you have in memory at the time of their creation. But even when I made the database handle separately in each "thread", I would get 'database locked' and various other errors. Without some of these extra options, DBD may not function correctly in a multiprocessed environment.
Also, why make 6000 forks? Use Thread::Queue and the threads module, make a worker pool of a few workers (one per core?), and recycle the workers. You are paying a lot of overhead on every fork for no gain.
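As a rough illustration of the worker-pool pattern suggested above (shown here in C with pthreads rather than the Perl threads / Thread::Queue modules), a fixed number of workers pull jobs from a shared list instead of one fork or thread per file. The job type, worker count, and download placeholder are illustrative.

    /* worker-pool sketch: NWORKERS threads share NJOBS units of work */
    #include <pthread.h>

    #define NWORKERS 4
    #define NJOBS    6000

    static int jobs[NJOBS];                 /* stand-in for the list of files */
    static int next_job = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void download(int job) {
        (void)job;                          /* placeholder for the real work */
    }

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            if (next_job >= NJOBS) {        /* nothing left to claim */
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            int job = jobs[next_job++];     /* claim the next job */
            pthread_mutex_unlock(&lock);
            download(job);                  /* do the work outside the lock */
        }
    }

    int main(void) {
        for (int i = 0; i < NJOBS; i++)
            jobs[i] = i;

        pthread_t tids[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }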

Definition of Multi-threading

This question is not really programming related, but I still hope it somehow fits here :).
I wrote the following sentence in my work:
Multithreading refers to the ability of an OS to subdivide an application into threads, where each of them is capable of executing independently.
I was told that this definition of a thread is too narrow. I am not really sure why this is the case; could somebody be so kind as to explain what I missed?
Thank you
Usually, it is the application that decides when to create threads, not the OS. Also, you may want to mention that threads share address space, while each process has its own.
A thread, fundamentally, is a saved execution context: a set of saved registers and a stack that you can resume and continue executing. This thread can be executed on a processor (these days, many machines can of course execute multiple threads at the same time).
The critical aspect of "multi-threading" is, that an operating system can emulate execution of many threads at the same time, by preempting (stopping) a thread once it has run for a certain amount of time (a "quantum"), then scheduling another thread to run, based on a certain algorithm that is OS-specific.
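A minimal POSIX C sketch tying together the points above: the application (not the OS) asks for the thread, the new thread shares the process's address space, and each thread runs on its own stack. The variable names are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;      /* visible to every thread in the process */

    static void *worker(void *arg) {
        (void)arg;
        int local = 42;                 /* lives on this thread's private stack */
        shared_counter++;               /* same memory main() sees */
        printf("worker: local=%d shared=%d\n", local, shared_counter);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);  /* created by the program, not the OS */
        pthread_join(tid, NULL);
        printf("main: shared=%d\n", shared_counter);
        return 0;
    }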

fork in multi-threaded program

I've heard that mixing forking and threading in a program can be very problematic, often resulting in mysterious behavior, especially when dealing with shared resources such as locks, pipes, and file descriptors. But I never fully understood what exactly the dangers are and when they can happen. It would be great if someone with expertise in this area could explain in a bit more detail what the pitfalls are and what needs care when programming in such an environment.
For example, if I want to write a server that collects data from various different sources, one solution I've thought of is to have the server spawn a set of threads, each using popen to call out to another program to do the actual work and opening pipes to get the data back from the child. Each of these threads is responsible for its own work, no data is exchanged between them, and when the data is collected, the main thread has a queue and these worker threads just put their results in it. What could go wrong with this solution?
Please do not narrow your answer by just "answering" my example scenario. Any suggestions, alternative solutions, or experiences that are not related to the example but helpful to provide a clean design would be great! Thanks!
The problem with forking when you do have some threads running is that the fork only copies the CPU state of the one thread that called it. It's as if all of the other threads just died, instantly, wherever they may be.
The result of this is locks aren't released, and shared data (such as the malloc heap) may be corrupted.
pthread does offer a pthread_atfork function - in theory, you could take every lock in the program before forking, release them after, and maybe make it out alive - but it's risky, because you could always miss one. And, of course, the stacks of the other threads won't be freed.
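A hedged sketch of the pthread_atfork() approach mentioned above: acquire the lock in the prepare handler so the child can never inherit it in a locked, half-updated state, then release it in both parent and child after the fork. The single heap_lock here stands in for "every lock in the program", which is exactly what is hard to enumerate correctly in real code.

    #include <pthread.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

    static void prepare(void) { pthread_mutex_lock(&heap_lock); }    /* runs before fork() */
    static void parent(void)  { pthread_mutex_unlock(&heap_lock); }  /* runs after, in the parent */
    static void child(void)   { pthread_mutex_unlock(&heap_lock); }  /* runs after, in the child */

    int main(void) {
        pthread_atfork(prepare, parent, child);

        pid_t pid = fork();
        if (pid == 0) {
            /* child: only the forking thread exists here; heap_lock is known
             * to be unlocked, so it is safe to take it again or to exec() */
            _exit(0);
        }
        wait(NULL);
        return 0;
    }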
It is really quite simple. Problems with multiple threads and processes always arise from shared data. If there is no shared data, then no issues can arise.
In your example the shared data is the queue owned by the main thread - any potential contention or race conditions will arise here. Typical methods for "solving" these issues involve locking schemes - a worker thread will lock the queue before inserting any data, and the main thread will lock the queue before removing it.
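A minimal POSIX C sketch of the locking scheme described above: worker threads push results into the shared queue under a mutex, the main thread pops them under the same mutex, and a condition variable lets the main thread sleep until data arrives. The fixed-size ring buffer and the int payload are illustrative simplifications.

    #include <pthread.h>

    #define QCAP 128

    static int queue[QCAP];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qnonempty = PTHREAD_COND_INITIALIZER;

    /* called by worker threads */
    void queue_put(int result) {
        pthread_mutex_lock(&qlock);
        if (count < QCAP) {                   /* drop on overflow, to keep the sketch short */
            queue[tail] = result;
            tail = (tail + 1) % QCAP;
            count++;
            pthread_cond_signal(&qnonempty);  /* wake the main thread if it is waiting */
        }
        pthread_mutex_unlock(&qlock);
    }

    /* called by the main thread */
    int queue_get(void) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&qnonempty, &qlock);
        int result = queue[head];
        head = (head + 1) % QCAP;
        count--;
        pthread_mutex_unlock(&qlock);
        return result;
    }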

Practical uses for threads [closed]

In your work, what specifically have you used threads for?
(Please give a description of the application and how the thread helped/enhanced the application.)
Threads are critical for most UI work. Otherwise, any time you want to do a calculation or anything that takes a while, you will freeze the UI.
Therefore, most GUI frameworks have a UI thread that handles the event loop (and some drawing activities), but most user code takes place in another thread.
Threads are also useful for occasionally checking things or making episodic changes to the state of the system.
(Less serious answer) I like to use threads in any situation where I want the system to fall on its arse in interesting and unobvious ways, while still having plausible deniability as to how I could have let the problem slip through.
Or, in the words of Rasmus Lerdorf, "People aren't smart enough to write thread-safe code".
Handling concurrent client requests in a server.
Threads are fundamental for most I/O-bound applications and for any reasonably complex server-side application. Consider an application that acts as an exchange for information with multiple sources of data. You need to be able to deal with this information in independent threads, in particular if operations on this data are subject to latencies or require a significant amount of time to complete.
In most cases, threads help to decouple various concerns within the application. A single thread dispatching events to interested parties will not scale well in the vast majority of cases.
All but the simplest of applications will require threading to some degree.
The most common use is for a responsive UI, like showing a progress bar for a long-running background task.
Background tasks:
Handling network connections and protocols.
Doing sound synthesis running in the background of a multimedia application.
Doing file loading in the background in a multimedia application (CD streaming).
Other uses:
Accelerating certain algorithms by running two instances of the same code in two different threads.
I know that most of the time I use threads, what I actually want to do is launch some asynchronous lump of work - i.e. I want something to happen in the mythical "background". Unfortunately, thinking about threads isn't really the correct level of abstraction for "launch some lump of work", because you're not just putting something into the background. With a thread API you're creating another place to run stuff as a sibling of the process's original thread, and you need to worry about what information is shared between them, and how, and so on. That's why I'm enjoying newer APIs such as Cocoa's NSOperation and NSOperationQueue. With that API, launching some lump of work is just a single line, and the library takes care of whether a new thread should be launched or an old one reused.
To scan directories looking for changed files. It is a lot quicker to spawn a thread per subdirectory than to do it in a single thread.
We have been using threads for several applications where the main screen was composed of a workflow tailored to the currently logged-on user.
Getting the workflow can be time intensive. Various parts of the workflow get loaded by different threads. For our main application BP/GeNA, about 11 threads are fired, each running a database query.
Regards,
Lieven
I most commonly use threads when I want to do something with a bunch of resources that I know will take a long time, when there's no interdependence between the work to process the elements, and particularly if the bottleneck isn't a local resource (like CPU or disk). For example, if I've got a bunch of URLs to retrieve, each of those will go into a separate thread.
This is a very general question. I've used "threads" to take potentially blocking work off of the UI thread, whether that work is local or network i/o or that work is computationally intensive tasks which will tend to "block" depending on the hardware it is run on.
I think it is more interesting to ask about a particular problem, a pattern that helps alleviate it, and how threads apply, e.g.:
How are threads relevant to model-view-controller?
How or why should I take work off of a UI thread to ensure that the UI doesn't even think of blocking?
How can a thread pool be useful for recursive (network) directory traversal, as someone else alluded to?
Should I be affinitizing threads for cooperative scheduling of computationally intensive tasks, or should I be using a thread pool and let the OS preemptively schedule the threads as it sees fit?
It's a pretty broad space and more clarity would likely help.
I build web applications, so all code that I write is executed in multiple threads.
Our app is a web service, so we spawn off a thread per request. Technically, JNI spawns off the thread, but the code has to be thread-safe regardless. We've run into some interesting (FSVO) issues with both Hibernate and our ESB-based infrastructure, but for the most part keeping things in ThreadLocals and synchronizing on subsystem entry points has worked pretty well. We haven't tried more than a couple dozen simultaneous requests, so there may well be some race conditions we haven't identified yet, but overall we're performant and give the right answers.
I wrote a function that spawns a thread that beeps (from the speaker) at regular intervals to alert a test operator that something needs attention. After the modal dialog is responded to, the thread is killed.
Not employment related, but I'm doing some side work on the Netflix Prize. My computer has 8 cores and 20 GB of RAM ... running only 1 thread would be an utter waste, so I typically start 16 threads or so.
