Statistics for a randomized task on multiple cores - statistics

Suppose the time to complete a task on a processor core follows a distribution with mean m and standard deviation s. If the same task runs on n cores, what are the mean and standard deviation of the time it takes to complete the task? (The task is finished as soon as one of the cores finishes it.)

This is more of a statistics question than anything else. Without information on the distribution function of the time t a single task needs to complete, I can only give you a hint: you need to calculate the distribution function of the minimum of t over n of your tasks, as seen here. Using that you can then calculate the mean and the standard deviation.
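As a concrete hint: if F is the distribution function of a single core's time, the minimum over n independent cores has distribution function 1 - (1 - F(t))^n. A rough Monte Carlo sketch in Python of the whole idea, assuming (purely for illustration, since the question does not specify a distribution) that a single core's time is gamma-distributed with the given mean m and standard deviation s:

# Monte Carlo sketch: estimate the mean and standard deviation of the
# minimum completion time over n cores. The gamma distribution is only an
# illustrative assumption matched to the given mean and standard deviation.
import random
import statistics

def simulate_min_time(m, s, n, trials=100_000):
    shape = (m / s) ** 2    # gamma shape k = m^2 / s^2, so mean = m
    scale = s ** 2 / m      # gamma scale theta = s^2 / m, so std dev = s
    mins = [
        min(random.gammavariate(shape, scale) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.mean(mins), statistics.stdev(mins)

if __name__ == "__main__":
    mean_min, std_min = simulate_min_time(m=10.0, s=2.0, n=4)
    print(f"mean of minimum ~ {mean_min:.3f}, std dev of minimum ~ {std_min:.3f}")

For this particular choice of distribution, both numbers come out smaller than the single-core m and s, which is the qualitative effect of racing n cores against each other.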
PS: Is this homework?
EDIT:
Whether - and how much - it's worth to use multiple cores, depends on several things:
What you need to do. If you have to run the same program with different inputs, launching multiple instances makes a lot of sense. It might not cut the overall time down to 1/n, and each experiment will still need at least as much time as before, but the time needed for the whole series will be significantly less.
If, on the other hand, you are hoping to run the same task with e.g. a different seed and keep the one that converges the fastest, you will probably gain far less, as estimated by the first part of my answer.
How well you have parallelized your tasks. n completely independent tasks is the ideal scenario. n threads with multiple synchronization points etc. are not going to be nearly as efficient.
How well your hardware can handle multiple tasks. For example, if each of these tasks needs a lot of memory, it will probably be faster to use a single core only than to force the system to use the swap space/pagefile/whatever your OS calls it by running multiple instances at once.


Multiprocessing: why doesn't a single thread just use more cpu?

I'm learning about multiprocessing and it seems to be applicable in one of two scenarios:
our program is waiting for some I/O, so it makes sense to go do something else while waiting;
we break our program up so that individual parts of it can run "in parallel", in an attempt to take full advantage of the cpu
My confusion is about the second case. I'm probably just lacking in my understanding of how cpus really work: but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
We don't know how to. There seem to be fundamental limitations to how fast we can do things that we haven't quite figured out how to get around. So instead, we do more than one thing at a time.
It takes a woman 9 months to make a baby. So if you want lots of babies, you get lots of women. You don't try to get one woman to go faster.
Say you want to raise 7 to the twenty-millionth power and also raise 11 to the twenty-millionth power. You can reduce the number of steps in each of these two operations, but you will reach a limit. Say each operation takes N sequential steps (each requiring the output from the previous step as its input) and the fastest we can do a single step is Q nanoseconds. With one thread, it will take at least 2NQ nanoseconds to perform both operations. With two threads, we can do one step from each of the two operations at the same time, reducing the minimum time to N*Q nanoseconds.
That's a big win.
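For instance, a minimal Python sketch of that two-operation example using two worker processes; the exponent is far smaller than the twenty million above so the demo finishes quickly, and each worker returns only the bit length so the huge integer result does not have to be shipped back to the parent:

# Compare doing the two power computations one after the other versus
# one in each of two worker processes.
import time
from multiprocessing import Pool

EXPONENT = 2_000_000  # illustrative size only

def big_power(base):
    return pow(base, EXPONENT).bit_length()

if __name__ == "__main__":
    start = time.perf_counter()
    big_power(7)
    big_power(11)                          # the two operations, sequentially
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with Pool(processes=2) as pool:
        pool.map(big_power, [7, 11])       # both operations at the same time
    parallel = time.perf_counter() - start

    print(f"serial: {serial:.2f}s  parallel: {parallel:.2f}s")

On a machine with at least two free cores the parallel run should approach half the serial time, minus the process start-up overhead.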
I might be wrong, but when we split things into threads, we want to make use of the multi-core architecture of our CPUs.
We mostly think of a CPU as a single unit, but you must have heard how an i5 is a quad-core processor, meaning it has 4 cores, while an i3 is a dual-core processor with only two cores.
So the aggregate CPU utilization for a quad-core would be 100% split into 4x25%. There's a difference between concurrency and parallelism. Parallel means each thread runs on a separate core, making full use of it. Now you have 4 people doing one job; a better analogy is that there are 4 printers in the office, and 4 people can go ahead and get the copies that they want. This is parallelism.
Using that same analogy, let's extend it to just one copier/printer and 4 people who want to make copies. What we do is use concurrency: we print each requested copy, but only 25% of it, then we switch to the next person, then the next, and then the next; this takes 4 iterations for all the copies to get printed. Even though we utilized 100% of the copier's capability, our guys still had to wait. This waiting time also depends on the length of the document they wanted to print, so we use something like pre-emption: you can only execute/print for a certain amount of time before we start printing for the next guy.
Speeding up a single process by allocating it 100% of the CPU is not a problem [although we also want to run a bunch of other stuff like the GUI, play music, run system services etc., so 85% is doable], but the execution time becomes roughly 1/4th when the work is distributed between the cores. Imagine you have to print a 400-page book and you have 4 copiers: you use the 4 copiers to print 100 pages each. That will be faster, right?
I hope I made some sense. Going to sleep.

Does it make sense to write concurrent program if you have 1 hardware thread? [duplicate]

What is the difference between concurrency and parallelism?
Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant. For example, multitasking on a single-core machine.
Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.
Quoting Sun's Multithreaded Programming Guide:
Concurrency: A condition that exists when at least two threads are making progress. A more generalized form of parallelism that can include time-slicing as a form of virtual parallelism.
Parallelism: A condition that arises when at least two threads are executing simultaneously.
Why the Confusion Exists
Confusion exists because dictionary meanings of both these words are almost the same:
Concurrent: existing, happening, or done at the same time (dictionary.com)
Parallel: very similar and often happening at the same time (Merriam-Webster).
Yet the way they are used in computer science and programming are quite different. Here is my interpretation:
Concurrency: Interruptability
Parallelism: Independentability
So what do I mean by above definitions?
I will clarify with a real world analogy. Let’s say you have to get done 2 very important tasks in one day:
Get a passport
Get a presentation done
Now, the problem is that task-1 requires you to go to an extremely bureaucratic government office that makes you wait for 4 hours in a line to get your passport. Meanwhile, task-2 is required by your office, and it is a critical task. Both must be finished on a specific day.
Case 1: Sequential Execution
Ordinarily, you will drive to passport office for 2 hours, wait in the line for 4 hours, get the task done, drive back two hours, go home, stay awake 5 more hours and get presentation done.
Case 2: Concurrent Execution
But you’re smart. You plan ahead. You carry a laptop with you, and while waiting in the line, you start working on your presentation. This way, once you get back at home, you just need to work 1 extra hour instead of 5.
In this case, both tasks are done by you, just in pieces. You interrupted the passport task while waiting in the line and worked on presentation. When your number was called, you interrupted presentation task and switched to passport task. The saving in time was essentially possible due to interruptability of both the tasks.
Concurrency, IMO, can be understood as the "isolation" property in ACID. Two database transactions are considered isolated if sub-transactions can be performed in each and any interleaved way and the final result is same as if the two tasks were done sequentially. Remember, that for both the passport and presentation tasks, you are the sole executioner.
Case 3: Parallel Execution
Now, since you are such a smart fella, you’re obviously a higher-up, and you have got an assistant. So, before you leave to start the passport task, you call him and tell him to prepare first draft of the presentation. You spend your entire day and finish passport task, come back and see your mails, and you find the presentation draft. He has done a pretty solid job and with some edits in 2 more hours, you finalize it.
Now since, your assistant is just as smart as you, he was able to work on it independently, without needing to constantly ask you for clarifications. Thus, due to the independentability of the tasks, they were performed at the same time by two different executioners.
Still with me? Alright...
Case 4: Concurrent But Not Parallel
Remember your passport task, where you have to wait in the line?
Since it is your passport, your assistant cannot wait in line for you. Thus, the passport task has interruptability (you can stop it while waiting in the line, and resume it later when your number is called), but no independentability (your assistant cannot wait in your stead).
Case 5: Parallel But Not Concurrent
Suppose the government office has a security check to enter the premises. Here, you must remove all electronic devices and submit them to the officers, and they only return your devices after you complete your task.
In this case, the passport task is neither independentable nor interruptible. Even if you are waiting in the line, you cannot work on something else because you do not have the necessary equipment.
Similarly, say the presentation is so highly mathematical in nature that you require 100% concentration for at least 5 hours. You cannot do it while waiting in line for passport task, even if you have your laptop with you.
In this case, the presentation task is independentable (either you or your assistant can put in 5 hours of focused effort), but not interruptible.
Case 6: Concurrent and Parallel Execution
Now, say that in addition to assigning your assistant to the presentation, you also carry a laptop with you to passport task. While waiting in the line, you see that your assistant has created the first 10 slides in a shared deck. You send comments on his work with some corrections. Later, when you arrive back home, instead of 2 hours to finalize the draft, you just need 15 minutes.
This was possible because presentation task has independentability (either one of you can do it) and interruptability (you can stop it and resume it later). So you concurrently executed both tasks, and executed the presentation task in parallel.
Let’s say that, in addition to being overly bureaucratic, the government office is corrupt. Thus, you can show your identification, enter it, start waiting in line for your number to be called, bribe a guard and someone else to hold your position in the line, sneak out, come back before your number is called, and resume waiting yourself.
In this case, you can perform both the passport and presentation tasks concurrently and in parallel. You can sneak out, and your position is held by your assistant. Both of you can then work on the presentation, etc.
Back to Computer Science
In computing world, here are example scenarios typical of each of these cases:
Case 1: Interrupt processing.
Case 2: When there is only one processor, but all executing tasks have wait times due to I/O.
Case 3: Often seen when we are talking about map-reduce or hadoop clusters.
Case 4: I think Case 4 is rare. It’s uncommon for a task to be concurrent but not parallel. But it could happen. For example, suppose your task requires access to a special computational chip that can be accessed through only processor-1. Thus, even if processor-2 is free and processor-1 is performing some other task, the special computation task cannot proceed on processor-2.
Case 5: also rare, but not quite as rare as Case 4. A non-concurrent code can be a critical region protected by mutexes. Once it is started, it must execute to completion. However, two different critical regions can progress simultaneously on two different processors.
Case 6: IMO, most discussions about parallel or concurrent programming are basically talking about Case 6. This is a mix and match of both parallel and concurrent executions.
Concurrency and Go
If you see why Rob Pike is saying concurrency is better, you have to understand what the reason is. You have a really long task in which there are multiple waiting periods where you wait for some external operations like file read, network download. In his lecture, all he is saying is, “just break up this long sequential task so that you can do something useful while you wait.” That is why he talks about different organizations with various gophers.
Now the strength of Go comes from making this breaking really easy with go keyword and channels. Also, there is excellent underlying support in the runtime to schedule these goroutines.
But essentially, is concurrency better than parallelism?
Are apples better than oranges?
I like Rob Pike's talk: Concurrency is not Parallelism (it's better!)
(slides)
(talk)
Rob usually talks about Go and usually addresses the question of Concurrency vs Parallelism in a visual and intuitive explanation! Here is a short summary:
Task: Let's burn a pile of obsolete language manuals! One at a time!
Concurrency: There are many concurrent decompositions of the task! One example:
Parallelism: The previous configuration occurs in parallel if there are at least 2 gophers working at the same time.
To add onto what others have said:
Concurrency is like having a juggler juggle many balls. Regardless of how it seems, the juggler is only catching/throwing one ball per hand at a time. Parallelism is having multiple jugglers juggle balls simultaneously.
Say you have a program that has two threads. The program can run in two ways:
     Concurrency                     Concurrency + parallelism
  (Single-Core CPU)                      (Multi-Core CPU)
       ___                                  ___ ___
      |th1|                                |th1|th2|
      |   |                                |   |___|
      |___|___                             |   |___
          |th2|                            |___|th2|
       ___|___|                             ___|___|
      |th1|                                |th1|
      |___|___                             |   |___
          |th2|                            |   |th2|
In both cases we have concurrency from the mere fact that we have more than one thread running.
If we ran this program on a computer with a single CPU core, the OS would be switching between the two threads, allowing one thread to run at a time.
If we ran this program on a computer with a multi-core CPU then we would be able to run the two threads in parallel - side by side at the exact same time.
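A small Python sketch of the same contrast, using worker processes (to sidestep Python's GIL) so that one worker mimics the single-core picture, where the two pieces of work cannot overlap, and two workers mimic the multi-core picture, where they run side by side. busy_work and the loop size are just stand-ins for real CPU-bound work:

# Run the same two units of CPU-bound work with one worker (no overlap)
# and with two workers (side by side).
import time
from concurrent.futures import ProcessPoolExecutor

def busy_work(label):
    total = 0
    for i in range(10_000_000):   # purely CPU-bound loop
        total += i * i
    return label

def run_with(workers):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(busy_work, ["th1", "th2"]))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"1 worker  (one at a time): {run_with(1):.2f}s")
    print(f"2 workers (in parallel):   {run_with(2):.2f}s")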
Concurrency: If two or more problems are solved by a single processor.
Parallelism: If one problem is solved by multiple processors.
Imagine learning a new programming language by watching a video tutorial. You need to pause the video, apply what has been said in code, then continue watching. That's concurrency.
Now you're a professional programmer. And you enjoy listening to calm music while coding. That's Parallelism.
As Andrew Gerrand said in GoLang Blog
Concurrency is about dealing with lots of things at once. Parallelism
is about doing lots of things at once.
Enjoy.
I will try to explain with an interesting and easy to understand example. :)
Assume that an organization organizes a chess tournament where 10 players (with equal chess-playing skills) will challenge a professional champion chess player. And since chess is a 1:1 game, the organizers have to conduct the 10 games in a time-efficient manner so that they can finish the whole event as quickly as possible.
Hopefully the following scenarios will easily describe multiple ways of conducting these 10 games:
1) SERIAL - let's say that the professional plays with each person one by one i.e. starts and finishes the game with one person and then starts the next game with the next person and so on. In other words, they decided to conduct the games sequentially. So if one game takes 10 mins to complete then 10 games will take 100 mins, also assume that transition from one game to other takes 6 secs then for 10 games it will be 54 secs (approx. 1 min).
so the whole event will approximately complete in 101 mins (WORST APPROACH)
2) CONCURRENT - let's say that the professional plays his turn and moves on to the next player, so all 10 players are playing simultaneously, but the professional player is not with two players at a time; he plays his turn and moves on to the next person. Now assume the professional player takes 6 sec to play his turn and the transition time of the professional player between two players is also 6 sec, so the total transition time to get back to the first player will be 1 min (10x6sec). Therefore, by the time he is back to the first person with whom the event was started, 2 mins have passed (10xtime_per_turn_by_champion + 10xtransition_time = 2mins)
Assuming that all players take 45 sec to complete their turn, then based on 10 mins per game from the SERIAL event, the no. of rounds before a game finishes should be 600/(45+6) = 11 rounds (approx.)
So the whole event will approximately complete in 11xtime_per_turn_by_player_&_champion + 11xtransition_time_across_10_players = 11x51 + 11x60sec= 561 + 660 = 1221sec = 20.35mins (approximately)
SEE THE IMPROVEMENT from 101 mins to 20.35 mins (BETTER APPROACH)
3) PARALLEL - let's say the organizers get some extra funds and thus decide to invite two professional champion players (both equally capable), divide the same set of 10 players (challengers) into two groups of 5 each, and assign one group to each champion. Now the event is progressing in parallel in these two sets, i.e. at least two players (one in each group) are playing against the two professional players in their respective groups.
However, within a group the professional player will take on one player at a time (i.e. sequentially), so without any calculation you can easily deduce that the whole event will approximately complete in 101/2 = 50.5 mins
SEE THE IMPROVEMENT from 101 mins to 50.5 mins (GOOD APPROACH)
4) CONCURRENT + PARALLEL - In the above scenario, let's say that the two champion players play concurrently (see the 2nd point) with the 5 players in their respective groups, so now games across groups are running in parallel, but within a group they are running concurrently.
So the games in one group will approximately complete in 11xtime_per_turn_by_player_&_champion + 11xtransition_time_across_5_players = 11x51 + 11x30 = 561 + 330 = 891sec = about 15mins (approximately)
So the whole event (involving two such groups running in parallel) will approximately complete in about 15mins
SEE THE IMPROVEMENT from 101 mins to about 15 mins (BEST APPROACH)
NOTE: in the above scenario, if you replace the 10 players with 10 similar jobs and the two professional players with two CPU cores, then the following ordering (by total time taken) will remain true:
SERIAL > PARALLEL > CONCURRENT > CONCURRENT+PARALLEL
(NOTE: this order might change for other scenarios as this ordering highly depends on inter-dependency of jobs, communication needs between jobs and transition overhead between jobs)
Concurrent program execution comes in 2 types: non-parallel concurrent programming and parallel concurrent programming (also known as parallelism).
The key difference is that, to the human eye, threads in non-parallel concurrency appear to run at the same time, but in reality they don't. In non-parallel concurrency, threads rapidly switch and take turns using the processor through time-slicing.
In parallelism, there are multiple processors available, so multiple threads can run on different processors at the same time.
Reference: Introduction to Concurrency in Programming Languages
Simple example:
Concurrent is: "Two queues accessing one ATM machine"
Parallel is: "Two queues and two ATM machines"
Parallelism is the simultaneous execution of processes on multiple cores of a CPU or on multiple CPUs (on a single motherboard).
Concurrency is when parallelism is achieved on a single core/CPU by using scheduling algorithms that divide the CPU's time (time-slicing). Processes are interleaved.
Units:
1 or many cores in a single CPU (pretty much all modern day processors)
1 or many CPUs on a motherboard (think old school servers)
1 application is 1 program (think Chrome browser)
1 program can have 1 or many processes (think each Chrome browser tab is a process)
1 process can have 1 or many threads from 1 program (Chrome tab playing a Youtube video in 1 thread, another thread spawned for the comments section, another for user login info)
Thus, 1 program can have 1 or many threads of execution
1 process is thread(s)+allocated memory resources by OS (heap, registers, stack, class memory)
They solve different problems. Concurrency solves the problem of having scarce CPU resources and many tasks. So, you create threads or independent paths of execution through code in order to share time on the scarce resource. Up until recently, concurrency has dominated the discussion because of CPU availability.
Parallelism solves the problem of finding enough tasks and appropriate tasks (ones that can be split apart correctly) and distributing them over plentiful CPU resources. Parallelism has always been around of course, but it's coming to the forefront because multi-core processors are so cheap.
concurrency:
multiple execution flows with the potential to share resources
Ex:
two threads competing for an I/O port.
parallelism:
splitting a problem into multiple similar chunks.
Ex:
parsing a big file by running two processes, one on each half of the file.
Concurrency => When multiple tasks are performed in overlapping time periods with shared resources (potentially maximizing the resources utilization).
Parallel => when a single task is divided into multiple simple independent sub-tasks which can be performed simultaneously.
Concurrency vs Parallelism
Rob Pike in 'Concurrency Is Not Parallelism'
Concurrency is about dealing with lots of things at once.
Parallelism is about doing lots of things at once.
Concurrency - handles several tasks at once
Parallelism - handles several threads at once
(image: my vision of concurrency and parallelism)
If at all you want to explain this to a 9-year-old.
Think of it as servicing queues, where a server can only serve the 1st job in a queue.
1 server, 1 job queue (with 5 jobs) -> no concurrency, no parallelism (only one job is being serviced to completion; the next job in the queue has to wait till the serviced job is done, and there is no other server to service it)
1 server, 2 or more different queues (with 5 jobs per queue) -> concurrency (since the server is sharing time with all the 1st jobs in the queues, equally or weighted), still no parallelism since at any instant there is one and only one job being serviced.
2 or more servers, one queue -> parallelism (2 jobs done at the same instant) but no concurrency (the server is not sharing time; the 3rd job has to wait till one of the servers completes.)
2 or more servers, 2 or more different queues -> concurrency and parallelism
In other words, concurrency is sharing time to complete a job. It MAY take the same time to complete the job, but at least it gets started early. The important thing is that jobs can be sliced into smaller jobs, which allows interleaving.
Parallelism is achieved with just more CPUs, servers, people, etc. that run in parallel.
Keep in mind, if the resources are shared, pure parallelism cannot be achieved, but this is where concurrency has its best practical use: taking up another job that doesn't need that resource.
I really like Paul Butcher's answer to this question (he's the writer of Seven Concurrency Models in Seven Weeks):
Although they’re often confused, parallelism and concurrency are
different things. Concurrency is an aspect of the problem domain—your
code needs to handle multiple simultaneous (or near simultaneous)
events. Parallelism, by contrast, is an aspect of the solution
domain—you want to make your program run faster by processing
different portions of the problem in parallel. Some approaches are
applicable to concurrency, some to parallelism, and some to both.
Understand which you’re faced with and choose the right tool for the
job.
In electronics serial and parallel represent a type of static topology, determining the actual behaviour of the circuit. When there is no concurrency, parallelism is deterministic.
In order to describe dynamic, time-related phenomena, we use the terms sequential and concurrent. For example, a certain outcome may be obtained via a certain sequence of tasks (eg. a recipe). When we are talking with someone, we are producing a sequence of words. However, in reality, many other processes occur in the same moment, and thus concur to the actual result of a certain action. If a lot of people are talking at the same time, concurrent talks may interfere with our sequence, but the outcomes of this interference are not known in advance. Concurrency introduces indeterminacy.
The serial/parallel and sequential/concurrent characterization are orthogonal. An example of this is in digital communication. In a serial adapter, a digital message is temporally (i.e. sequentially) distributed along the same communication line (eg. one wire). In a parallel adapter, this is divided also on parallel communication lines (eg. many wires), and then reconstructed on the receiving end.
Let us imagine a game with 9 children. If we arrange them in a chain, give a message to the first and receive it at the end, we would have serial communication. Several words compose the message, which forms a sequence of communication units.
I like ice-cream so much. > X > X > X > X > X > X > X > X > X > ....
This is a sequential process reproduced on a serial infrastructure.
Now, let us imagine dividing the children into groups of 3. We divide the phrase into three parts: we give the first to the child at the head of the line on our left, the second to the child at the head of the center line, and so on.
I like ice-cream so much. > I like > X > X > X > .... > ....
> ice-cream > X > X > X > ....
> so much > X > X > X > ....
This is a sequential process reproduced on a parallel infrastructure (although still partially serialized).
In both cases, supposing there is a perfect communication between the children, the result is determined in advance.
If there are other persons that talk to the first child at the same time as you, then we will have concurrent processes. We do not know which process will be considered by the infrastructure, so the final outcome is not determined in advance.
I'm going to offer an answer that conflicts a bit with some of the popular answers here. In my opinion, concurrency is a general term that includes parallelism. Concurrency applies to any situation where distinct tasks or units of work overlap in time. Parallelism applies more specifically to situations where distinct units of work are evaluated/executed at the same physical time. The raison d'etre of parallelism is speeding up software that can benefit from multiple physical compute resources. The other major concept that fits under concurrency is interactivity. Interactivity applies when the overlapping of tasks is observable from the outside world. The raison d'etre of interactivity is making software that is responsive to real-world entities like users, network peers, hardware peripherals, etc.
Parallelism and interactivity are almost entirely independent dimensions of concurrency. For a particular project developers might care about either, both or neither. They tend to get conflated, not least because the abomination that is threads gives a reasonably convenient primitive to do both.
A little more detail about parallelism:
Parallelism exists at very small scales (e.g. instruction-level parallelism in processors), medium scales (e.g. multicore processors) and large scales (e.g. high-performance computing clusters). Pressure on software developers to expose more thread-level parallelism has increased in recent years, because of the growth of multicore processors. Parallelism is intimately connected to the notion of dependence. Dependences limit the extent to which parallelism can be achieved; two tasks cannot be executed in parallel if one depends on the other (Ignoring speculation).
There are lots of patterns and frameworks that programmers use to express parallelism: pipelines, task pools, aggregate operations on data structures ("parallel arrays").
A little more detail about interactivity:
The most basic and common way to do interactivity is with events (i.e. an event loop and handlers/callbacks). For simple tasks events are great. Trying to do more complex tasks with events gets into stack ripping (a.k.a. callback hell; a.k.a. control inversion). When you get fed up with events you can try more exotic things like generators, coroutines (a.k.a. Async/Await), or cooperative threads.
For the love of reliable software, please don't use threads if what you're going for is interactivity.
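For example, a tiny Python sketch of coroutine-style interactivity (async/await): several simulated downloads overlap on a single thread because each one hands control back to the event loop while it waits. asyncio.sleep stands in for a real network wait:

# Three "downloads" overlap in time on one thread via an event loop.
import asyncio

async def fetch(name, delay):
    print(f"{name}: started")
    await asyncio.sleep(delay)        # yield to the event loop while "waiting"
    print(f"{name}: finished after {delay}s")
    return name

async def main():
    results = await asyncio.gather(
        fetch("page-a", 2),
        fetch("page-b", 1),
        fetch("page-c", 3),
    )
    print("all done:", results)

if __name__ == "__main__":
    asyncio.run(main())

The whole thing takes about 3 seconds rather than 6, even though only one thread is involved: interactivity and concurrency without any parallelism.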
Curmudgeonliness
I dislike Rob Pike's "concurrency is not parallelism; it's better" slogan. Concurrency is neither better nor worse than parallelism. Concurrency includes interactivity which cannot be compared in a better/worse sort of way with parallelism. It's like saying "control flow is better than data".
From the book Linux System Programming by Robert Love:
Concurrency, Parallelism, and Races
Threads create two related but distinct phenomena: concurrency and
parallelism. Both are bittersweet, touching on the costs of threading
as well as its benefits. Concurrency is the ability of two or more
threads to execute in overlapping time periods. Parallelism is
the ability to execute two or more threads simultaneously.
Concurrency can occur without parallelism: for example, multitasking
on a single processor system. Parallelism (sometimes emphasized as
true parallelism) is a specific form of concurrency requiring multiple processors (or a single processor capable of multiple engines
of execution, such as a GPU). With concurrency, multiple threads make
forward progress, but not necessarily simultaneously. With
parallelism, threads literally execute in parallel, allowing
multithreaded programs to utilize multiple processors.
Concurrency is a programming pattern, a way of approaching problems.
Parallelism is a hardware feature, achievable through concurrency.
Both are useful.
This explanation is consistent with the accepted answer. Actually the concepts are far simpler than we think. Don't think them as magic. Concurrency is about a period of time, while Parallelism is about exactly at the same time, simultaneously.
Concurrency is the generalized form of parallelism. For example, a parallel program can also be called concurrent, but the reverse is not true.
Concurrent execution is possible on a single processor (multiple threads, managed by a scheduler or thread pool).
Parallel execution is not possible on a single processor; it requires multiple processors. (One process per processor.)
Distributed computing is also a related topic, and it can also be called concurrent computing, but the reverse is not true, just as with parallelism.
For details read this research paper
Concepts of Concurrent Programming
I really liked this graphical representation from another answer - I think it answers the question much better than a lot of the above answers
(image: Parallelism vs Concurrency)
When two threads are running in parallel, they are both running at the same time. For example, if we have two threads, A and B, then their parallel execution would look like this:
CPU 1: A ------------------------->
CPU 2: B ------------------------->
When two threads are running concurrently, their execution overlaps. Overlapping can happen in one of two ways: either the threads are executing at the same time (i.e. in parallel, as above), or their executions are being interleaved on the processor, like so:
CPU 1: A -----------> B ----------> A -----------> B ---------->
So, for our purposes, parallelism can be thought of as a special case of concurrency
Source: Another answer here
Hope that helps.
"Concurrency" is when there are multiple things in progress.
"Parallelism" is when concurrent things are progressing at the same time.
Examples of concurrency without parallelism:
Multiple threads on a single core.
Multiple messages in a Win32 message queue.
Multiple SqlDataReaders on a MARS connection.
Multiple JavaScript promises in a browser tab.
Note, however, that the difference between concurrency and parallelism is often a matter of perspective. The above examples are non-parallel from the perspective of (observable effects of) executing your code. But there is instruction-level parallelism even within a single core. There are pieces of hardware doing things in parallel with the CPU and then interrupting the CPU when done. The GPU could be drawing to the screen while your window procedure or event handler is being executed. The DBMS could be traversing B-Trees for the next query while you are still fetching the results of the previous one. The browser could be doing layout or networking while your Promise.resolve() is being executed. Etc, etc...
So there you go. The world is as messy as always ;)
The simplest and most elegant way of understanding the two in my opinion is this. Concurrency allows interleaving of execution and so can give the illusion of parallelism. This means that a concurrent system can run your Youtube video alongside you writing up a document in Word, for example. The underlying OS, being a concurrent system, enables those tasks to interleave their execution. Because computers execute instructions so quickly, this gives the appearance of doing two things at once.
Parallelism is when such things really are in parallel. In the example above, you might find the video processing code is being executed on a single core, and the Word application is running on another. Note that this means that a concurrent program can also be in parallel! Structuring your application with threads and processes enables your program to exploit the underlying hardware and potentially be done in parallel.
Why not have everything be parallel then? One reason is that concurrency is a way of structuring programs and is a design decision to facilitate separation of concerns, whereas parallelism is often used in the name of performance. Another is that some things fundamentally cannot be fully done in parallel. An example of this would be adding two things to the back of a queue: you cannot insert both at the same time. Something must go first and the other behind it, or else you mess up the queue. Although we can interleave such execution (and so we get a concurrent queue), you cannot do it in parallel.
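A minimal Python sketch of that queue point: two threads produce items concurrently, but a lock forces the actual tail insertions to happen one at a time (the names and counts are arbitrary):

# Two producer threads append to a shared list; the lock serializes the
# insertions at the tail, so they interleave but never overlap.
import threading

shared_queue = []
tail_lock = threading.Lock()

def producer(name, count):
    for i in range(count):
        with tail_lock:               # only one insertion at a time
            shared_queue.append(f"{name}-{i}")

if __name__ == "__main__":
    threads = [threading.Thread(target=producer, args=(n, 5)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared_queue)   # A and B items interleaved, but the queue is intact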
Hope this helps!
"Concurrent" is doing things -- anything -- at the same time. They could be different things, or the same thing. Despite the accepted answer, which is lacking, it's not about "appearing to be at the same time." It's really at the same time. You need multiple CPU cores, either using shared memory within one host, or distributed memory on different hosts, to run concurrent code. Pipelines of 3 distinct tasks that are concurrently running at the same time are an example: Task-level-2 has to wait for units completed by task-level-1, and task-level-3 has to wait for units of work completed by task-level-2. Another example is concurrency of 1-producer with 1-consumer; or many-producers and 1-consumer; readers and writers; et al.
"Parallel" is doing the same things at the same time. It is concurrent, but furthermore it is the same behavior happening at the same time, and most typically on different data. Matrix algebra can often be parallelized, because you have the same operation running repeatedly: For example the column sums of a matrix can all be computed at the same time using the same behavior (sum) but on different columns. It is a common strategy to partition (split up) the columns among available processor cores, so that you have close to the same quantity of work (number of columns) being handled by each processor core. Another way to split up the work is bag-of-tasks where the workers who finish their work go back to a manager who hands out the work and get more work dynamically until everything is done. Ticketing algorithm is another.
Not just numerical code can be parallelized. Files too often can be processed in parallel. In a natural language processing application, for each of the millions of document files, you may need to count the number of tokens in the document. This is parallel, because you are counting tokens, which is the same behavior, for every file.
In other words, parallelism is when same behavior is being performed concurrently. Concurrently means at the same time, but not necessarily the same behavior. Parallel is a particular kind of concurrency where the same thing is happening at the same time.
Terms for example will include atomic instructions, critical sections, mutual exclusion, spin-waiting, semaphores, monitors, barriers, message-passing, map-reduce, heart-beat, ring, ticketing algorithms, threads, MPI, OpenMP.
Gregory Andrews' work is a top textbook on it: Multithreaded, Parallel, and Distributed Programming.
Concurrency can involve tasks that run simultaneously or not (they can indeed be run on separate processors/cores, but they can just as well be run in "ticks"). What is important is that concurrency always refers to doing a piece of one greater task. So basically it's part of some computation. You have to be smart about what you can do simultaneously, what you cannot, and how to synchronize.
Parallelism means that you're just doing some things simultaneously. They don't need to be part of solving one problem. Your threads can, for instance, each solve a single problem. Of course synchronization stuff also applies, but from a different perspective.
Parallelism:
Having multiple threads do similar tasks which are independent of each other in terms of the data and resources they require to do so. Eg: the Google crawler can spawn thousands of threads and each thread can do its task independently.
Concurrency:
Concurrency comes into the picture when you have shared data or shared resources among the threads. In a transactional system this means you have to synchronize the critical sections of the code using techniques like locks, semaphores, etc.
Explanation from this source was helpful for me:
Concurrency is related to how an application handles multiple tasks it
works on. An application may process one task at a time
(sequentially) or work on multiple tasks at the same time
(concurrently).
Parallelism on the other hand, is related to how an application
handles each individual task. An application may process the task
serially from start to end, or split the task up into subtasks which
can be completed in parallel.
As you can see, an application can be concurrent, but not parallel.
This means that it processes more than one task at the same time, but
the tasks are not broken down into subtasks.
An application can also be parallel but not concurrent. This means
that the application only works on one task at a time, and this task
is broken down into subtasks which can be processed in parallel.
Additionally, an application can be neither concurrent nor parallel.
This means that it works on only one task at a time, and the task is
never broken down into subtasks for parallel execution.
Finally, an application can also be both concurrent and parallel, in
that it both works on multiple tasks at the same time, and also breaks
each task down into subtasks for parallel execution. However, some of
the benefits of concurrency and parallelism may be lost in this
scenario, as the CPUs in the computer are already kept reasonably busy
with either concurrency or parallelism alone. Combining it may lead to
only a small performance gain or even performance loss.
Concurrent programming regards operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang is perhaps the most promising upcoming language for highly concurrent programming.
Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk is perhaps the most promising language for high-performance parallel programming on shared-memory computers (including multicores).
Copied from my answer: https://stackoverflow.com/a/3982782
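A minimal Python sketch of the fork/join shape described above: the parent splits the data into independent subtasks, runs them in parallel, and only continues once every subtask has finished, which keeps the control flow deterministic. The split into four chunks is arbitrary:

# Fork a fixed set of subtasks, run them in parallel, join on all of them.
from concurrent.futures import ProcessPoolExecutor

def subtask(chunk):
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]         # four independent subtasks

    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(subtask, chunks))  # join: wait for all results

    print("total:", sum(partials))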

Splitting a computational workload: where is it possible or impossible

I program although I am not a computer scientist. Therefore I would like to see if I understood correctly the challenge in splitting a workload. Is the following the correct way to think about it?
Specifically, is the following statement (1) correct?
(1) If A(X_a) + A(X_b) + A(X_c) + ... = B(X_a,X_b,X_c, ...) = Y is an equation that is being computed
whether or not it can be computed more rapidly from the perspective of the computer by assigning parts of the equation to be computed by individual threads at the same time depends on the following
if X_m changes when A(X_n) changes for m not equal to n, then dividing the workload for that particular computation gives less of a performance gain, and if this is true for every combination of m and n in the system, then no performance gain from multithreading over single threading is possible.
Or, in other words, do I understand correctly that the presence of linked variables decreases the ability to multithread successfully? Because X_b and X_c depend on A(X_a), it bottlenecks the process: the other threads know A but have to wait for the first thread to give an output before they have instructions to execute. So the parts of an instruction that would otherwise be easy to break up cannot be worked on simultaneously, and the computation takes as much time as one thread doing each part of the calculation one after the other, instead of several threads working at once and summing the results, in the order they complete, on another thread.
(2) Or is there a way around the above bottleneck? For example, if this bottleneck is known in advance, the first thread can start early and store in memory the results of A(X_n) for all n that bottleneck the operation, and then split the workload efficiently, one A(X_i) to the i-th thread. But to do this, the first thread would have to predict in some way when the calculation B(X_a,X_b,X_c, ...) must be executed BEFORE B(X_a,X_b,X_c, ...) is actually executed; otherwise it would run into the bottleneck.
[EDIT: To clarify, in context of NWP's answer. If the clarification is too long / unclear, please leave a comment, and I'll make a few graphics in LaTeX to shorten the question writeup.]
Suppose the longest path in the program "compute I" is 5 units of time in the example. If you know this longest path, and the running system can anticipate (based on past frequency of execution) when this program "compute I" will be run in the future, subprogram "compute B->E" (which does not depend on anything else but is a proper subset of the longest path of program "compute I") may be executed in advance. The result is stored in memory prior to the user requesting "compute I".
If so, is the max speedup considered to be 9/4? The B->E result is ready, so the other threads do not have to wait for it. Or is the max speedup for "compute I" still considered to be 9/5?
The anticipation program run before has a cost, but this cost may be spread over each instance of execution of "compute I". If the anticipation program has 15 steps, but the program "compute I" is run typically 100 times per each execution of the anticipation program, and all steps cost equally, do we simply say the max speedup possible in "compute I" is therefore 9/(5 - 1 + 15/100)?
The speedup possible now appears to depend not only on the number of threads and the longest path, but also on the memory available to store precalculations and on how far in advance another program can anticipate that "compute I" will be run and precalculate proper subprograms of it. Another program "compute X" may have the same longest-path length as "compute I", but the system cannot anticipate that "compute X" will be run as far in advance as "compute I". How do we weight the speedup achieved (i) at the expense of increasing memory to store precalculations, and (ii) by the fact that the execution of some programs can be anticipated further in advance than that of others, allowing the bottleneck to be precalculated and the longest path cut down that way?
But if the longest path can be dynamically cut down by improving predictive precalculation of subprograms and by greater memory for storing the results of precalculation, can bottlenecks be considered at all as determining the ultimate upper boundary to the speedup from splitting a computational workload?
From the linked-variables dependency / graph bottleneck perspective, the ultimate upper boundary of the speedup from multithreading a program "compute I" appears to be determined by the longest subprogram (other subprograms depend on it / wait for it). But from the dynamics perspective, where the whole system is running before and after the program "compute I" is executed as a part of it, sufficient predictability of the timing of future executions of "compute I" and the ability to store more and more precalculations of its independent subprograms can cut the length of all subprograms of "compute I" down to 1 unit, meaning it could possibly achieve a speedup of 9/1 = 9, given sufficient predictability and memory.
Which perspective is the correct one for estimating the upper bounds to speedup by multithreading? (A program run in a system running a long time with sufficient memory seems to have no limit to multithreading, whereas if it is looked at by itself, there is a very definite fixed limit to the speedup.)
Or is the question of the ability to cut down the longest path by anticipation and partial precalculation a moot one, because the speedup in that case varies with the user's decision to execute a program in a way that can be predicted, so the upper boundary on multithreading speedup due to anticipation cannot be known to a program writer or system designer and should be ignored / not relied upon to exist?
I do not quite understand which things depend on what from your description, but I can give you some theory. There is Amdahl's law, which gives you an upper bound on the speedup you can achieve based on how parallelizable a given algorithm is, assuming you have enough processors. If you can parallelize 50% of the calculation you can get a maximum speedup of 2x. 95% parallelization gives you a maximum speedup of 20x. To figure out how much speedup you can get you need to know how much of your problem can be parallelized. This can be done by drawing a graph of the things you need to do and which depend on what, and figuring out the longest path. Example:
In this example the longest path would be B->E->F->H->I. All blocks are assumed to take the same time to execute. So there are 9 blocks, the longest path is 5 blocks, so your maximum achievable speedup is 9/5 = 1.8x. In practice you need to consider that your computer can only run a limited number of threads in parallel, that some blocks take longer than others and that there is a cost involved in creating threads and using appropriate locking mechanisms to prevent data races. Those can be added to the graph by giving each block a cost and finding the longest path based on adding cost including the cost of threading mechanisms. Although this method only gives you an upper bound it tends to be very humbling. I hope this allows you to draw a graph and find the answer.
EDIT:
I forgot to say that Amdahl's law compares executing the code with a single thread to executing the code with an infinite number of threads with no overhead. If you make the multithreaded version execute different code than the single threaded version you are no longer bound by Amdahl's law.
With enough memory and time you can calculate the results for all possible inputs and then just do a lookup based on a given input to find the result. Such a system would get higher speedup because it does not actually calculate anything and is not bound by Amdahl's law. If you manage to optimize B->E to take zero units of time the longest path becomes 3 and there are only 8 nodes giving you a maximum speedup of 8/3 = 2.66x which is better than the 1.8x of before. That is only the speedup possibility by multithreading though, actually the first version takes 4 time units and the second version 3 time units. Optimizing code can give you more speedup than multithreading. The graph can still be useful though. Assuming you do not run out of cores the graph can tell you which parts of your program are worth optimizing and which are not. Assuming you do run out of cores the graph can tell you which paths should be prioritized. In my example I calculate A, B, C and D simultaneously and therefore need a quadcore to make it work. If I move C down in time to execute in parallel to E and make D run parallel to H a dualcore will suffice for the same speedup of 1.8x.
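For reference, a tiny Python sketch of the two bounds used in this answer: Amdahl's law for a given parallelizable fraction, and the total-work over longest-path ratio from the block graph:

# Amdahl's law and the critical-path bound from the example graph.
def amdahl_speedup(parallel_fraction, workers):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def critical_path_speedup(total_blocks, longest_path_blocks):
    return total_blocks / longest_path_blocks

if __name__ == "__main__":
    print(amdahl_speedup(0.50, workers=10**9))   # ~2x, as stated above
    print(amdahl_speedup(0.95, workers=10**9))   # ~20x
    print(critical_path_speedup(9, 5))           # 1.8x for the 9-block graph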

Algorithm to optimize # threads used in a calculation

I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm.
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the one before it, add a thread for the new generation, but if it had fewer, use one fewer, with an obvious lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
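A rough Python sketch of that adjustment rule; run_generation is a hypothetical hook that runs one generation of CalculateSomeData with the given thread count and returns its elapsed time, and the starting guess and direction are arbitrary:

# Keep moving the thread count in the same direction while generation times
# improve, reverse direction when they get worse, never go below one thread.
def next_thread_count(threads, direction, last_time, prev_time):
    if last_time <= prev_time:      # improved (or tied): keep going this way
        new_direction = direction
    else:                           # got worse: turn around
        new_direction = -direction
    return max(1, threads + new_direction), new_direction

def tune(run_generation, generations, start_threads=4):
    threads, direction = start_threads, +1
    prev_time = float("inf")        # "generation 0" took effectively forever
    for _ in range(generations):
        last_time = run_generation(threads)
        threads, direction = next_thread_count(threads, direction,
                                               last_time, prev_time)
        prev_time = last_time
    return threads

As noted above, this will tend to oscillate among a few values near the optimum, so for a long run you may want to detect the oscillation and lock in the central value.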
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
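A small Python sketch of that layout: one worker per core pulling from a queue that holds far more tasks than there are workers. The squaring stands in for a real small task; for genuinely CPU-bound work in Python you would use processes rather than threads, but the queueing structure is the same:

# Many small tasks in a queue, one worker thread per core.
import os
import queue
import threading

def worker(tasks, results):
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return                            # queue drained, worker exits
        results.append(item * item)           # stand-in for a real small task
                                              # (list.append is atomic in CPython)

if __name__ == "__main__":
    tasks = queue.Queue()
    for i in range(1000):                     # far more tasks than cores
        tasks.put(i)

    results = []
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(os.cpu_count() or 1)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(len(results), "tasks completed")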

(When) are parallel sorts practical and how do you write an efficient one?

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead is smaller than the parallelization gains.
const smallestToParallelize = 500;

// someConstant is a separate, much smaller cutoff below which plain
// insertion sort beats quick sort.
void quickSort(T)(T[] array) {
    if(array.length < someConstant) {
        insertionSort(array);
        return;
    }
    size_t pivotPosition = partition(array);
    if(array.length >= smallestToParallelize) {
        // Sort left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) Are they useful in the real world?
Of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.
think UI code where you need to sort a large dynamic list of strings based on context
think something like a barnes-hut n-bodies sim where you need to sort the particles
2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system you'll need to look at the scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).
I'm no expert but... here is what I'd look at:
First of all, I've heard that as a rule of thumb, algorithms that look at small bits of the problem from the start tend to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
skip the myTask.wait(); at the local level and rather have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the more the smallestToParallelize constant will need to be increased.
Another limiting factor which you identified is shared memory access, or contention over the memory bus. Since the memory bus can only satisfy a certain number of memory accesses per second; having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch process will be more or less on the way which at least will result in less wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput capacity so you'll need to have organized the data such that so much data is in the core's exclusive caches that it can spend a fair amount of time there before having to write updated data.
The code and data need to be organized according to the concept of cache lines (fetch units of 64 bytes each), the smallest-sized unit in a cache. For two cores, the work needs to be organized so that the memory system does half as much work per core (assuming 100% scalability) as when only one core was working and the work hadn't been organized. For four cores, a quarter as much, and so on. It's quite a challenge, but by no means impossible; it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scaleable applications is ameliorated by how much the developer can influence the compiler in its actual machine code generation. For interpreted languages there will be so much memory work going on by the interpreter that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An External Merge Sort would also work really well with an unknown amount of threads. You just split the work-load arbitrarily, and give each chunk of n elements to a thread whenever there is one idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting
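A minimal Python sketch of that shape: sort independent chunks in parallel, then stitch the sorted runs together with a cheap k-way merge. The chunk count and data are arbitrary, and a true external sort would write the runs to disk instead of keeping them in memory:

# Sort chunks in parallel processes, then merge the sorted runs.
import heapq
import random
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    data = [random.random() for _ in range(400_000)]
    n_chunks = 4
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(data[(n_chunks - 1) * size:])    # last chunk takes the rest

    with ProcessPoolExecutor(max_workers=n_chunks) as pool:
        runs = list(pool.map(sorted, chunks))      # sort each chunk in parallel

    merged = list(heapq.merge(*runs))              # the cheap "stitching" step
    assert merged == sorted(data)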
