About multithreading, concurrency, and parallelism

Recently I was confused about the concepts of multithreading, concurrency, and parallelism. To reduce the confusion, I tried to organize my understanding of them and drew the conclusions below. My question is:
Is there a misunderstanding or anything wrong in the conclusions below?
References I took can be found here.
1. Concurrency and parallelism belong to different categories.
It is not a choice between concurrent and parallel. A system is either concurrent or not, and either parallel or not.
For example,
Not concurrent (Sequential) / Not parallel
Not concurrent (Sequential) / Parallel
Concurrent / Not parallel
Concurrent / Parallel
2. Parallelism is not a subset of concurrency.
3. How does threading or multithreading relate to concurrency and parallelism?
The definition of a thread clarifies this. A thread is a "unit of execution flow". This "execution flow" can be managed independently by a scheduler, which is typically a part of the operating system.
Having a thread means having one unit of execution flow.
Having multiple threads (Multithreading) means having multiple units of execution flow.
And,
Having multiple units of execution flow means having multiple things making progress, which is the definition of concurrency.
And,
Multiple units of execution flow are run by time slicing in a single-core hardware environment.
Multiple units of execution flow are run in parallel in a multi-core hardware environment.
4. Is multithreading concurrent or parallel?
Multithreading, or having multiple units of execution flow, is concurrent.
Multithreading itself is just having multiple units of execution flow. This has nothing to do with parallelism.
How the operating system deals with multiple units of execution flow relates to parallelism.
Parallelism is achieved by the operating system and the hardware environment.
"Code can be concurrent, but not parallel." (Parallelism implies concurrency but not the other way round right?, stackexchange)
Detailed description will be truly appreciated.

Parallelism Refers to any system in which a single application can make use of more computing hardware than a single CPU can provide. There are a number of different types of parallel computing architecture, but when people say "parallelism" they often are talking about one in particular...
...A Symmetric MultiProcessing (SMP) system is a computer with one memory system, and two or more traditional CPUs that have equal access to it. Most modern workstations, most mobile devices, and many server systems* are SMP.
Multithreading is a model of concurrent computing.** A computer scientist might tell you that two threads run concurrently when the order in which the operations they perform are interleaved is not strictly determined by the program itself. A software developer is more likely to say that two threads run concurrently with each other when both threads have been started and neither of them has finished.
One way to achieve parallelism in an application running on an SMP system is to use multiple concurrent threads.
* Some servers are NUMA, which is a close cousin to SMP. In a NUMA system, the CPUs all access the same memory system, just like in SMP, except that each CPU "owns" part of the physical memory space, and it can access its own memory locations more quickly than it can access memory locations that are owned by other CPUs.
** There are other models of concurrent computing. Some, such as Actors, are used in production software. Others are mostly of academic interest.
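The "software developer" definition above (both threads started, neither finished) can be made concrete with a small sketch. This is only an illustration, assuming Python's standard `threading` module; the function name `run_two_concurrent_threads` is made up for this example. A two-party barrier can only be passed while both threads are alive at once, so passing it shows that the two threads ran concurrently, whether or not they ever ran in parallel.

```python
import threading


def run_two_concurrent_threads():
    """Start two threads that can only proceed while both are alive."""
    barrier = threading.Barrier(2)
    overlapped = []

    def worker(name):
        # The barrier releases only when BOTH threads have started and
        # neither has finished, i.e. while they are concurrent.
        barrier.wait(timeout=5)
        overlapped.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(overlapped)
```

Nothing here says anything about how many cores were used; that part is decided by the operating system and the hardware.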

Hardware Multithreading and Simultaneous Multithreading(SMT)

I'm reading Multithreading (computer architecture) - Wiki, aka hardware threading, and I'm trying to understand the second paragraph:
(p2): Where multiprocessing systems include multiple complete processing units in one or more cores, multithreading aims to increase utilization of a single core by using thread-level parallelism, as well as instruction-level parallelism.
while the link to thread-level parallelism says:
(Link): Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as ...
which is not so useful... So I read task parallelism above, since I guess TLP is a subtype of it:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors.
Question: If thread-level parallelism is task parallelism, and task parallelism is for parallelization across multiple processors, how does "increase utilization of a single core by using thread-level parallelism" work?
Guessing: I guess for TLP, it should mean across multiple logical processors, i.e. hardware threads in the perspective of OS, correct?
Another minor issue is that for my first link, Multithreading:
In computer architecture, multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to execute multiple processes or threads concurrently, supported by the operating system.
And in (p2) it aims to "increase utilization of a single core by using thread-level parallelism"? What a contradiction.
I don't think we should rely on wiki definitions; the wording there is not accurate enough to merit searching for contradictions.
First, I would describe task parallelism as a form of parallelism inherent to some algorithm or problem, where there could be a functional decomposition into multiple tasks with different nature, that can run concurrently. Alternative forms of parallelism include for example spatial or data decomposition, where the problem can be broken into different parts of the data or the input layout (e.g., array ranges, matrix tiles, image parts...).
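To make the two decompositions above concrete, here is a minimal sketch, assuming Python's standard `concurrent.futures` module; the function names are invented for this example. `data_decomposition` splits one array into ranges and applies the same operation to each part, while `task_decomposition` runs two different kinds of work over the same input. Note that in CPython the global interpreter lock keeps these threads from running Python bytecode truly in parallel, so this shows the shape of each decomposition rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def data_decomposition(values, parts=4):
    """Same operation (sum) applied to different ranges of the data."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(sum, chunks))


def task_decomposition(values):
    """Different operations (min and max) run as separate tasks."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lowest = pool.submit(min, values)
        highest = pool.submit(max, values)
        return lowest.result(), highest.result()
```

For example, `data_decomposition(list(range(101)))` returns 5050 and `task_decomposition([3, 1, 4, 1, 5])` returns `(1, 5)`.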
Thread-level parallelism is a different taxonomy, it is any form of parallelism that can be extracted for utilization by a multi-threaded system. It requires the decomposition to be coarse grained enough to allow the different threads to run independently (otherwise the synchronization overhead required would make it useless).
The alternative for that is for example ILP (instruction level parallelism) which is when a single thread context can extract parallelism within the code by running over a deep out-of-order machine that can schedule based on readiness. This allows more fine-grained parallelism and less programmer involvement usually, but limits the parallelism to the depth of the OOO window.
On a related topic: be careful not to confuse simultaneous execution with concurrent execution.
Thread-level parallelism can be used by extracting task-level parallelism or other forms of algorithm decomposition from the code. It can then be run on a system that is single-core (preemptive), or multi-threaded. The latter type can be achieved through multi-core systems, simultaneous multi-threading or both (common processors usually have many cores, and many of them support SMT on top of that).
I think my intuition should be correct, it is either:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments.
should be updated to:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
to include the possibility of hyper threading.
Or there is no update, but task parallelism focuses on cross-processor parallelism, while TLP should mean:
Thread-level parallelism is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
again, to include the possibility of hyper threading.
Useful resource:
https://en.wikipedia.org/wiki/Hyper-threading
https://www.youtube.com/watch?v=wnS50lJicXc
https://en.wikipedia.org/wiki/Simultaneous_multithreading
especially this line:
The name multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.).
So for the minor issue, see Concurrent computing - #Introduction - p1:
The concept of concurrent computing is frequently confused with the related but distinct concept of parallel computing,[2][3] although both can be described as "multiple processes executing during the same period of time". In parallel computing, execution occurs at the same physical instant: for example, on separate processors of a multi-processor machine, with the goal of speeding up computations—parallel computing is impossible on a (one-core) single processor, as only one computation can occur at any instant (during any single clock cycle).

Differences between Threading in Nvidia's GPUs and CPUs

I am trying to understand the difference between the threading techniques used by Nvidia GPUs and normal (multithreaded) CPUs. In particular, my two questions are:
Which part of the system is responsible for thread scheduling, and according to which criteria are threads scheduled?
Are threads processed synchronously?
CUDA cores and CPU cores are literally completely different things; the name is more of a marketing thing.
What do you mean by "responsible for thread scheduling"? It is mostly both software and hardware. For instance, the CPU itself has little to do with the actual thread scheduling, but it provides the functionality needed to implement a thread scheduler as part of the OS. So the scheduling parameters are defined by the software. Hence you should adapt your question to a specific OS.
One thing the CPU provides is so-called hardware threads. Each hardware thread allows the "parallel" execution of one software thread. (Note: with Hyper-Threading, the two hardware threads of a core share its execution resources, so the execution is not fully parallel.) The scheduler distributes all running threads across these hardware threads.
This is basically a MIMD-System.
The scheduling on graphics cards is way more complicated. In short:
You have a few thousand CUDA cores, but in contrast to the CPU you cannot assign a unique application to each of them. Threads are scheduled in groups (so-called warps), and all threads of a warp execute the same instruction at the same time on a group of CUDA cores.
This is called SIMT
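A toy model may help picture the lockstep execution described above. This is plain Python, not CUDA, and it does not reflect how real hardware is built; the function name `warp_abs` and the "issued" counter are invented for illustration (on Nvidia hardware a warp has 32 threads, here a "warp" is simply however many values are passed in). Every lane sees the same instruction; a per-lane mask decides which lanes actually commit a result.

```python
def warp_abs(values):
    """Toy lockstep model: one lane per value, one shared instruction stream."""
    lanes = list(values)
    issued = 0

    # Instruction 1: every lane evaluates the same comparison.
    mask = [v < 0 for v in lanes]
    issued += 1

    # Instruction 2 (the 'then' side of the branch): issued once for the
    # whole group if any lane needs it; lanes outside the mask sit idle.
    if any(mask):
        lanes = [-v if active else v for v, active in zip(lanes, mask)]
        issued += 1

    # The 'else' side is empty for abs(), so nothing more is issued.
    return lanes, issued
```

With `warp_abs([-1, 2, -3, 4])` the group issues two instructions and returns `[1, 2, 3, 4]`; with `warp_abs([1, 2])` no lane takes the branch, so only the comparison is issued.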

What are all the different types of parallelism?

I am trying to understand more about parallelism, but I've noticed there are a lot of different terms out there and some seem to mean the same thing while others have a notable difference. So, what are all the different types of parallelism, how do they differ from each other, and do any have specific applications or purposes?
(To keep this more focused, I'm hoping for an answer that provides clarity to all the terminology associated with parallelism, including terms not listed below; technical comparisons between each different type would be nice, but will probably result in this question becoming off-topic - then again, I don't really know, hence the question).
Note:
this is not a question about concurrency and goes beyond the "simple" question: "what is parallelism?", although a clarifying definition might be warranted.
First, I have taken notice of the difference between parallelism and threading, but some of the differences between the following terms are still confusing.
To add clarity to my question, here is a list of terms that I have found that are related to parallelism: parallel computing, parallel processing, multithreading, multiprocessing, multicore programming, Hyper-threading (Intel), Simultaneous MultiThreading (SMT), Switch-on-Event MultiThreading. (If possible, definitions or references to definitions for each of these terms would also be appreciated).
My very specific question: what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism? (and any other x-level parallelism)?
In a multi-core processor, can parallelism occur within a single core? Is that what Hyper-threading is, and does that require a single core having, for example, two ALU's that can be used in parallel?
Last one: is there a difference between hardware vs software parallelism, aside from the obvious distinction that one happens in hardware while the other in software?
Related resources:
- Process vs Thread,
- Parallelism on a GPU,
- Hyper-threading,
- Concurrency vs Parallelism,
- Hyper-threading and gaming.
Q: What is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
While the subject matter is indeed immensely wide, I would try to have this view, even at a risk of making many opponents present their objections of simplifying the subject matter ( but StackOverflow format does not substitute other sources of complete reference, does it ? ):
A: the main difference is WHAT / WHO / HOW is responsible for keeping things executing in true-[PARALLEL]
Instruction Level Parallelism - ILP - is the simplest case, the CPU-architecture has designed and "hardwired" this particular form of hardware-based parallelism. Having processors with ILP4 ( 4 instructions executed at once ), or having processors with per-instruction based width of this form of parallel-instruction execution, be it ILP2 for some instructions but ILP1 for some others, again the silicon architecture decides, what can happen indeed in parallel at the instruction level. Some awkward surprises may arise from further details, as memory-controller channels may block ILP-mode in cases, where REG/MEMORY uops will have to wait for a free channel to access the instructed MEMORY.
hardware-threads are the next level of granularity. Given a CPU-core is declared to support two hardware threads, these are the only streams-of-code execution, that may flow in parallel ( if no O/S request comes to instantiate and schedule another thread to get executed, mapped onto one of the available CPU-core hardware-threads ). From the user-perspective, there are O/S tools that permit one to explicitly "nail"-down a process-level-PID / thread-level-PID affinity onto a particular CPU-core(s) and thus limit or even eliminate any "disturbance", so as to move from a "just"-[CONCURRENT] flow of code-execution closer to a true-[PARALLEL]one.
We will knowingly skip all the crowds of threads, that are just a tool for latency-masking ( be it on the SIMT / SMX warp-wide GPU-scheduler, or the more relaxed, MIMT O/S-kernel driven multithreading )
software operated distributed-systems parallelism is the one that ought be mentioned for completeness, but it has the principally highest adverse costs from a need to invent, define, implement and operate the setup / coordination in software ( which all causes overheads to grow remarkably ), in the sense of the re-formulated Amdahl's Law right due to a need to somehow design and keep operational the non-native orchestration of both the distributed process execution and all the dataflow, that it is dependent on.
hardware-based true-[PARALLEL] systems are at the highest level of orchestration, where both the silicon ( like the InMOS' network of meshed Transputers ) and also the programming language ( like the InMOS' occam or occam-pi ) provide the carefully engineered, conceptually crafted true-[PARALLEL] code-execution.
- MIMT: Multiple Instruction Multiple Threads, a non-restricted thread-execution fabric / policy, where any thread may and does issue a different instruction to the processor for execution, as opposed to SIMT
- SIMT: Single Instruction Multiple Threads, typically a GPU Streaming Multiprocessor code-execution architecture
- SMX: Streaming Multiprocessor eXecution unit, typically a GPU SIMT building block, onto which the GPU-kernel code-units could be directed ( addressed ) for being TaskQueue-scheduled and later executed, as coordinated by the WARP-wide SIMT-code scheduler
what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
In 1 (thread-level parallelism), different CPU cores execute different streams of instructions.
In 2 (instruction-level parallelism), a single CPU core executes different instructions from a single instruction stream in parallel (these instructions are either consecutive instructions in the stream, or otherwise very close to each other).
3 (process-level parallelism) is the same as 1; the difference is cosmetic. It’s just the default settings about which memory pages are shared across threads and which aren’t. But these settings are user-adjustable with process creation flags, shared memory sections, dynamic libraries, and other system APIs; that’s why, at the lower level, the difference between processes and threads is not a big deal.
and any other x-level parallelism
Another important one is SIMD level parallelism. For this one, CPU applies same instruction to multiple operands stored in special wide registers. With SSE we have 128-bit wide registers, and we can e.g. multiply a vector of 4 single-precision floating-point numbers in one register by another 4 values in another register, making 4 products in parallel, with a single mulps instruction. ARM NEON is similar, also 128 bit registers, the instruction to multiply 4 floats by 4 floats is vmul.f32. AVX operates on 256-bit registers so it can multiply 8 floats at once, with a single vmulps instruction.
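The register arithmetic above (128 bits = 4 single-precision floats) can be checked with a small sketch. Python has no SSE intrinsics, so this only models the lane-wise semantics of a 4-wide multiply using the standard `struct` module; the function name `mulps_model` is made up here, and a real `mulps` does all four products in one hardware instruction rather than in a loop.

```python
import struct


def mulps_model(reg_a: bytes, reg_b: bytes) -> bytes:
    """Model a 4-lane float multiply on two 128-bit (16-byte) 'registers'."""
    a = struct.unpack("<4f", reg_a)  # 4 lanes x 32 bits
    b = struct.unpack("<4f", reg_b)
    # Each lane is multiplied independently of the others.
    return struct.pack("<4f", *(x * y for x, y in zip(a, b)))


def pack4(values):
    """Pack four floats into one 16-byte 'register'."""
    return struct.pack("<4f", *values)
```

Usage: `struct.unpack("<4f", mulps_model(pack4([1, 2, 3, 4]), pack4([2, 2, 2, 2])))` yields `(2.0, 4.0, 6.0, 8.0)`, and each packed register is exactly 16 bytes long.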
can parallelism occur within a single core?
Yes.
Is that what Hyper-threading is
Yes, also it’s what instruction-level parallelism is, and SIMD parallelism, too.
does that require a single core having, for example, two ALU's that can be used in parallel?
Modern CPUs have more than two per core, but HT was introduced in P4 and it’s not a requirement. The profit from HT is not just loading multiple ALUs, it’s also using the core while a thread is waiting for data to arrive from caches or from system RAM. And also, using the core while it's stalled because of the data dependency between nearby instructions. HT allows a CPU core to compute something else on another hardware thread while it’s waiting, therefore improving ALU utilization. Without HT, the core would likely just sit and wait for hundreds of cycles in case of RAM latency, or for dozens of cycles in case of data dependency latency.
is there a difference between hardware vs software parallelism
When you have a single hardware thread and multiple OS threads that compute stuff, only 1 thread will be running at any given time. The rest of the threads will be waiting. The OS will periodically (often ~50-100Hz) switch which one’s running, with the goal to give all threads a fair slice of CPU time. You can call that software parallelism if you want, but I wouldn’t call such thing parallel at all.
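The switching described above can be sketched as a toy round-robin scheduler. This is a simulation with Python generators, not how a real OS scheduler is implemented; the names `task` and `round_robin` are invented, and each `next()` call stands in for one time slice on the single hardware thread.

```python
from collections import deque


def task(name, steps):
    """A unit of work that can be paused after every step."""
    for i in range(steps):
        yield f"{name}{i}"


def round_robin(tasks):
    """Give each runnable task one slice in turn; only one runs at a time."""
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))  # run one time slice
            queue.append(current)        # back of the line
        except StopIteration:
            pass                         # task finished, drop it
    return trace
```

`round_robin([task("A", 2), task("B", 3)])` returns `["A0", "B0", "A1", "B1", "B2"]`: both tasks make progress over the same period, yet at no instant are two of them running.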

What is the difference between multicore and concurrent programming

Can anyone help me out I am working on a presentation and would like to include a bit about - 'The difference between multicore and concurrent programming', I have googled a bit but not turning up many good descriptions, any help appreciated! :)
Thanks,
Eamonn
Concurrent (occurring or existing simultaneously) implies that different code MAY execute at the exact same cycle. It means that things can possibly happen in parallel if multiple processors or a processor with multiple cores is available and the program is crafted correctly. Just adding threads does not imply concurrent execution.
The reason I say MAY and possibly is that any time the program's separate threads need to share volatile/mutable state, other threads that need access to that state cannot continue executing and will have to wait their turn to access that state, and things start happening serially again.
Typically this is implemented in a single program as more than one thread executing code concurrently at the same exact cycle as another thread, given that there is no resource contention as described above. This requires multiple physical processors or cores. Other models run multiple heavyweight OS processes that can execute concurrently.
Concurrent programming is very hard to do correctly with mutable shared state.
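The "wait their turn" behaviour around shared mutable state can be sketched as follows, assuming Python's standard `threading` module; the function name `count_with_lock` is made up for this example. The lock makes the read-modify-write on the shared counter happen one thread at a time, which is exactly the serialization described above.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Several threads bump one shared counter; the lock serializes access."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # other threads wait here for their turn
                counter += 1  # read-modify-write on shared state

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place the result is always `num_threads * increments`. Without it, the increment is not guaranteed to be atomic and updates can be lost.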
You can write a concurrent program that runs serially on a single-core processor, but scales up to execute more things at the same time when more processors or cores, or even multiple processors with multiple cores, are present.
You can also cause single-threaded programs to appear concurrent on a multi-core / multi-processor system if they can operate on independent ranges of input data at the same time. Example: on a dual-core machine, a single-threaded 3D rendering program can run as 2 separate instances, the first rendering all the odd frames and the second rendering all the even frames, as long as they don't try to share any mutable resources.
Multi-core means that a single CPU has multiple Processor cores that can execute threads or processes concurrently and typically appears as multiple processors to mainstream operating systems.
It does NOT imply that programs that are single threaded gain any concurrency behaviors or benefits from the additional processor cores available.
Concurrent programming is broader: it just refers to writing software that will run "concurrently", i.e. more than one thing will happen at a time.
"Multi-core" programming really refers to a specific subset of concurrent programming, in which you are targeting multiple available CPU cores on a specific machine. This is the most common form of concurrent programming (typically a single process running on a single computer), but still only one form of concurrent programming.
You can do concurrent programming on a machine that has only a single CPU core. The operating system provides the illusion that more than one thread is running at the same time, it rapidly switches back-and-forth between them.
A machine with multiple cores simply needs to do this context switching less often, since two threads can run at the same time on two cores. It is only a bit special because threading bugs can make your life difficult much quicker: the odds that two threads try to access a shared memory location at the same time are much higher.
At a high level, multi-core is an attribute of the processor chip in your computer. Multi-core means it has multiple processing cores. There are several types of multi-processor computers: the old-style supercomputers with thousands of computers connected via Ethernet, systems with more than one processor (like 2 Pentium 4s), and contemporary multi-core systems where every processor package has multiple processing cores (like Intel i7). The third type is often called multi-core or Chip Multiprocessor (CMP).
Concurrent programming is an attribute of software. Concurrent programming is about writing code which is split into multiple tasks that can execute concurrently if processors are available. While concurrent programs do leverage multi-core, concurrent programming is broader in two dimensions:
Concurrent programs can run on a single core or multiple cores.
Concurrent programs can be used on any type of multi-processors I mentioned above.
Thus, to summarize:
Concurrent programming is about software that can use multiple processors if available. Those processors can be on the same chip (multi-core or Chip Multiprocessor) or on different chips (often known as SMP). You can have systems where you put two multi-core chips in the same system, making it a CMP and an SMP at the same time. Concurrent programming will work for that as well.
Concurrent programming regards operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang, F# asynchronous workflows and Scala's Akka library are perhaps the most promising approaches to highly concurrent programming.
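The web crawler described above can be sketched without any network access. Everything here is hypothetical: `FAKE_WEB` is an in-memory stand-in for real pages, and `fetch` and `crawl` are names invented for this example, built on Python's standard `concurrent.futures`. Responses are handled in whatever order they complete, so the control flow can differ between runs, while the final set of visited pages does not.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical in-memory "web": page -> outgoing links (no real network).
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}


def fetch(url):
    """Stand-in for an HTTP download; returns the links found on the page."""
    return FAKE_WEB.get(url, [])


def crawl(start, workers=4):
    visited = {start}  # only touched by the calling thread
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(fetch, start)}
        while pending:
            # Handle responses as they become available, in any order.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for link in future.result():
                    if link not in visited:
                        visited.add(link)
                        pending.add(pool.submit(fetch, link))
    return visited
```

Keeping `visited` in the calling thread is a deliberate choice in this sketch: the worker threads only download, so the shared set needs no lock.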
Multicore programming is a special case of parallel programming. Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug than concurrent programs. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk is perhaps the most promising approach for high-performance parallel programming on multicores and it has been adopted in both Intel's Threading Building Blocks and Microsoft's Task Parallel Library (in .NET 4).
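The fork-join pattern described above (spawn child tasks, continue only when all have finished) can be sketched with a dense matrix multiply. To be clear, this is a plain row-by-row split, not Strassen's algorithm, and the names `row_times_matrix` and `parallel_matmul` are invented for illustration. It uses Python's standard `ThreadPoolExecutor`; in CPython the global interpreter lock limits real speedup for pure-Python arithmetic, so the sketch shows the deterministic structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, b):
    """One child task: compute a single row of the product."""
    return [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]


def parallel_matmul(a, b, workers=4):
    """Fork one task per row of `a`, then join before returning."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the output is
        # deterministic no matter which task finishes first.
        return list(pool.map(lambda row: row_times_matrix(row, b), a))
```

For example, `parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]` on every run.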

Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?

I was very confused but the following thread cleared my doubts:
Multiprocessing, Multithreading,HyperThreading, Multi-core
But it addresses the queries from the hardware point of view. I want to know how these hardware features are mapped to software?
One thing that is obvious is that there is no difference between multiprocessor (= multi-CPU) and multicore other than that in multicore all CPUs reside on one chip (die), whereas in a multiprocessor all CPUs are on their own chips and connected together.
So, multicore/multiprocessor systems are capable of executing multiple processes (firefox, mediaplayer, googletalk) at the "same time" (unlike context switching these processes on a single-processor system). Right?
If that is correct, I'm clear so far. But the confusion arises when multithreading comes into the picture.
Multithreading "is for" parallel processing. Right?
What are the elements involved in multithreading inside the CPU? Is there a diagram? For me to exploit the power of parallel processing of two independent tasks, what should the requirements of the CPU be?
When people talk about context switching of threads, I don't really get it, because if it is context switching of threads then it is not parallel processing. The threads must be executed "strictly simultaneously". Right?
My notion of multithreading is that:
Consider a system with a single CPU. When the CPU context switches to the firefox process, (suppose) each tab of firefox is a thread and all the threads execute strictly at the same time, rather than one thread executing for some time and then another thread taking over until the context switch time arrives.
What happens if I run a multithreaded software on a processor which can't handle threads? I mean how does the cpu handle such software?
If everything is good so far, now question is HOW MANY THREADS? It must be limited by hardware, I guess? If hardware can support only 2 threads and I start 10 threads in my process. How would cpu handle it? Pros/Cons? From software engineering point of view, while developing a software that will be used by the users in wide variety of systems, Then how would I decide should I go for multithreading? if so, how many threads?
First, try to understand the concept of 'process' and 'thread'. A thread is a basic unit for execution: a thread is scheduled by operating system and executed by CPU. A process is a sort of container that holds multiple threads.
Yes, either multi-processing or multi-threading is for parallel processing. More precisely, to exploit thread-level parallelism.
Okay, multi-threading could mean hardware multi-threading (one example is HyperThreading). But, I assume that you just say multithreading in software. In this sense, CPU should support context switching.
Context switching is needed to implement multi-tasking even in a physically single core by time division.
Say there are two physical cores and four very busy threads. In this case, two threads are just waiting until they will get the chance to use CPU. Read some articles related to preemptive OS scheduling.
The number of threads that can physically run at the same time is identical to the number of logical processors. You are asking about a general thread scheduling problem from the OS literature, such as round-robin.
I strongly suggest you study the basics of operating systems first, then move on to multithreading issues. It seems like you're still unclear on key concepts such as context switching and scheduling. It will take a couple of months, but if you really want to be an expert in computer software, then you should know such basic concepts. Pick up any OS books and lecture slides.
Threads running on the same core are not technically parallel. They only appear to be executed in parallel, as the CPU switches between them very fast (for us, humans). This switch is what is called context switch.
Now, threads executing on different cores are executed in parallel.
Most modern CPUs have a number of cores; however, most modern OSes (Windows, Linux and friends) usually execute a much larger number of threads, which still causes context switches.
Even if no user program is executing, the OS itself still performs context switches for maintenance work.
This should answer 1-3.
About 4: basically, every processor can work with threads; it is much more a characteristic of the operating system. A thread is basically memory (optional), a stack and registers; once those are replaced, you are in another thread.
5: the number of threads is pretty high and is limited by OS. Usually it is higher than regular programmer can successfully handle :)
The number of threads is dictated by your program:
is it IO bound?
can the task be divided into a number of smaller tasks?
how small is the task? The task can be too small to make it worth spawning threads at all.
synchronization: if extensive synchronization is required, the penalty might be too heavy and you should reduce the number of threads.
Multiple threads are separate 'chains' of commands within one process. From CPU point of view threads are more or less like processes. Each thread has its own set of registers and its own stack.
The reason why you can have more threads than CPUs is that most threads don't need CPU all the time. Thread can be waiting for user input, downloading something from the web or writing to disk. While it is doing that, it does not need CPU, so CPU is free to execute other threads.
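The point about waiting threads can be sketched with sleeps standing in for blocking I/O. This assumes Python's standard `threading` and `time` modules; the function name `run_sleepers` is invented, and the exact timings depend on the machine. Ten threads that each wait 0.1 s finish in roughly 0.1 s of wall time rather than 1 s, because a waiting thread does not occupy the CPU.

```python
import threading
import time


def run_sleepers(count=10, delay=0.1):
    """Run `count` threads that each block for `delay` seconds."""

    def waiter():
        time.sleep(delay)  # stands in for a download or a disk write

    threads = [threading.Thread(target=waiter) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start  # wall-clock seconds
```

This holds even on a single core, which is why having far more threads than CPUs is normal for I/O-heavy programs.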
In your example, each tab of Firefox probably can even have several threads. Or they can share some threads. You need one for downloading, one for rendering, one for message loop (user input), and perhaps one to run Javascript. You cannot easily combine them because while you download you still need to react to user's input. However, download thread is sleeping most of the time, and even when it's downloading it needs CPU only occasionally, and message loop thread only wakes up when you press a button.
If you go to task manager you'll see that despite all these threads your CPU use is still quite low.
Of course if all your threads do some number-crunching tasks, then you shouldn't create too many of them as you get no performance benefit (though there may be architectural benefits!).
However, if they are mainly I/O bound then create as many threads as your architecture dictates. It's hard to give advice without knowing your particular task.
Broadly speaking, yeah, but "parallel" can mean different things.
It depends what tasks you want to run in parallel.
Not necessarily. Some (indeed most) threads spend a lot of time doing nothing. Might as well switch away from them to a thread that wants to do something.
The OS handles thread switching. It will delegate to different cores if it wants to. If there's only one core it'll divide time between the different threads and processes.
The number of threads is limited by software and hardware. Threads consume processor and memory in varying degrees depending on what they're doing. The thread management software may impose its own limits as well.
The key thing to remember is the separation between logical/virtual parallelism and real/hardware parallelism. With your average OS, a system call is performed to spawn a new thread. What actually happens (whether it is mapped to a different core, a different hardware thread on the same core, or queued into the pool of software threads) is up to the OS.
Parallel processing uses all the methods, not just multithreading.
Generally speaking, if you want real parallel processing, you need to perform it in hardware. Take the Niagara as an example: it has up to 8 cores, each capable of executing 4 threads in hardware.
Context switching is needed when there are more threads than the hardware can execute in parallel. Even then, when executed in series (switching from one thread to the next), they are considered concurrent because there is no guarantee on the order of switching. So, it may go T0, T1, T2, T1, T3, T0, T2 and so on. For all intents and purposes, from the program's point of view the threads behave as if they were parallel.
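To see that the switching order is not something you control, here is a hedged Python sketch (the names `record_schedule` and `work` are invented for the example). Each thread records its name a few times in a shared list. Every entry is guaranteed to show up, but the interleaving is up to the scheduler and may differ between runs; with work this short, the threads may even happen to run one after another.

```python
import threading


def record_schedule(thread_count: int = 4, steps: int = 3) -> list:
    order = []
    lock = threading.Lock()

    def work(name: str) -> None:
        for _ in range(steps):
            with lock:  # protect the shared list
                order.append(name)

    threads = [
        threading.Thread(target=work, args=(f"T{i}",))
        for i in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = record_schedule()
print(order)  # the interleaving is not guaranteed and may change between runs
```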
Time slicing.
That would be up to the OS.
Multithreading is the execution of more than one thread at a time. It can happen on both single-core and multicore systems. On a single-processor system, context switching makes it possible. Context switching in this environment refers to time slicing by the operating system, so do not get confused. The operating system is what controls the execution of other programs, and it allows only one program to execute on the CPU at a time. The frequency at which threads are switched in and out of the CPU determines how convincingly the system appears to run them in parallel.
In a multicore environment, multithreading occurs when each core executes a thread. Even on multicore, though, context switching can still occur on the individual cores.
I think the answers so far are pretty much to the point and give you a good basic context. In essence, say you have a quad-core processor, but each core is capable of executing 2 simultaneous threads.
Note that there is only a slight (or no) increase in speed if you run 2 simultaneous threads on 1 core compared with running the 1st thread and then the 2nd thread one after the other. However, each physical core adds speed to your general workflow.
Now, say you have a process running on your OS that has multiple threads (i.e. needs to run multiple things in "parallel") and has some kind of stack of tasks in a queue (or some other system with priority rules). Then software sends tasks to a queue and your processor attempts to execute them as fast as it can. Now you have 2 cases:
If your software supports multiprocessing, then tasks will be sent to any available processor (one that is idle or has just finished some other job, provided the job sent from your software is 1st in the queue).
If your software does not support multiprocessing, then all of your jobs will be done in a similar manner, but only by one of your cores.
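The queue-of-tasks setup described above can be sketched in Python roughly like this. It is only an illustration of the structure: `run_task_queue`, the squaring "job" and the `None` sentinel are my own choices, and which core runs each worker is decided by the OS and the language runtime, not by this code.

```python
import queue
import threading


def run_task_queue(numbers, worker_count: int = 4) -> list:
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            n = tasks.get()  # blocks until a task is available
            if n is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            with results_lock:
                results.append(n * n)  # the "job": square the number
            tasks.task_done()

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for n in numbers:
        tasks.put(n)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    tasks.join()  # wait until every queued item has been processed
    for w in workers:
        w.join()
    return sorted(results)


squares = run_task_queue(range(10))
print(squares)
```

Whichever worker is free takes the next task, which mirrors the first case; setting `worker_count=1` mirrors the second.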
I suggest reading the Wikipedia page on threads. The very first picture there already gives you a nice insight. :)

Resources