What is the difference between temporal multithreading and super-threading?

There are two terms:
Temporal multithreading: In fine-grained temporal multithreading, the main processor pipeline may contain multiple threads, with context switches effectively occurring between pipe stages (e.g., in the barrel processor). A barrel processor is a CPU that switches between threads of execution on every cycle.
Super-threading: a type of multithreading that enables different threads to be executed by a single processor without truly executing them at the same time. This qualifies it as time-sliced or temporal multithreading rather than simultaneous multithreading (SMT). It is motivated by the observation that the processor's functional units are occasionally left idle while executing instructions from one thread due to long-latency events. Super-threading seeks to make use of the otherwise unused processor cycles by executing instructions from another thread until the previous thread is ready to resume execution.
Is the main difference between TM and ST that temporal multithreading (fine-grained) uses C-slowing and switches between threads of execution on every cycle, whereas super-threading switches between threads not on every cycle but only when the processor's functional units are left idle while executing instructions from one thread due to long-latency events?
What is the difference between temporal multithreading (fine-grained) and super-threading?

Temporal multithreading can take the form of fine-grained or coarse-grained multithreading. Fine-grained multithreading switches contexts at a fixed, fine-grained interval (e.g., every cycle). Coarse-grained multithreading switches contexts on long-latency events (e.g., LLC cache misses).
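To make the two switching policies concrete, here is a toy C simulation of my own (not a model of any real hardware; the stall patterns are invented for illustration). It prints which thread occupies the pipeline in each cycle under each policy:

    #include <stdio.h>

    #define NCYCLES 12

    int main(void) {
        /* Invented per-cycle stall pattern for two threads:
           1 = the instruction in this cycle triggers a long-latency event
           (e.g., a cache miss). Indexing by global cycle is a simplification. */
        int stall[2][NCYCLES] = {
            {0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0},   /* thread 0 */
            {0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0},   /* thread 1 */
        };

        /* Fine-grained: round-robin, barrel-processor style. */
        printf("fine-grained:   ");
        for (int c = 0; c < NCYCLES; c++)
            printf("T%d ", c % 2);
        printf("\n");

        /* Coarse-grained: stay on one thread until it stalls, then switch. */
        printf("coarse-grained: ");
        int cur = 0;
        for (int c = 0; c < NCYCLES; c++) {
            printf("T%d ", cur);
            if (stall[cur][c])
                cur = 1 - cur;
        }
        printf("\n");
        return 0;
    }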
Simultaneous multithreading, on the other hand, does not have any notion of thread switching. Multiple threads can run concurrently.
A picture is worth a thousand words. Take a look at slides 5 to 7 here. It has pictures for all 3 methods and compares them nicely.
As other people have said, super-threading is not a common term and it seems similar to coarse-grain TM to me.

Related

How Tasks are scheduled in a multi-core processor

I've gotten confused about how tasks are scheduled on a multi-core processor. Different sources have different opinions, and importantly there isn't much documentation about the task-scheduling mechanism of a multi-core processor. Therefore, I decided to ask you a question.
I depicted a process that contains a process kernel thread and two user-level threads, and I provide pseudo-code for the processing logic.
The question is: how will this process be executed on a multi-core processing unit that contains 2 physical cores and 4 logical processors (each core has 2)? Assume there are no waiting processes and the CPU is assigned to the process completely.
I guess it works like below:
Note: PKT_C1_LP1 means process kernel thread is assigned to core 1 and logical processor 1
|--PKT_C1_LP1--1s--| |--T1_C1_LP1--1s--| |--TSK1_C1_LP1--1s--|
|--T2_C1_LP2--2s-----------| |--TSK2_C1_LP2--1s--|
----------- timeline ----------->
Update
Seems like the answer(s) to your question(s) will depend a lot on what OS and scheduler your system is running.
Because there are no waiting processes and there are enough resources, I believe that almost all scheduling algorithms in any OS will show insignificant differences here. However, for simplicity, let's say it is:
non-preemptive FCFS scheduling
Here's a timing diagram of the code that each thread needs to execute. This imagines a maximal case where each task immediately spawns a new thread. The green sections are infinitesimally short pieces of code (think "not-to-scale") but are basically just scheduling operations. The red sections are similarly short process EXIT and thread END scheduling operations. (I've omitted penalties associated with thread creation. And notice that worker threads do not END; they just go idle and stay in a thread pool.)
Basic Timing Diagram
Now the first thing you'll notice is that, because of the way tasks work, the second task can be executed on the same thread that scheduled it, because no more tasks are scheduled and the thread is only going to await that task. This has nothing to do with thread scheduling and everything to do with how tasks efficiently manage their pool of worker threads: it is application-level code, not OS-level code, that accomplishes this. The diagram below requires one fewer thread thanks to tasks.
Timing Diagram with smarter tasks
Now we can look at what the scheduler needs to do. We are still dealing with only logical processors. (The details of which core will execute which thread are complicated, so let's leave that out for the moment.) Here we see that we can naively execute each of these threads on its own processor.
Greedy usage of processors
It will likely be more efficient to execute the worker thread on one of the previous processors. They are idle when worker thread 1 needs to execute, so it makes more sense to reuse one of the previously allocated processors. Here task 1 code in worker thread 1 is shown executing on processor 2 (could also have been assigned to processor 1 because it is also free, but stay tuned for the next diagram and you'll see why I put it on processor 2).
Schedule thread to reuse a processor
And finally, we can construct the last version, which takes us to the most efficient scheduling. This hinges on optimizing the case where you create a thread and then immediately join it. Different operating systems try to optimize this case so that the newly created thread can run on the same processor. It means that creating the thread doesn't immediately schedule the new thread on a free processor and burn the cost of a context switch back to the thread that scheduled it. Instead, the new thread is scheduled when we block in our Join operation, or when the next clock interrupt occurs. If we can quickly get to our Join call before an interrupt triggers the scheduler (we're talking < 10 ms on typical operating systems for such things to be triggered by the clock chip), then the scheduling will happen more efficiently like this (below), where thread 2 can be scheduled to run on the same processor without a context switch. (Interestingly, Linux and Windows optimize this case differently.)
Final timing diagram
You'll notice (above) that this can now all execute on only two logical processors.
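As a small illustration of the create-then-immediately-join pattern just described, here is a minimal pthreads sketch of mine; whether the new thread actually runs on the creator's processor is entirely up to the OS, as noted above. Compile with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    static void *task(void *arg) {
        (void)arg;
        printf("task runs, ideally on the processor the creator just vacated\n");
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, task, NULL);
        /* Blocking in the join right after creation gives the scheduler the
           chance to run the new thread here without an extra context switch
           (OS-dependent behavior, as discussed above). */
        pthread_join(t, NULL);
        return 0;
    }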
Whether it is more efficient to run these on separate cores or different logical processors of the same core is a nuance of the operating system again that depends highly on virtual memory usage and also the hardware specs of the processor and its caches. Different operating systems will do different things here, too. And the details matter greatly. Non-uniform memory architecture would affect the decision too.
In the real world, the operating system may use heuristics to determine the best priority and placement for threads and processes. The real world answer is so much different and more nuanced than this "computer science" answer I've given and depends on the specific details.
Additional Reading/Viewing:
Windows and Linux: A Tale of Two Kernels - Tech-Ed 2004 (Older but excellent info)
Processes, Threads, and Jobs in the Windows Operating System
Scheduling: Introduction; and Multiprocessor Scheduling (Advanced)
Capacity Aware Scheduling

Is Simultaneous Multithreading (Hyperthreading) "true" multicore processing?

So what I am aware of is that Simultaneous Multithreading (Intel's Hyperthreading, for example) enables a single CPU core to efficiently manage several threads at once. And most explanations I find say it's like you have more than one core at your disposal. But what I'm wondering is whether this is what is actually going on at a low level (machine level)? Or is it more that, to the OS, it just looks like it is being operated on 2 cores, but in the end Simultaneous Multithreading just makes it much more efficient at going back and forth between two (or more) different threads, giving the illusion of having more than one core?
Simultaneous multithreading is defined in "Simultaneous Multithreading: Maximizing On-Chip Parallelism" (Dean M. Tullsen et al., 1995, PDF) as "a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle" ("issue" means initiation of execution; an alternative use of the term means entering into an instruction scheduler). "Simultaneous" refers to the issue of instructions from different threads at the same time, distinguishing SMT from fine-grained multithreading that rapidly switches between threads in execution (e.g., choosing each cycle which thread's instructions to execute) and from switch-on-event multithreading (which is more similar to OS-level context switches).
SMT implementations often interleave instruction fetch, decode, and commit, making those pipeline stages look more like those of a fine-grained multithreaded or non-multithreaded core. SMT exploits the fact that an out-of-order superscalar core already chooses dynamically among arbitrary instructions (within a window), together with the observation that execution resources are typically not fully used. (In-order SMT provides relatively greater benefits, since in-order execution lacks the latency hiding of out-of-order execution, but the pipeline control complexity is increased.)
A barrel processor (pure round-robin, fine-grained thread scheduling with nops issued for non-ready threads) with shared caches would look more like separate cores running at 1/thread_count of the clock frequency (with shared caches), since it lacks dynamic contention for execution resources. It is also arguable that having instructions from multiple threads in the processor pipeline at the same time represents parallel instruction processing; distinct threads can have instructions being processed (in different pipeline stages) at the same time. Even with switch-on-event multithreading, a cache miss can be processed in parallel with the execution of another thread, i.e., multiple instructions from another thread can be processed during the "processing" of a load instruction.
The distinction from OS-level context switching can be even more fuzzy if the ISA provides instructions that are not interrupt-atomic. For example, on x86 a timer interrupt can lead an OS to perform a context switch while a string instruction is in progress. In some sense, during the entire time slice of the other thread, the string instruction might be considered still to be "executing" since its operation was not completed. With hardware prefetching, some degree of forward progress of the earlier thread might, in theory, continue past the time when another thread starts running, so even a requirement of simultaneous activity in multiple threads might be satisfied. (If processing of long x86 string instructions was handed off to an accelerator, the instruction might run fully in parallel with another thread running on the core that initiated the instruction.)

fork vs thread on one single core

Imagine that I have two tasks, each of them needs 2 seconds to finish its job.
In this case, if I create two threads for each of them and my PC is single-core, this won't save any time. Am I right?
What if I use fork to create two processes (the machine is still single-core) and each process takes charge of one task? Can this save any time?
If not, I have a question:
On a current, modern machine (including multi-core), if I have several heavy tasks, which method should I use?
fork?
thread?
fork + thread, meaning create some processes where each process contains more than one thread?
Even with a single core, having two threads may speed up execution. If your routine is purely CPU-bound then two threads won't improve anything; indeed, the performance will be worse because of context-switching overhead. But if the routine has to wait for memory, disk or network (which is usually the case), then two threads will provide performance gains even with a single core.
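To illustrate that last point, here is a minimal C sketch of mine (sleep() stands in for any blocking wait, and the 2-second figure is taken from the question). Compile with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* sleep() stands in for a blocking wait on disk, network, etc. */
    static void *io_task(void *name) {
        sleep(2);
        printf("%s done\n", (const char *)name);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, io_task, "task 1");
        pthread_create(&t2, NULL, io_task, "task 2");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;   /* finishes in about 2 s, not 4, even on one core */
    }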
About fork vs. threads: threads require fewer resources, so in principle they should be the first choice. But there are two caveats: 1) maybe you want to be able to terminate a parallel routine, which is much safer to do with processes than with threads, and 2) some languages (notably Python and Ruby) provide pseudo-thread libraries which do not use real threads but switch between routines using the same thread. This simulated threading can be very useful, for example when waiting for network requests, but it must be taken into account that it's not real multithreading.
Amendment: As commented by Sergio Tulentsev, Ruby and Python do indeed provide real threads and not only coroutines.
"job takes 2 seconds" - If those 2 seconds are fully occupying the CPU (100% load), you won't gain anything with either thread nor fork if you have no cores to share. The single-core CPU is simply busy and you cannnot make it more busy.
In case these 2 seconds include waiting time (for example on I/O, storage, whatever), you could gain something, even with a single core. The amount of gain depends on the ratio of CPU working vs. CPU waiting and on the overhead of your multiprocessing. Most non-trivial programs have at least some amount of "CPU waiting", so multithreading is often useful even on single-core CPUs.
The overhead for setting up a coroutine and context switching can be considerable and needs to be measured. Obviously, the shorter the run time of your actual task is, the larger the ratio of overhead (for setting up a thread or process, etc.) will be, and the smaller your multiprocessing gain.
Traditionally, threads used to have considerably less overhead than processes (after all, that was why they were invented), but the "considerably" has perhaps vanished over time: on modern Linux systems, processes are only a tad slower to set up than threads (actually, both use the same system calls). You should rather decide between thread and process based on the requirements for protection (or sharing) of data than on execution speed.
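For completeness, here is a hedged fork counterpart of the threaded sketch above (my own illustration; again, sleep() stands in for the 2-second job):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void task(const char *name) {
        sleep(2);                  /* stand-in for the 2-second job */
        printf("%s done\n", name);
    }

    int main(void) {
        for (int i = 1; i <= 2; i++) {
            pid_t pid = fork();
            if (pid < 0) exit(1);  /* fork failed */
            if (pid == 0) {        /* child: run one task, then exit */
                char name[16];
                snprintf(name, sizeof name, "task %d", i);
                task(name);
                _exit(0);
            }
        }
        wait(NULL);                /* reap both children */
        wait(NULL);
        return 0;                  /* also about 2 s of wall-clock time */
    }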

Why are OS threads considered expensive?

There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
The OS provides both of these functions for free. Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
I want to amend Tudor's answer, which is a good starting point. There are two main overheads of threads:
Starting and stopping them. Involves creating a stack and kernel objects. Involves kernel transitions and global kernel locks.
Keeping their stack around.
(1) is only a problem if you are creating and stopping them all the time. This is commonly solved using thread pools. I consider this problem to be practically solved. Scheduling a task on a thread pool usually does not involve a trip to the kernel, which makes it very fast. The overhead is on the order of a few interlocked memory operations and a few allocations.
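To show what "scheduling a task on a thread pool" means in practice, here is a minimal fixed-size thread-pool sketch of mine in C (assumptions: a bounded queue that never overflows in this toy run, and no error handling). Compile with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define QCAP 64          /* toy bound; never overflows in this run */

    typedef void (*job_fn)(void *);
    typedef struct { job_fn fn; void *arg; } job;

    static job jobs[QCAP];
    static int head, tail, count, shutting_down;
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t notify = PTHREAD_COND_INITIALIZER;

    static void *worker(void *unused) {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&mtx);
            while (count == 0 && !shutting_down)
                pthread_cond_wait(&notify, &mtx);  /* sleep until work arrives */
            if (count == 0 && shutting_down) {
                pthread_mutex_unlock(&mtx);
                return NULL;
            }
            job j = jobs[head];
            head = (head + 1) % QCAP;
            count--;
            pthread_mutex_unlock(&mtx);
            j.fn(j.arg);                           /* run the task on a reused thread */
        }
    }

    static void submit(job_fn fn, void *arg) {     /* no kernel trip, no new thread */
        pthread_mutex_lock(&mtx);
        jobs[tail] = (job){ fn, arg };
        tail = (tail + 1) % QCAP;
        count++;
        pthread_cond_signal(&notify);
        pthread_mutex_unlock(&mtx);
    }

    static void hello(void *arg) { printf("job %ld\n", (long)arg); }

    int main(void) {
        pthread_t tid[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (long i = 0; i < 10; i++)
            submit(hello, (void *)i);              /* cheap: queue + signal */
        pthread_mutex_lock(&mtx);
        shutting_down = 1;
        pthread_cond_broadcast(&notify);
        pthread_mutex_unlock(&mtx);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }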
(2) This becomes important only if you have many threads (> 100 or so). In this case async IO is a means to get rid of the threads. I found that if you don't have insane numbers of threads, synchronous IO, including blocking, is slightly faster than async IO (you read that right: sync IO is faster).
Saving the stack is trivial, no matter what its size: the stack pointer needs to be saved in the Thread Info Block in the kernel (usually saving most of the registers as well, since they will have been pushed by whatever soft/hard interrupt caused the OS to be entered).
One issue is that a protection-level ring cycle is required to enter the kernel from user space. This is an essential, but annoying, overhead. Then the driver or system call has to do whatever was requested by the interrupt, and then comes the scheduling/dispatching of threads onto processors. If this results in the preemption of a thread from one process by a thread from another, a load of extra process context has to be swapped as well. Even more overhead is added if the OS decides that a thread running on a different processor core than the one handling the interrupt must be preempted: the other core must be hardware-interrupted (this is on top of the hard/soft interrupt that entered the OS in the first place).
So, a scheduling run may be quite a complex operation.
'Green threads' or 'fibers' are (usually) scheduled from user code. A context change is much easier and cheaper than an OS interrupt etc. because no Wagnerian ring cycle is required on every context change, the process context does not change, and the OS thread running the green-thread group does not change.
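Here is a toy user-space context switch using POSIX ucontext (obsolescent in POSIX but still available on Linux, and used here purely as my own illustration): both swapcontext calls are plain user-space calls, with no kernel transition required.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, fiber_ctx;
    static char fiber_stack[64 * 1024];

    static void fiber_fn(void) {
        printf("fiber: first slice\n");
        swapcontext(&fiber_ctx, &main_ctx);  /* yield: a plain function call */
        printf("fiber: second slice\n");
        /* Falling off the end resumes uc_link (main_ctx). */
    }

    int main(void) {
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = fiber_stack;
        fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
        fiber_ctx.uc_link = &main_ctx;
        makecontext(&fiber_ctx, fiber_fn, 0);

        swapcontext(&main_ctx, &fiber_ctx);  /* run fiber until it yields */
        printf("main: fiber yielded\n");
        swapcontext(&main_ctx, &fiber_ctx);  /* resume fiber to completion */
        printf("main: fiber finished\n");
        return 0;
    }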
Since something-for-nothing does not exist, there are problems with green threads. They are run by 'real' OS threads. This means that if one 'green' thread in a group run by one OS thread makes an OS call that blocks, all green threads in the group are blocked. This means that simple calls like sleep() have to be 'emulated' by a state machine that yields to other green threads (yes, just like re-implementing the OS). The same goes for any inter-thread signalling.
Also, of course, green threads cannot directly respond to IO signalling, somewhat defeating the point of having any threads in the first place.
There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
It's an abstraction layer. It's easier for many people to grasp this concept and use it more effectively in many scenarios. It's also easier for many machines (assuming a good abstraction), since the model moves from width to pull in many cases. With pthreads (as an example), you have all the control. With other threading models, the idea is to reuse threads, for the process of creating a concurrent task to be inexpensive, and to use a completely different threading model. It's far easier to digest this model; there's less to learn and measure, and the results are generally good.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
Creating a thread is expensive, and the stack requires memory. As well, if your process is using many threads, then context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread became a good solution for medium to large tasks, ideally in low numbers. That's restrictive, and quite time consuming to maintain.
A task/thread pool/userland thread does not need to worry about much of the context switching or thread creation. It's often "reuse the resource when it becomes available, if it's not ready now -- also, determine the number of active threads for this machine".
More commonly (IMO), OS-level threads are expensive because they are not used correctly by the engineers: either there are too many and there is a ton of context switching, there is competition for the same set of resources, or the tasks are too small. It takes much more time to understand how to use OS threads correctly and how to apply that best to the context of a program's execution.
The OS provides both of these functions for free.
They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time 'soon' -- all the process' time is divided among the threads. That's not the common case with user threads. The task is often enqueued when the resource is not available. This reduces context switching, memory, and the total number of threads which must be created. When the task exits, the thread is given another.
Consider this analogy of time distribution:
Assume you are at a casino. There are a number of people who want cards.
You have a fixed number of dealers. There are fewer dealers than people who want cards.
There are not always enough cards for every person at any given time.
People need all cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.
How would you ask the dealers to distribute cards?
Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.
The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.
How would you deal fastest if there were 2 dealers and 3 people? and if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver 'on demand'. Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (increases context switching). There are a number of strategies to deal cards more quickly, especially after you eliminate the people's need for cards in a certain amount of time. Would it not be faster to go to a table and deal to a person or people until their game is complete if the dealer to people ratio were 1/50? Compare this to visiting every table based on priority, and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.
Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
If you developed a performance critical low level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available for users). From that angle, the importance of higher level multithreading models is a simple and obvious solution/optimization based on real world usage as well as the ideal that the entry bar for adopting and effectively utilizing multithreading can be lowered.
It's not that they are expensive -- the lightweight threads' model and pool is a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these consideration can dramatically reflow a program's execution/implementation. With higher level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs pull).
The problem with starting kernel threads for each small task is that it incurs a non-negligible overhead to start and stop, coupled with the stack size it needs.
This is the first important point: thread pools exist so that you can recycle threads, in order to avoid wasting time starting them as well as wasting memory for their stacks.
Secondly, if you fire off threads to do asynchronous I/O, they will spend most of their time blocked waiting for the I/O to complete, thus effectively not doing any work and wasting memory. A much better option is to have a single worker handle multiple async calls (through some under-the-hood scheduling technique, such as multiplexing), thus again saving memory and time.
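A minimal sketch of that multiplexing idea, using poll(2) (my own illustration; a pipe provides a second descriptor so that one is ready immediately): one thread waits on several descriptors instead of parking one blocked thread, and one stack, per call.

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        if (pipe(p) != 0) return 1;
        if (write(p[1], "x", 1) != 1) return 1;  /* make one fd ready */

        struct pollfd fds[2] = {
            { .fd = 0,    .events = POLLIN },    /* stdin */
            { .fd = p[0], .events = POLLIN },    /* read end of the pipe */
        };

        /* One thread sleeps until any descriptor is ready, instead of
           dedicating a blocked thread (and its stack) to each one. */
        int n = poll(fds, 2, 5000);
        for (int i = 0; n > 0 && i < 2; i++)
            if (fds[i].revents & POLLIN)
                printf("fd %d is ready\n", fds[i].fd);
        return 0;
    }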
One thing that makes "green" threads faster than kernel threads is that they are user-space objects, managed by a virtual machine. Starting them is a user space call, while starting a thread is a kernel-space call that is much slower.
A person at Google presents an interesting approach.
According to him, kernel-mode switching itself is not the bottleneck; the core cost lies in the SMP scheduler. He also claims that M:N scheduling assisted by the kernel wouldn't be expensive, which makes me expect general M:N threading to become available in every language.
Because of the OS. Imagine that, instead of asking you to clean the house, your grandmother has to call the social service, which does some paperwork and a week later assigns a social worker to help her. That worker can be called off at any time and replaced with another one, which again takes several days.
That's pretty ineffective and slow, huh?
In this metaphor you are a userland coroutine scheduler, the social service is an OS with its kernel-level thread scheduler, and a social worker is a fully-fledged thread.
I think the two things are at different levels.
A thread or process is an instance of a program being executed, and there is much more in a process/thread: an execution stack, open files, signals, processor status, and many other things.
A greenlet is different: it runs in a VM and supplies a lightweight thread. Many greenlet-style libraries supply pseudo-concurrency (typically in a single OS-level thread, or a few of them), and often they supply a lock-free style based on data transmission instead of data sharing.
So the two things have different focuses, and their weights are different.
And in my mind, the greenlet should be handled in the VM, not the OS.

Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?

I was very confused but the following thread cleared my doubts:
Multiprocessing, Multithreading, HyperThreading, Multi-core
But it addresses the queries from the hardware point of view. I want to know how these hardware features are mapped to software.
One thing that is obvious is that there is no difference between a multiprocessor (= multi-CPU) and a multicore system, other than that in multicore all CPUs reside on one chip (die), whereas in a multiprocessor system the CPUs are on their own chips and connected together.
So, multicore/multiprocessor systems are capable of executing multiple processes (firefox, mediaplayer, googletalk) at the "same time" (unlike context switching these processes on a single-processor system), right?
If that is correct, I'm clear so far. But the confusion arises when multithreading comes into the picture.
Multithreading "is for" parallel processing, right?
What are the elements involved in multithreading inside the CPU? A diagram? For me to exploit the power of parallel processing of two independent tasks, what should the requirements of the CPU be?
When people say context switching of threads, I don't really get it, because if it's context switching of threads then it's not parallel processing; the threads must be executed "strictly simultaneously", right?
My notion of multithreading is that:
Considering a system with a single CPU: when the process is context-switched to Firefox, (suppose) each tab of Firefox is a thread, and all the threads are executing strictly at the same time. It is not like one thread executes for some time and then another thread takes over until the context-switch time arrives.
What happens if I run multithreaded software on a processor which can't handle threads? I mean, how does the CPU handle such software?
If everything is good so far, the question now is: HOW MANY THREADS? It must be limited by hardware, I guess? If the hardware can support only 2 threads and I start 10 threads in my process, how would the CPU handle it? Pros/cons? From a software engineering point of view, while developing software that will be used by users on a wide variety of systems, how would I decide whether to go for multithreading, and if so, with how many threads?
First, try to understand the concepts of 'process' and 'thread'. A thread is a basic unit of execution: a thread is scheduled by the operating system and executed by the CPU. A process is a sort of container that holds multiple threads.
Yes, either multi-processing or multi-threading is for parallel processing. More precisely, to exploit thread-level parallelism.
Okay, multithreading could mean hardware multithreading (one example is HyperThreading), but I assume you just mean multithreading in software. In this sense, the CPU should support context switching.
Context switching is needed to implement multitasking even on a physically single core, by time division.
Say there are two physical cores and four very busy threads. In this case, two threads are just waiting until they get the chance to use the CPU. Read some articles related to preemptive OS scheduling.
The number of threads that can physically run concurrently is identical to the number of logical processors. You are asking about a general thread-scheduling problem from the OS literature, such as round-robin.
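If you just want to see how many logical processors the OS exposes, a short C program like the following works on Linux and most Unix-likes (_SC_NPROCESSORS_ONLN is a common extension, not strict POSIX):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Logical processors currently online, as seen by the OS. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors: %ld\n", n);
        return 0;
    }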
I strongly suggest you study the basics of operating systems first, and then move on to multithreading issues. It seems you're still unclear on key concepts such as context switching and scheduling. It will take a couple of months, but if you really want to be an expert in computer software, you should know such basic concepts. Please pick up any OS book and lecture slides.
Threads running on the same core are not technically parallel. They only appear to be executed in parallel, as the CPU switches between them very fast (for us humans). This switch is what is called a context switch.
Now, threads executing on different cores are executed in parallel.
Most modern CPUs have a number of cores; however, most modern OSes (Windows, Linux and friends) usually execute a much larger number of threads, which still causes context switches.
Even if no user program is executed, the OS itself still performs context switches for maintenance work.
This should answer 1-3.
About 4: basically, every processor can work with threads; it is much more a characteristic of the operating system. A thread is basically memory (optional), a stack, and registers; once those are replaced, you are in another thread.
About 5: the number of threads allowed is pretty high and is limited by the OS. Usually it is higher than a regular programmer can successfully handle :)
The number of threads is dictated by your program:
is it IO bound?
can the task be divided into a number of smaller tasks?
how small is the task? A task can be too small to make it worth spawning threads at all.
synchronization: if extensive synchronization is required, the penalty might be too heavy and you should reduce the number of threads.
Multiple threads are separate 'chains' of commands within one process. From the CPU's point of view, threads are more or less like processes. Each thread has its own set of registers and its own stack.
The reason why you can have more threads than CPUs is that most threads don't need the CPU all the time. A thread can be waiting for user input, downloading something from the web, or writing to disk. While it is doing that, it does not need the CPU, so the CPU is free to execute other threads.
In your example, each tab of Firefox can probably even have several threads, or the tabs can share some threads. You need one thread for downloading, one for rendering, one for the message loop (user input), and perhaps one to run JavaScript. You cannot easily combine them, because while you download you still need to react to the user's input. However, the download thread is sleeping most of the time, and even when it's downloading it needs the CPU only occasionally, and the message-loop thread only wakes up when you press a button.
If you go to the task manager you'll see that, despite all these threads, your CPU usage is still quite low.
Of course if all your threads do some number-crunching tasks, then you shouldn't create too many of them as you get no performance benefit (though there may be architectural benefits!).
However, if they are mainly I/O bound then create as many threads as your architecture dictates. It's hard to give advice without knowing your particular task.
Broadly speaking, yeah, but "parallel" can mean different things.
It depends what tasks you want to run in parallel.
Not necessarily. Some (indeed most) threads spend a lot of time doing nothing. Might as well switch away from them to a thread that wants to do something.
The OS handles thread switching. It will delegate to different cores if it wants to. If there's only one core it'll divide time between the different threads and processes.
The number of threads is limited by software and hardware. Threads consume processor and memory in varying degrees depending on what they're doing. The thread management software may impose its own limits as well.
The key thing to remember is the separation between logical/virtual parallelism and real/hardware parallelism. With your average OS, a system call is performed to spawn a new thread. What actually happens (whether it is mapped to a different core, a different hardware thread on the same core, or queued into the pool of software threads) is up to the OS.
Parallel processing uses all of these methods, not just multithreading.
Generally speaking, if you want to have real parallel processing, you need to perform it in hardware. Take the example of the Niagara: it has up to 8 cores, each capable of executing 4 threads in hardware.
Context switching is needed when there are more threads than can be executed in parallel in hardware. Even then, when executed in series (switching from one thread to the next), they are considered concurrent because there is no guarantee on the order of switching. So it may go T0, T1, T2, T1, T3, T0, T2 and so on. For all intents and purposes, the threads are parallel.
Time slicing.
That would be up to the OS.
Multithreading is the execution of more than one thread at a time. It can happen both on single-core processors and on multicore processor systems. For single-core systems, context switching effects it. Context switching in this computational environment refers to time slicing by the operating system, so do not get confused. The operating system is the one that controls the execution of other programs: it allows one program to execute on the CPU at a time, but the frequency at which the threads are switched in and out of the CPU determines the transparency of the parallelism exhibited by the system.
For a multicore environment, multithreading occurs when each core executes a thread. Though, in multicore again, context switching can occur within the individual cores.
I think the answers so far are pretty much to the point and give you a good basic context. In essence, say you have a quad-core processor, where each core is capable of executing 2 simultaneous threads.
Note that there is only a slight (or no) increase in speed if you are running 2 simultaneous threads on 1 core versus running the 1st thread and then the 2nd thread one after the other. However, each physical core adds speed to your general workflow.
Now, say you have a process running on your OS that has multiple threads (i.e. needs to run multiple things in "parallel") and has some kind of stack of tasks in a queue (or some other system with priority rules). Then software sends tasks to a queue and your processor attempts to execute them as fast as it can. Now you have 2 cases:
If the software supports multiprocessing, then tasks will be sent to any available processor (one that is not doing anything, or has simply finished doing some other job and the job sent from your software is first in its queue).
If your software does not support multiprocessing, then all of your jobs will be done in a similar manner, but only by one of your cores.
I suggest reading the Wikipedia page on threads. The very first picture there already gives you a nice insight. :)
