Programming for Multi core Processors

Programming for Multi core Processors - programming-languages

As far as I know, the multi-core architecture in a processor does not effect the program. The actual instruction execution is handled in a lower layer.
my question is,
Given that you have a multicore environment, Can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?

That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, in the ideal, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the paralell algorithms course I took did not have a textbook for the course. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflows, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the square of the number 1...N for some very large number of N, and I'm sure you can think of others.

I don't know if it's the best possible place to start, but I've subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting thing there, presented in pretty simple way. You can find some very basic articles on fundamental concepts of parallel computing, like this. Here you have a quick dive into openMP that is one possible approach to start parallelizing the slowest parts of your application, without changing the rest. (If those parts present parallelism, of course.) Also check Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section, the articles are not too many, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.

Yes, simply adding more cores to a system without altering the software would yield you no results (with exception of the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.

You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.

Related

Multithreading vs Shared Memory

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in memory database (10s of Gb) - the haystack.
This is divided into tasks where each task is to find each of a series of needles in the haystack
and each task is logically independent from the other tasks.
(This is already distributed across multiple machines where each machine
has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
3 possible architectures:
A process loads the haystack into Posix shared memory.
Subsequent processes use the shared memory segment instead (like a cache)
A process loads the haystack into memory and then forks.
Each process uses the same memory because of copy on write semantics.
A process loads the haystack into memory and spawns multiple search threads
The question is one method likely to be better and why? or rather what are the trade offs.
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that except for a monitoring threads/process the others are never switched out.

General advise: Be able to measure improvements! Without that, you may tweak all you like based on advise off the internet but still don't get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the according part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparison, you could reduce the time where either CPU or RAM are idle because they wait for new operations to perform. This is obviously (like others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have less than a certain amount of elements left to consider.
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.

The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.

GPGPU vs. Multicore?

What are the key practical differences between GPGPU and regular multicore/multithreaded CPU programming, from the programmer's perspective? Specifically:
What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
What are the key differences in programming model?
What are the key underlying hardware differences that necessitate any differences in programming model?
Which one is typically easier to use and by how much?
Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?

Interesting question. I have researched this very problem so my answer is based on some references and personal experiences.
What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
Like #Jared mentioned. GPGPU are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiply, simple photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate Texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread fires a long latency operation (say a memory access), that thread is put to sleep (and other threads continue to work) until the long latency operation finishes. This allows GPUs to keep their execution units busy a lot more than traditional cores.
GPUs are bad at handling branches because GPUs like to batch "threads" (SIMD lanes if you are not nVidia) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g., 2 threads in a 8-thread warp may take the branch while the other 6 may not take it. Now the warp has to be split into two warps of size 2 and 6. If your core has 8 SIMD lanes (which is why original warp pakced 8 threads), now your two newly formed warps will run inefficiently. The 2-thread warp will run at 25% efficiency and the 6-thread warp will run at 75% efficiency. You can imagine that if a GPU continues to encounter nested branches, its efficiency becomes very low. Therefore, GPUs aren't good at handling branches and hence code with branches should not be run on GPUs.
GPUs are also bad a co-operative threading. If threads need to talk to each other then GPUs won't work well because synchronization is not well-supported on GPUs (but nVidia is on it).
Therefore, the worst code for GPU is code with less parallelism or code with lots of branches or synchronization.
What are the key differences in programming model?
GPUs don't support interrupts and exception. To me thats the biggest difference. Other than that CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently but again thats not fundamental to our discussion.
What are the key underlying hardware differences that necessitate any differences in programming model?
I mentioned them already. The biggest is the SIMD nature of GPUs which requires code to be written in a very regular fashion with no branches and inter-thread communication. This is part of why, e.g., CUDA restricts the number of nested branches in the code.
Which one is typically easier to use and by how much?
Depends on what you are coding and what is your target.
Easily vectorizable code: CPU is easier to code but low performance. GPU is slightly harder to code but provides big bang for the buck.
For all others, CPU is easier and often better performance as well.
Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.
If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?
Lots of problems in the world are branchy and irregular. 1000s of examples. Graph search algorithms, operating systems, web browsers, etc. Just to add -- even graphics is becoming more and more branchy and general-purpose like every generation so GPUs will be becoming more and more like CPUs. I am not saying they will becomes just like CPUs, but they will become more programmable. The right model is somewhere in-between the power-inefficient CPUs and the very specialized GPUs.

Even in a multi-core CPU, your units of work are going to be much larger than on a GPGPU. GPGPUs are appropriate for problems that scale very well, with each chunk of work being increadibly small. A GPGPU has much higher latency because you have to move data to the GPU's memory system before it can be accessed. However, once the data is there, your throughput, if the problem is appropriately scalable, will be much higher with a GPGPU. In my experience, the problem with GPGPU programming is the latency in getting data from normal memory to the GPGPU.
Also, GPGPUs are horrible at communicating across worker processes if the worker processes don't have a sphere of locality routing. If you're trying to communicate all the way across the GPGPU, you're going to be in a lot of pain. For this reason, standard MPI libraries are a poor fit for GPGPU programming.
All computers aren't designed like GPUs because GPUs are fantastic at high latency, high throughput calculations that are inherently parallel and can be broken down easily. Most of what the CPU doing is not inherently parallel and does not scale to thousands or millions of simultaneous workers very efficiently. Luckily, graphics programming does and that's why all this started in GPUs. People have increasingly been finding problems that they can make look like graphics problems, which has led to the rise of GPGPU programming. However, GPGPU programming is only really worth your time if it is appropriate to your problem domain.

Is there a point to multithreading?

I don’t want to make this subjective...
If I/O and other input/output-related bottlenecks are not of concern, then do we need to write multithreaded code? Theoretically the single threaded code will fare better since it will get all the CPU cycles. Right?
Would JavaScript or ActionScript have fared any better, had they been multithreaded?
I am just trying to understand the real need for multithreading.

I don't know if you have payed any attention to trends in hardware lately (last 5 years) but we are heading to a multicore world.
A general wake-up call was this "The free lunch is over" article.
On a dual core PC, a single-threaded app will only get half the CPU cycles. And CPUs are not getting faster anymore, that part of Moores law has died.

In the words of Herb Sutter The free lunch is over, i.e. the future performance path for computing will be in terms of more cores not higher clockspeeds. The thing is that adding more cores typically does not scale the performance of software that is not multithreaded, and even then it depends entirely on the correct use of multithreaded programming techniques, hence multithreading is a big deal.
Another obvious reason is maintaining a responsive GUI, when e.g. a click of a button initiates substantial computations, or I/O operations that may take a while, as you point out yourself.

The primary reason I use multithreading these days is to keep the UI responsive while the program does something time-consuming. Sure, it's not high-tech, but it keeps the users happy :-)

Most CPUs these days are multi-core. Put simply, that means they have several processors on the same chip.
If you only have a single thread, you can only use one of the cores - the other cores will either idle or be used for other tasks that are running. If you have multiple threads, each can run on its own core. You can divide your problem into X parts, and, assuming each part can run indepedently, you can finish the calculations in close to 1/Xth of the time it would normally take.
By definition, the fastest algorithm running in parallel will spend at least as much CPU time as the fastest sequential algorithm - that is, parallelizing does not decrease the amount of work required - but the work is distributed across several independent units, leading to a decrease in the real-time spent solving the problem. That means the user doesn't have to wait as long for the answer, and they can move on quicker.
10 years ago, when multi-core was unheard of, then it's true: you'd gain nothing if we disregard I/O delays, because there was only one unit to do the execution. However, the race to increase clock speeds has stopped; and we're instead looking at multi-core to increase the amount of computing power available. With companies like Intel looking at 80-core CPUs, it becomes more and more important that you look at parallelization to reduce the time solving a problem - if you only have a single thread, you can only use that one core, and the other 79 cores will be doing something else instead of helping you finish sooner.

Much of the multithreading is done just to make the programming model easier when doing blocking operations while maintaining concurrency in the program - sometimes languages/libraries/apis give you little other choice, or alternatives makes the programming model too hard and error prone.
Other than that the main benefit of multi threading is to take advantage of multiple CPUs/cores - one thread can only run at one processor/core at a time.

No. You can't continue to gain the new CPU cycles, because they exist on a different core and the core that your single-threaded app exists on is not going to get any faster. A multi-threaded app, on the other hand, will benefit from another core. Well-written parallel code can go up to about 95% faster- on a dual core, which is all the new CPUs in the last five years. That's double that again for a quad core. So while your single-threaded app isn't getting any more cycles than it did five years ago, my quad-threaded app has four times as many and is vastly outstripping yours in terms of response time and performance.

Your question would be valid had we only had single cores. The things is though, we mostly have multicore CPU's these days. If you have a quadcore and write a single threaded program, you will have three cores which is not used by your program.
So actually you will have at most 25% of the CPU cycles and not 100%. Since the technology today is to add more cores and less clockspeed, threading will be more and more crucial for performance.

That's kind of like asking whether a screwdriver is necessary if I only need to drive this nail. Multithreading is another tool in your toolbox to be used in situations that can benefit from it. It isn't necessarily appropriate in every programming situation.

Here are some answers:
You write "If input/output related problems are not bottlenecks...". That's a big "if". Many programs do have issues like that, remembering that networking issues are included in "IO", and in those cases multithreading is clearly worthwhile. If you are writing one of those rare apps that does no IO and no communication then multithreading might not be an issue
"The single threaded code will get all the CPU cycles". Not necessarily. A multi-threaded code might well get more cycles than a single threaded app. These days an app is hardly ever the only app running on a system.
Multithreading allows you to take advantage of multicore systems, which are becoming almost universal these days.
Multithreading allows you to keep a GUI responsive while some action is taking place. Even if you don't want two user-initiated actions to be taking place simultaneously you might want the GUI to be able to repaint and respond to other events while a calculation is taking place.
So in short, yes there are applications that don't need multithreading, but they are fairly rare and becoming rarer.

First, modern processors have multiple cores, so a single thraed will never get all the CPU cycles.
On a dualcore system, a single thread will utilize only half the CPU. On a 8-core CPU, it'll use only 1/8th.
So from a plain performance point of view, you need multiple threads to utilize the CPU.
Beyond that, some tasks are also easier to express using multithreading.
Some tasks are conceptually independent, and so it is more natural to code them as separate threads running in parallel, than to write a singlethreaded application which interleaves the two tasks and switches between them as necessary.
For example, you typically want the GUI of your application to stay responsive, even if pressing a button starts some CPU-heavy work process that might go for several minutes. In that time, you still want the GUI to work. The natural way to express this is to put the two tasks in separate threads.

Most of the answers here make the conclusion multicore => multithreading look inevitable. However, there is another way of utilizing multiple processors - multi-processing. On Linux especially, where, AFAIK, threads are implemented as just processes perhaps with some restrictions, and processes are cheap as opposed to Windows, there are good reasons to avoid multithreading. So, there are software architecture issues here that should not be neglected.
Of course, if the concurrent lines of execution (either threads or processes) need to operate on the common data, threads have an advantage. But this is also the main reason for headache with threads. Can such program be designed such that the pieces are as much autonomous and independent as possible, so we can use processes? Again, a software architecture issue.
I'd speculate that multi-threading today is what memory management was in the days of C:
it's quite hard to do it right, and quite easy to mess up.
thread-safety bugs, same as memory leaks, are nasty and hard to find
Finally, you may find this article interesting (follow this first link on the page). I admit that I've read only the abstract, though.

When Should I Use Threads?

As far as I'm concerned, the ideal amount of threads is 3: one for the UI, one for CPU resources, and one for IO resources.
But I'm probably wrong.
I'm just getting introduced to them, but I've always used one for the UI and one for everything else.
When should I use threads and how? How do I know if I should be using them?

Unfortunately, there are no hard and fast rules to using Threads. If you have too many threads the processor will spend all its time generating and switching between them. Use too few threads you will not get the throughput you want in your application. Additionally using threads is not easy. A language like C# makes it easier on you because you have tools like ThreadPool.QueueUserWorkItem. This allows the system to manage thread creation and destruction. This helps mitigate the overhead of creating a new thread to pass the work onto. You have to remember that the creation of a thread is not an operation that you get for "free." There are costs associated with starting a thread so that should always be taken into consideration.
Depending upon the language you are using to write your application you will dictate how much you need to worry about using threads.
The times I find most often that I need to consider creating threads explicitly are:
Asynchronous operations
Operations that can be parallelized
Continual running background operations

The answer totally depends on what you're planning on doing. However, one for CPU resources is a bad move - your CPU may have up to six cores, plus hyperthreading, in a retail CPU, and most CPUs will have two or more. In this case, you should have as many threads as CPU cores, plus a few more for scheduling mishaps. The whole CPU is not a single-threaded beast, it may have many cores and need many threads for 100% utilization.
You should use threads if and only if your target demographic will virtually all have multi-core (as is the case in current desktop/laptop markets), and you have determined that one core is not enough performance.

Herb Sutter wrote an article for Dr. Dobb's Journal in which he talks about the three pillars of concurrency. This article does a very good job of breaking down which problems are good candidates for being solved via threading constructs.

From the SQLite FAQ: "Threads are evil. Avoid Them." Only use them when you absolutely have to.
If you have to, then take steps to avoid the usual carnage. Use thread pools to execute fine-grained tasks with no interdependencies, using GUI-framework-provided facilities to dispatch outcomes back to the UI. Avoid sharing data between long-running threads; use message queues to pass information between them (and to synchronise).
A more exotic solution is to use languages such as Erlang that are explicit designed for fine-grained parallelism without sacrificing safety and comprehensibility. Concurrency itself is of fundamental importance to the future of computation; threads are simply a horrible, broken way to express it.

The "ideal number of threads" depends on your particular problem and how much parallelism you can exploit. If you have a problem that is "embarassingly parallel" in that it can be subdivided into independent problems with little to no communication between them required, and you have enough cores that you can actually get true parallelism, then how many threads you use depends on things like the problem size, the cache line size, the context switching and spawning overhead, and various other things that is really hard to compute before hand. For such situations, you really have to do some profiling in order to choose an optimal sharding/partitioning of your problem across threads. It typically doesn't make sense, though, to use more threads than you do cores. It is also true that if you have lots of synchronization, then you may, in fact, have a performance penalty for using threads. It's highly dependent on the particular problem as well as how interdependent the various steps are. As a guiding principle, you need to be aware that spawning threads and thread synchronization are expensive operations, but performing computations in parallel can increase throughput if communication and other forms of synchronization is minimal. You should also be aware that threading can lead to very poor cache performance if your threads end up invalidating a mutually shared cache line.

Parallel coding Vs Multithreading (on single cpu)

can we use interchangeably "Parallel coding" and "Multithreading coding " on single cpu?
i am not much experience in both,
but i want to shift my coding style to any one of the above.
As i found now a days many single thred application are obsolete, which would be better for future software industy as a career prospect?

There is definitely overlap between multithreading and parallel coding/computing, with the main differences in the target processing architecture.
Multithreading has been used to exploit the benefits of concurrency within a single process on a single CPU with shared memory. Running the same programs on a machine with multiple CPUs may result in significant speedup, but is often a bonus rather than intended (until recently). Many OSes have threading models (e.g. pthreads), which benefit from but do not require multiple CPUs.
Multiprocessing is the standard model for parallel programming targeting multiple CPUs, from early SMP machines with many CPUs on a big machine, then to cluster computing across many machines, and now back to many CPUs/cores on a single computer. MPI is a standard that can work across many different architectures.
Of course, one can program a parallel design using threads with language frameworks like OpenMP. I've heard of multicomponent GUIs/applications that rely on separate processing that could theoretically run anywhere. Practically, there's more of the former than the latter.
Probably the main distinction is when the program runs across multiple machines, where it's not practical to use multithreading, and existing applications that share memory will not work.

Parallel coding is the concept of executing multiple actions in parallel(Same time).
Multi-threaded Programming on a Single Processor
Multi-threading on a single processor gives the illusion of running in parallel. Behind the scenes, the processor is switching between threads depending on how threads have been prioritized.
Multi-threaded Programming on Multiple Processors
Multi-threading on multiple processor cores is truly parallel. Each microprocessor is running a single thread. Consequently, there are multiple parallel, concurrent tasks happening at once.

The question is a bit confusing as you can perform parallel operations in multiple threads, but all multi-thread applications are not using parallel computing.
In parallel code, you typically have many "workers" that consume a set of data to return results asynchronously. But multithread is used in a broader scope, like GUI, blocking I/O and networking.
Being on a single or many CPU does not change much, as the management depends on how your OS can handle threads and processes.
Multithreading will be useful everywhere, parallel is not everyday computing paradigm, so it might be a "niche" in a career prospect.

Some demos I saw in .NET 4.0, the Parallel code changes seem easier then doing threads. There is new syntax for "For Loops" and other things to support parallel processing. So there is a difference.
I think in the future you will do both, but I think the Parallel support will be better and easier. You still need threads for background operations and other things.

The fact is that you cannot achieve "real" parallelism on a single CPU. There are several libraries (such as C's MPI) that help a little bit on this area. But the concept of paralellism it's not that used among developers working on popular solutions.
Multithreading is common these days thanks to the introduction of multiple cores on a single CPU, it's easy and almost transparent to implement in every language thanks to thread libs and threadsafe types, methods, classes and so on. This way you can simulate paralellism.
Anyway, if you're starting with this, start by reading about concurrency and threading topics. And of course, threads + parallelism work good together.

I'm not sure about what do you think "Parallel coding" is but Parallel coding as I understand it refers to producing code which is executed in parallel by the CPU, and therefore Multithreaded code falls inside that description.
In that way, obviously you can use them interchangeably (as one falls inside the other).
Nonetheless I'll suggest you take it slowly and start learning from the basics. Understand WHY multithreading is becoming important, what's the difference between processes, threads and fibers, how do you synchronize either of them and so on.
Keep in mind that parallel coding, as you call it, is quite complex, specially compared to sequential coding so be prepared. Also don't just rush into it. Just because you use 3 threads instead of one won't make your program faster, it can even make it slower. You need to understand the hows and the whys. Not every thing can be made parallel and not everthing that can, should.

in simple Language
multithreading is available in the CPu by itself and
parallel programming is an explicit task either done by the compiler or my constructs written by programmers "#pragma"

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string