Independent Thread Scheduling since Volta

Independent Thread Scheduling since Volta - multithreading

Nvidia introduced a new Independent Thread Scheduling for their GPGPUs since Volta. In case CUDA threads diverge, alternative code paths are not executed in blocks but instruction-wise. Still, divergent paths can not be executed at the same time since the GPUs are SIMT as well. This is the original article:
https://developer.nvidia.com/blog/inside-volta/ (scroll down to "Independent Thread Scheduling").
I understood what this means. What I don't understand is, in which way this new behavoir accelerates code. Even the before/after diagrams in the above article do not reflect an overall speed-up.
My question: Which kinds of divergent algorithms will run faster on Volta (and newer) due to the described new scheduling?

The purpose of the feature is not necessarily to accelerate code.
An important purpose of the feature is to enable reliable use of programming models such as producer-consumer within a warp (amongst threads in the same warp) that would have been either brittle or prone to hang using the previous thread schedulers pre-volta.
The typical example IMO of which you can find various examples here on the cuda tag, is people trying to negotiate for atomic locks among threads in the same warp. This would have been "brittle" (and here) or not workable (hangs) on previous architectures. It works well, on volta, in my experience.
Here is another example of an algorithm that just hangs on pre-volta, but "works" (does not hang) on volta+.

Related

Recursion calls on different cores?

So I recently was writing program to confirm that Dijkstra mutual exclusion algorithm is working. I decided to use Kotlin coz I didn't want to use C++ and manage memory myself. Today I saw that despite the fact my program is running sequential all cores are in 100% use, is it some JVM optimization? Or maybe Kotlin optimized my recursion calls? I need to mention that my recursion is not tail recursion. Do any of you know why this happens? I did not use threads or coroutines just to be clear.

Short answer: don't worry!
If you're not using coroutines or explicit threads, then your code should continue to execute in the same thread.
However, there's no guarantee that your thread will always be executed on the same core; the OS is free to schedule it on whichever core it thinks best at each moment.  (I don't know what criteria various OSs may use to make those decisions, but it's likely to take account of other threads and processes. And your thread can move between codes quickly enough to confuse any monitoring you may do.)
Also, even if you don't start any other threads, there are likely to be several background threads handling garbage collection, running finalisers, and other housekeeping.  If you use a GUI toolkit such as JavaFX or Swing, that will use many threads, as will a framework such as Spring.  (These will usually be marked as ‘dæmon’ theads, so they don't prevent the JVM from exiting.)
Finally, the JVM itself is likely to use system threads for background compilation of bytecode, monitoring, and so on.  (Different JVM implementations may do this differently, of course.)
(None of this is specific to Kotlin, by the way; it's the same for all Java apps.  But it'd almost certainly be different for Kotlin/JS and Kotlin/native.)
So no, activity on multiple cores does not imply that your code has been transformed or rewritten.  It just means that you're not running on the bare metal, and should trust the JVM to take care of things!

It was GC, here is Reddit post I created there.
https://www.reddit.com/r/Kotlin/comments/bp9y1t/recursion_calls_on_different_cores/

When should a thread generally yield?

In most languages/frameworks, there exists a way for a thread to yield control to other threads. However, I can't really think of a time when yielding from a thread was the correct solution to a given problem. When, in general, should one use Thread.yield(), sleep(0), etc?

One use case could be for testing concurrent programs, try to find interleavings that reveal flaws in your synchronization patterns. For instance in Java:
A useful trick for increasing the number of interleavings, and
therefore more effectively exploring the state space of your programs,
is to use Thread.yield to encourage more context switches during
operations that access shared state. (The effectiveness of this
technique is platform-specific, since the JVM is free to treat
THRead.yield as a no-op [JLS 17.9]; using a short but nonzero sleep
would be slower but more reliable.) — JCIP
Also interesting from the Java point of view is that their semantics are not defined:
The semantics of Thread.yield (and Thread.sleep(0)) are undefined
[JLS 17.9]; the JVM is free to implement them as no-ops or treat them
as scheduling hints. In particular, they are not required to have the
semantics of sleep(0) on Unix systemsput the current thread at the end
of the run queue for that priority, yielding to other threads of the
same prioritythough some JVMs implement yield in this way. — JCIP
This makes them, of course, rather unreliable. This is very Java specific, however, in generally I believe following is true:
Both are low-level mechanism which can be used to influence the scheduling order. If this is used to achieve a certain functionality then this functionality is based on the probability of the OS scheduler which seems a rather bad idea. This should be managed by higher-level synchronization constructs instead.
For testing purpose or for forcing the program into a certain state it seems a handy tool.

When, in general, should one use Thread.yield(), sleep(0), etc?
It depends on the VM are thread model we are talking about. For me the answer is rarely if ever.
Traditionally some thread models were non-preemptive and others are (or were) not mature hence the need for Thread.yield().
I feel that Thread.yield() is like using register in C. We used to rely on it to improve the performance of our programs because in many cases the programmer was better at this than the compiler. But modern compilers are much smarter and in much fewer cases these days can the programmer actually improve the performance of a program with the use of register and Thread.yield().

Keep your OS scheduler decide for you ?
So never yield, and never sleep(0) until you match a case where sleep(0) is absolutly necessary and document it here.
Also context switch are costy so I don't think a lot of people want more context switches.

I know this is old, but you didn't get any good answers here.
In general yielding is a way to be polite to other threads/processes and give them a chance to run on the same CPU with minimal delay to the yielding thread.
Not all yielding is equal either. On Windows SwitchToThread() only releases CPU if another thread of equal or greater priority was scheduled to run on the same CPU which means it very possibly will simply resume the calling thread while Sleep(0) has looser scheduler semantics; on Linux sched_yield() is similar to SwitchToThread() while nanosleep() with a 0 timespec seemingly marks the thread as unready for whatever period the timer slack is set to (inferred from profiling and substantiated here ). Behavior on MacOS is seemingly similar to Linux, but with much less timer slack - haven't looked into it that much though.
Yielding was way more useful in the days when uniprocessor systems were abundant because it really helped keep the system moving, but for example on Windows where by default Sleep(1) is actually predictably at least a 15.6ms delay (note that this is nearly an entire frame at 60fps if you're making a game or media player or something) it's still pretty valid although MessageWaitForMultipleObjectsEx should be preferred in general UI applications. Windows 10 added a new type of high resolution waitable timer with microsecond granularity that should probably be preferred over other methods, so hopefully that kind of yielding won't be so necessary anymore either.
In the context of N:1 and N:M cooperative threading models (not common at the OS level anymore, but still employed at the application-level through libraries providing Fibers and Coroutines often enough) yielding is still also definitely useful to keep things moving.
Unfortunately it's also abused pretty often, for example yielding in a busy loop rather than waiting on a synchronization primitive because the appropriate primitive isn't obvious or because the developer is overly optimistic about how long their threads will wait for / overly pessimistic about the scheduler. But in practice on most modern multitasking OSes unless the system is extremely busy, threads waiting on a synchronization primitive will get run almost instantly when the primitive is triggered/released/whatever.
You should try to avoid yielding, especially as an alternative to using a proper synchronization method. When you do need to yield, a zero sleep or waiting on a high resolution time source is probably better than a normal yield - I call the prior a "long yield" as opposed to a "short yield" - but unless you're using the system interface the implementation of sleep in your programming language/framework of choice might "optimize" sleep(0) into a short yield or even a no-op for you, sadly.

Processes, threads, green threads, protothreads, fibers, coroutines: what's the difference?

I'm reading up on concurrency. I've got a bit over my head with terms that have confusingly similar definitions. Namely:
Processes
Threads
"Green threads"
Protothreads
Fibers
Coroutines
"Goroutines" in the Go language
My impression is that the distinctions rest on (1) whether truly parallel or multiplexed; (2) whether managed at the CPU, at the OS, or in the program; and (3..5) a few other things I can't identify.
Is there a succinct and unambiguous guide to the differences between these approaches to parallelism?

OK, I'm going to do my best. There are caveats everywhere, but I'm going to do my best to give my understanding of these terms and references to something that approximates the definition I've given.
Process: OS-managed (possibly) truly concurrent, at least in the presence of suitable hardware support. Exist within their own address space.
Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
Coroutines: Exactly fibers, except not OS-managed.
Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.
It's also worth noting that there are other understandings in concurrency theory of the term "process", in the process calculus sense. This definition is orthogonal to those above, but I just thought it worth mentioning so that no confusion arises should you see process used in that sense somewhere.
Also, be aware of the difference between parallel and concurrent. It's possible you were using the former in your question where I think you meant the latter.

I mostly agree with Gian's answer, but I have different interpretations of a few concurrency primitives. Note that these terms are often used inconsistently by different authors. These are my favorite definitions (hopefully not too far from the modern consensus).
Process:
OS-managed
Each has its own virtual address space
Can be interrupted (preempted) by the system to allow another process to run
Can run in parallel with other processes on different processors
The memory overhead of processes is high (includes virtual memory tables, open file handles, etc)
The time overhead for creating and context switching between processes is relatively high
Threads:
OS-managed
Each is "contained" within some particular process
All threads in the same process share the same virtual address space
Can be interrupted by the system to allow another thread to run
Can run in parallel with other threads on different processors
The memory and time overheads associated with threads are smaller than processes, but still non-trivial
(For example, typically context switching involves entering the kernel and invoking the system scheduler.)
Cooperative Threads:
May or may not be OS-managed
Each is "contained" within some particular process
In some implementations, each is "contained" within some particular OS thread
Cannot be interrupted by the system to allow a cooperative peer to run
(The containing process/thread can still be interrupted, of course)
Must invoke a special yield primitive to allow peer cooperative threads to run
Generally cannot be run in parallel with cooperative peers
(Though some people think it's possible: http://ocm.dreamhosters.com/.)
There are lots of variations on the cooperative thread theme that go by different names:
Fibers
Green threads
Protothreads
User-level threads (user-level threads can be interruptable/preemptive, but that's a relatively unusual combination)
Some implementations of cooperative threads use techniques like split/segmented stacks or even individually heap-allocating every call frame to reduce the memory overhead associated with pre-allocating a large chunk of memory for the stack
Depending on the implementation, calling a blocking syscall (like reading from the network or sleeping) will either cause a whole group of cooperative threads to block or implicitly cause the calling thread to yield
Coroutines:
Some people use "coroutine" and "cooperative thread" more or less synonymously
I do not prefer this usage
Some coroutine implementations are actually "shallow" cooperative threads; yield can only be invoked by the "coroutine entry procedure"
The shallow (or semi-coroutine) version is easier to implement than threads, because each coroutine does not need a complete stack (just one frame for the entry procedure)
Often coroutine frameworks have yield primitives that require the invoker to explicitly state which coroutine control should transfer to
Generators:
Restricted (shallow) coroutines
yield can only return control back to whichever code invoked the generator
Goroutines:
An odd hybrid of cooperative and OS threads
Cannot be interrupted (like cooperative threads)
Can run in parallel on a language runtime-managed pool of OS threads
Event handlers:
Procedures/methods that are invoked by an event dispatcher in response to some action happening
Very popular for user interface programming
Require little to no language/system support; can be implemented in a library
At most one event handler can be running at a time; the dispatcher must wait for a handler to finish (return) before starting the next
Makes synchronization relatively simple; different handler executions never overlap in time
Implementing complex tasks with event handlers tends to lead to "inverted control flow"/"stack ripping"
Tasks:
Units of work that are doled out by a manager to a pool of workers
The workers can be threads, processes or machines
Of course the kind of worker a task library uses has a significant impact on how one implements the tasks
In this list of inconsistently and confusingly used terminology, "task" takes the crown. Particularly in the embedded systems community, "task" is sometimes used to mean "process", "thread" or "event handler" (usually called an "interrupt service routine"). It is also sometimes used generically/informally to refer to any kind of unit of computation.
One pet peeve that I can't stop myself from airing: I dislike the use of the phrase "true concurrency" for "processor parallelism". It's quite common, but I think it leads to much confusion.
For most applications, I think task-based frameworks are best for parallelization. Most of the popular ones (Intel's TBB, Apple's GCD, Microsoft's TPL & PPL) use threads as workers. I wish there were some good alternatives that used processes, but I'm not aware of any.
If you're interested in concurrency (as opposed to processor parallelism), event handlers are the safest way to go. Cooperative threads are an interesting alternative, but a bit of a wild west. Please do not use threads for concurrency if you care about the reliability and robustness of your software.

Protothreads are just a switch case implementation that acts like a state machine but makes implementation of the software a whole lot simpler. It is based around idea of saving a and int value before a case label and returning and then getting back to the point after the case by reading back that variable and using switch to figure out where to continue. So protothread are a sequential implementation of a state machine.

Protothreads are great when implementing sequential state machines. Protothreads are not really threads at all, but rather a syntax abstraction that makes it much easier to write a switch/case state machine that has to switch states sequentially (from one to the next etc..).
I have used protothreads to implement asynchronous io: http://martinschroder.se/asynchronous-io-using-protothreads/

Why might threads be considered "evil"?

I was reading the SQLite FAQ, and came upon this passage:
Threads are evil. Avoid them.
I don't quite understand the statement "Thread are evil". If that is true, then what is the alternative?
My superficial understanding of threads is:
Threads make concurrence happen. Otherwise, the CPU horsepower will be wasted, waiting for (e.g.) slow I/O.
But the bad thing is that you must synchronize your logic to avoid contention and you have to protect shared resources.
Note: As I am not familiar with threads on Windows, I hope the discussion will be limited to Linux/Unix threads.

When people say that "threads are evil", the usually do so in the context of saying "processes are good". Threads implicitly share all application state and handles (and thread locals are opt-in). This means that there are plenty of opportunities to forget to synchronize (or not even understand that you need to synchronize!) while accessing that shared data.
Processes have separate memory space, and any communication between them is explicit. Furthermore, primitives used for interprocess communication are often such that you don't need to synchronize at all (e.g. pipes). And you can still share state directly if you need to, using shared memory, but that is also explicit in every given instance. So there are fewer opportunities to make mistakes, and the intent of the code is more explicit.

Simple answer the way I understand it...
Most threading models use "shared state concurrency," which means that two execution processes can share the same memory at the same time. If one thread doesn't know what the other is doing, it can modify the data in a way that the other thread doesn't expect. This causes bugs.
Threads are "evil" because you need to wrap your mind around n threads all working on the same memory at the same time, and all of the fun things that go with it (deadlocks, racing conditions, etc).
You might read up about the Clojure (immutable data structures) and Erlang (message passsing) concurrency models for alternative ideas on how to achieve similar ends.

What makes threads "evil" is that once you introduce more than one stream of execution into your program, you can no longer count on your program to behave in a deterministic manner.
That is to say: Given the same set of inputs, a single-threaded program will (in most cases) always do the same thing.
A multi-threaded program, given the same set of inputs, may well do something different every time it is run, unless it is very carefully controlled. That is because the order in which the different threads run different bits of code is determined by the OS's thread scheduler combined with a system timer, and this introduces a good deal of "randomness" into what the program does when it runs.
The upshot is: debugging a multi-threaded program can be much harder than debugging a single-threaded program, because if you don't know what you are doing it can be very easy to end up with a race condition or deadlock bug that only appears (seemingly) at random once or twice a month. The program will look fine to your QA department (since they don't have a month to run it) but once it's out in the field, you'll be hearing from customers that the program crashed, and nobody can reproduce the crash.... bleah.
To sum up, threads aren't really "evil", but they are strong juju and should not be used unless (a) you really need them and (b) you know what you are getting yourself into. If you do use them, use them as sparingly as possible, and try to make their behavior as stupid-simple as you possibly can. Especially with multithreading, if anything can go wrong, it (sooner or later) will.

I would interpret it another way. It's not that threads are evil, it's that side-effects are evil in a multithreaded context (which is a lot less catchy to say).
A side effect in this context is something that affects state shared by more than one thread, be it global or just shared. I recently wrote a review of Spring Batch and one of the code snippets used is:
private static Map<Long, JobExecution> executionsById = TransactionAwareProxyFactory.createTransactionalMap();
private static long currentId = 0;
public void saveJobExecution(JobExecution jobExecution) {
Assert.isTrue(jobExecution.getId() == null);
Long newId = currentId++;
jobExecution.setId(newId);
jobExecution.incrementVersion();
executionsById.put(newId, copy(jobExecution));
}
Now there are at least three serious threading issues in less than 10 lines of code here. An example of a side effect in this context would be updating the currentId static variable.
Functional programming (Haskell, Scheme, Ocaml, Lisp, others) tend to espouse "pure" functions. A pure function is one with no side effects. Many imperative languages (eg Java, C#) also encourage the use of immutable objects (an immutable object is one whose state cannot change once created).
The reason for (or at least the effect of) both of these things is largely the same: they make multithreaded code much easier. A pure function by definition is threadsafe. An immutable object by definition is threadsafe.
The advantage processes have is that there is less shared state (generally). In traditional UNIX C programming, doing a fork() to create a new process would result in shared process state and this was used as a means of IPC (inter-process communication) but generally that state is replaced (with exec()) with something else.
But threads are much cheaper to create and destroy and they take less system resources (in fact, the operating itself may have no concept of threads yet you can still create multithreaded programs). These are called green threads.

The paper you linked to seems to explain itself very well. Did you read it?
Keep in mind that a thread can refer to the programming-language construct (as in most procedural or OOP languages, you create a thread manually, and tell it to executed a function), or they can refer to the hardware construct (Each CPU core executes one thread at a time).
The hardware-level thread is obviously unavoidable, it's just how the CPU works. But the CPU doesn't care how the concurrency is expressed in your source code. It doesn't have to be by a "beginthread" function call, for example. The OS and the CPU just have to be told which instruction threads should be executed.
His point is that if we used better languages than C or Java with a programming model designed for concurrency, we could get concurrency basically for free. If we'd used a message-passing language, or a functional one with no side-effects, the compiler would be able to parallelize our code for us. And it would work.

Threads aren't any more "evil" than hammers or screwdrivers or any other tools; they just require skill to utilize. The solution isn't to avoid them; it's to educate yourself and up your skill set.

Creating a lot of threads without constraint is indeed evil.. using a pooling mechanisme (threadpool) will mitigate this problem.
Another way threads are 'evil' is that most framework code is not designed to deal with multiple threads, so you have to manage your own locking mechanisme for those datastructures.
Threads are good, but you have to think about how and when you use them and remember to measure if there really is a performance benefit.

A thread is a bit like a light weight process. Think of it as an independent path of execution within an application. The thread runs in the same memory space as the application and therefore has access to all the same resources, global objects and global variables.
The good thing about them: you can parallelise a program to improve performance. Some examples, 1) In an image editing program a thread may run the filter processing independently of the GUI. 2) Some algorithms lend themselves to multiple threads.
Whats bad about them? if a program is poorly designed they can lead to deadlock issues where both threads are waiting on each other to access the same resource. And secondly, program design can me more complex because of this. Also, some class libraries don't support threading. e.g. the c library function "strtok" is not "thread safe". In other words, if two threads were to use it at the same time they would clobber each others results. Fortunately, there are often thread safe alternatives... e.g. boost library.
Threads are not evil, they can be very useful indeed.
Under Linux/Unix, threading hasn't been well supported in the past although I believe Linux now has Posix thread support and other unices support threading now via libraries or natively. i.e. pthreads.
The most common alternative to threading under Linux/Unix platforms is fork. Fork is simply a copy of a program including it's open file handles and global variables. fork() returns 0 to the child process and the process id to the parent. It's an older way of doing things under Linux/Unix but still well used. Threads use less memory than fork and are quicker to start up. Also, inter process communications is more work than simple threads.

In a simple sense you can think of a thread as another instruction pointer in the current process. In other words it points the IP of another processor to some code in the same executable. So instead of having one instruction pointer moving through the code there are two or more IP's executing instructions from the same executable and address space simultaneously.
Remember the executable has it's own address space with data / stack etc... So now that two or more instructions are being executed simultaneously you can imagine what happens when more than one of the instructions wants to read/write to the same memory address at the same time.
The catch is that threads are operating within the process address space and are not afforded protection mechanisms from the processor that full blown processes are. (Forking a process on UNIX is standard practice and simply creates another process.)
Out of control threads can consume CPU cycles, chew up RAM, cause execeptions etc.. etc.. and the only way to stop them is to tell the OS process scheduler to forcibly terminate the thread by nullifying it's instruction pointer (i.e. stop executing). If you forcibly tell a CPU to stop executing a sequence of instructions what happens to the resources that have been allocated or are being operated on by those instructions? Are they left in a stable state? Are they properly freed? etc...
So, yes, threads require more thought and responsibility than executing a process because of the shared resources.

For any application that requires stable and secure execution for long periods of time without failure or maintenance, threads are always a tempting mistake. They invariably turn out to be more trouble than they are worth. They produce rapid results and prototypes that seem to be performing correctly but after a couple weeks or months running you discover that they have critical flaws.
As mentioned by another poster, once you use even a single thread in your program you have now opened a non-deterministic path of code execution that can produce an almost infinite number of conflicts in timing, memory sharing and race conditions. Most expressions of confidence in solving these problems are expressed by people who have learned the principles of multithreaded programming but have yet to experience the difficulties in solving them.
Threads are evil. Good programmers avoid them wherever humanly possible. The alternative of forking was offered here and it is often a good strategy for many applications. The notion of breaking your code down into separate execution processes which run with some form of loose coupling often turns out to be an excellent strategy on platforms that support it. Threads running together in a single program is not a solution. It is usually the creation of a fatal architectural flaw in your design that can only be truly remedied by rewriting the entire program.
The recent drift towards event oriented concurrency is an excellent development innovation. These kinds of programs usually prove to have great endurance after they are deployed.
I've never met a young engineer who didn't think threads were great. I've never met an older engineer who didn't shun them like the plague.

Being an older engineer, I heartily agree with the answer by Texas Arcane.
Threads are very evil because they cause bugs that are extremely difficult to solve. I have literally spent months solving sporadic race-conditions. One example caused trams to suddenly stop about once a month in the middle of the road and block traffic until towed away. Luckily I didn't create the bug, but I did get to spend 4 months full-time to solve it...
It's a tad late to add to this thread, but I would like to mention a very interesting alternative to threads: asynchronous programming with co-routines and event loops. This is being supported by more and more languages, and does not have the problem of race conditions like multi-threading has.
It can replace multi-threading in cases where it is used to wait on events from multiple sources, but not where calculations need to be performed in parallel on multiple CPU cores.

Parallel coding Vs Multithreading (on single cpu)

can we use interchangeably "Parallel coding" and "Multithreading coding " on single cpu?
i am not much experience in both,
but i want to shift my coding style to any one of the above.
As i found now a days many single thred application are obsolete, which would be better for future software industy as a career prospect?

There is definitely overlap between multithreading and parallel coding/computing, with the main differences in the target processing architecture.
Multithreading has been used to exploit the benefits of concurrency within a single process on a single CPU with shared memory. Running the same programs on a machine with multiple CPUs may result in significant speedup, but is often a bonus rather than intended (until recently). Many OSes have threading models (e.g. pthreads), which benefit from but do not require multiple CPUs.
Multiprocessing is the standard model for parallel programming targeting multiple CPUs, from early SMP machines with many CPUs on a big machine, then to cluster computing across many machines, and now back to many CPUs/cores on a single computer. MPI is a standard that can work across many different architectures.
Of course, one can program a parallel design using threads with language frameworks like OpenMP. I've heard of multicomponent GUIs/applications that rely on separate processing that could theoretically run anywhere. Practically, there's more of the former than the latter.
Probably the main distinction is when the program runs across multiple machines, where it's not practical to use multithreading, and existing applications that share memory will not work.

Parallel coding is the concept of executing multiple actions in parallel(Same time).
Multi-threaded Programming on a Single Processor
Multi-threading on a single processor gives the illusion of running in parallel. Behind the scenes, the processor is switching between threads depending on how threads have been prioritized.
Multi-threaded Programming on Multiple Processors
Multi-threading on multiple processor cores is truly parallel. Each microprocessor is running a single thread. Consequently, there are multiple parallel, concurrent tasks happening at once.

The question is a bit confusing as you can perform parallel operations in multiple threads, but all multi-thread applications are not using parallel computing.
In parallel code, you typically have many "workers" that consume a set of data to return results asynchronously. But multithread is used in a broader scope, like GUI, blocking I/O and networking.
Being on a single or many CPU does not change much, as the management depends on how your OS can handle threads and processes.
Multithreading will be useful everywhere, parallel is not everyday computing paradigm, so it might be a "niche" in a career prospect.

Some demos I saw in .NET 4.0, the Parallel code changes seem easier then doing threads. There is new syntax for "For Loops" and other things to support parallel processing. So there is a difference.
I think in the future you will do both, but I think the Parallel support will be better and easier. You still need threads for background operations and other things.

The fact is that you cannot achieve "real" parallelism on a single CPU. There are several libraries (such as C's MPI) that help a little bit on this area. But the concept of paralellism it's not that used among developers working on popular solutions.
Multithreading is common these days thanks to the introduction of multiple cores on a single CPU, it's easy and almost transparent to implement in every language thanks to thread libs and threadsafe types, methods, classes and so on. This way you can simulate paralellism.
Anyway, if you're starting with this, start by reading about concurrency and threading topics. And of course, threads + parallelism work good together.

I'm not sure about what do you think "Parallel coding" is but Parallel coding as I understand it refers to producing code which is executed in parallel by the CPU, and therefore Multithreaded code falls inside that description.
In that way, obviously you can use them interchangeably (as one falls inside the other).
Nonetheless I'll suggest you take it slowly and start learning from the basics. Understand WHY multithreading is becoming important, what's the difference between processes, threads and fibers, how do you synchronize either of them and so on.
Keep in mind that parallel coding, as you call it, is quite complex, specially compared to sequential coding so be prepared. Also don't just rush into it. Just because you use 3 threads instead of one won't make your program faster, it can even make it slower. You need to understand the hows and the whys. Not every thing can be made parallel and not everthing that can, should.

in simple Language
multithreading is available in the CPu by itself and
parallel programming is an explicit task either done by the compiler or my constructs written by programmers "#pragma"

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string