Differences between CPU and GPU (graphics)

I'm trying to understand some things better in my computer graphics class. What are the main differences between CPU and GPU? Thanks for any help.

OK, the main difference is that a GPU has a very specialized structure built to handle graphics. You could think of the CPU as a general-purpose unit that just follows instructions and uses its arithmetic and logic unit (ALU) to perform computations on data coming in and going out, either sequentially or with a modest degree of parallelism. A GPU, on the other hand, has special-purpose hardware: many units wired specifically to perform a few hugely parallel, compute-intensive tasks which, if performed by a CPU, would take a long time. I think the Wikipedia article on GPUs explains it more clearly.

Related

Hyper-threading at the hardware level

So, this semester I have a subject about operating systems, and I don't understand much about hyper-threading yet. I searched the internet, but what I found is mostly the same material (I don't know if I searched with the wrong terms).
Here are the sources I found:
https://www.dasher.com/will-hyper-threading-improve-processing-performance/
Hyper-threading Performance Comparison
Why does hyper-threading benefit my algorithm?
But my question is not about HT in different languages, or how to analyse performance with/without it, but about how it is implemented at the hardware level.
How does HT communicate with main memory, the ALU, registers, caches and other devices? Where can I find something about this?
And finally, I want to compare HT to parallelized processes. How does parallelism take advantage of hyper-threading?
So if you know of a book or site that can help, please share it here.
Thanks,
Modern hyperthreading is implemented in a very clever way.
Think about a dual-core processor for a minute. It has two cores, each of which has registers, caches with mechanisms to access memory, and a collection of execution units to perform various integer, floating-point, and control operations.
Now imagine that instead of each core having its own collection of execution units, the two cores just share one pool of execution units. Either core can use, say, a floating point multiplier so long as the other core isn't using that same execution unit. If one core needs an execution unit that is used by the other core, it will have to wait, just as it would have to if that execution unit was used by an overlapping instruction executed by that same thread.
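A rough way to feel this contention yourself: the sketch below (my own illustration, not from any particular reference) starts two threads that do nothing but floating-point multiplies. If the OS schedules them on two sibling logical cores of one hyper-threaded physical core, they will typically finish only somewhat faster than a single thread, because they compete for the shared FP execution units; on two separate physical cores they finish in roughly half the time. Pinning threads to specific logical cores is OS-specific and omitted here.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

volatile double sink;  // keeps the compiler from deleting the work loop

void fp_heavy(long iters) {
    double x = 1.0;
    for (long i = 0; i < iters; ++i)
        x *= 1.0000001;            // floating-point multiplies, back to back
    sink = x;
}

int main() {
    const long iters = 200000000L;
    auto t0 = std::chrono::steady_clock::now();

    std::thread a(fp_heavy, iters);   // two threads hammering the same kind
    std::thread b(fp_heavy, iters);   // of execution unit (the FP multiplier)
    a.join();
    b.join();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("two FP-heavy threads took %lld ms\n", (long long)ms);
    return 0;
}
```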

What happens on hardware when running a parallel program

For example, a machine has two processors and each processor has two cores. I write a parallel program using OpenMP and run it with 3 threads. What happens on the hardware? I think only one processor will deal with the program (is that right?), but I can't imagine how the three threads would be distributed to two cores. Please help. Thanks.
It is almost impossible to answer your question in the scope of an SO "answer"; you will need to read up on the implementation of parallel processing for the particular architecture of your machine if you want the "real" answer. The short answer is "it depends". But your program could be running on both processors, on any or all of the four cores; the key thing to understand is that you can control that to some extent with the structure of your program, but the neat thing about OpenMP is that usually "you ought not to care". If threads are operating concurrently, they will usually get a core each; but if they need to access the same memory space, that may slow you down, since "short term" data likes to live in the processor's (core's) cache, and that means there is a lot of data shuffling going on. You get the greatest performance improvements if different threads DON'T have to share memory.
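For concreteness, here is a minimal OpenMP sketch of the situation in the question (not the asker's actual code): a parallel region forced to 3 threads. Which of the four cores (on either processor) each thread lands on is decided by the OS scheduler; you can influence that with affinity controls such as the OMP_PROC_BIND / OMP_PLACES environment variables, but by default you usually don't need to care.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    omp_set_num_threads(3);  // the 3 threads from the question

    #pragma omp parallel
    {
        // Each thread reports itself; the mapping of threads onto the
        // machine's four cores is left entirely to the OS scheduler.
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Compile with OpenMP enabled (e.g. g++ -fopenmp); tools like taskset, or the OMP_PLACES variable, let you observe or change which cores the threads actually run on.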

HLSL operator/functions cycle count

I'm modeling some algorithms to be run on GPUs. Is there a reference or something as to how many cycles the various intrinsics and calculations take on modern hardware? (NVIDIA 5xx+ series, AMD 6xxx+ series.) I can't seem to find any official word on this, even though there are some mentions of the raised costs of normalization, square root and other functions throughout their documentation. Thanks.
Unfortunately, the cycle count documentation you're looking for either doesn't exist, or (if it does) it probably won't be as useful as you would expect. You're correct that some of the more complex GPU instructions take more time to execute than the simpler ones, but cycle counts are only important when instruction execution time is the main performance bottleneck; GPUs are designed such that this is very rarely the case.
The way GPU shader programs achieve such high performance is by running many (potentially thousands) of shader threads in parallel. Each shader thread generally executes no more than a single instruction before being swapped out for a different thread. In perfect conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation executed by a single thread. If the GPU is doing useful work every cycle, then it's as if every shader instruction executes in a single cycle. In this case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
Under more realistic conditions, when there isn't enough work to keep the GPU fully loaded, the bottleneck is virtually guaranteed to be memory accesses rather than ALU operations. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it's generally not worth worrying about whether sqrt() takes more cycles than dot().
So, the key to maximizing GPU performance isn't to use faster instructions. It's about maximizing occupancy -- that is, making sure there's enough work to keep the GPU sufficiently busy to hide instruction / memory latencies. It's about being smart about your memory accesses, to minimize those agonizing round-trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.
http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false
This is the closest thing I've found so far; it's outdated (SM3), but I guess it's better than nothing.
Do operators/functions even have cycle counts? I know assembly instructions have cycle counts; that's the low-level time measurement, and it mostly depends on the CPU. Since operators and functions are high-level programming constructs, I don't think they have such a measurement.

GPGPU vs. Multicore?

What are the key practical differences between GPGPU and regular multicore/multithreaded CPU programming, from the programmer's perspective? Specifically:
What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
What are the key differences in programming model?
What are the key underlying hardware differences that necessitate any differences in programming model?
Which one is typically easier to use and by how much?
Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?
Interesting question. I have researched this very problem so my answer is based on some references and personal experiences.
What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
Like @Jared mentioned, GPGPUs are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiply, simple Photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread fires a long-latency operation (say a memory access), that thread is put to sleep (and other threads continue to work) until the long-latency operation finishes. This allows GPUs to keep their execution units busy a lot more than traditional cores do.
GPUs are bad at handling branches because GPUs like to batch "threads" (SIMD lanes if you are not nVidia) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g., 2 threads in an 8-thread warp may take the branch while the other 6 may not. Now the warp has to be split into two warps of sizes 2 and 6. If your core has 8 SIMD lanes (which is why the original warp packed 8 threads), the two newly formed warps will run inefficiently: the 2-thread warp at 25% efficiency and the 6-thread warp at 75%. You can imagine that if a GPU keeps encountering nested branches, its efficiency becomes very low. Therefore, GPUs aren't good at handling branches, and hence branchy code should not be run on GPUs.
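To make the divergence scenario concrete, here is an illustrative C++ loop showing the kind of per-element, data-dependent branch that causes the problem. This is just a sketch of the code pattern, not actual GPU code: on a GPU each iteration would map to one SIMD lane of a warp, and lanes that disagree at the if force the hardware to handle the two sides separately, at reduced efficiency.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-element, data-dependent branch: the classic source of warp divergence
// when each element is processed by one SIMD lane on a GPU.
void shade(std::vector<float>& px) {
    for (std::size_t i = 0; i < px.size(); ++i) {
        if (px[i] > 0.5f)                // lanes disagree here -> the warp splits
            px[i] = std::sqrt(px[i]);    // e.g. taken by 2 of the 8 lanes
        else
            px[i] = px[i] * px[i];       // taken by the other 6 lanes
    }
}
```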
GPUs are also bad at co-operative threading. If threads need to talk to each other, GPUs won't work well, because synchronization is not well supported on GPUs (but nVidia is working on it).
Therefore, the worst code for a GPU is code with little parallelism, or code with lots of branches or synchronization.
What are the key differences in programming model?
GPUs don't support interrupts and exceptions. To me that's the biggest difference. Other than that, CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently, but again that's not fundamental to our discussion.
What are the key underlying hardware differences that necessitate any differences in programming model?
I mentioned them already. The biggest is the SIMD nature of GPUs, which requires code to be written in a very regular fashion with no branches and no inter-thread communication. This is part of why, e.g., CUDA restricts the number of nested branches in the code.
Which one is typically easier to use and by how much?
Depends on what you are coding and what your target is.
Easily vectorizable code: the CPU is easier to code for but gives lower performance; the GPU is slightly harder to code for but provides a big bang for the buck.
For everything else, the CPU is easier and often gives better performance as well.
Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.
If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?
Lots of problems in the world are branchy and irregular; there are thousands of examples: graph search algorithms, operating systems, web browsers, etc. Just to add: even graphics is becoming more and more branchy and general-purpose with every generation, so GPUs will become more and more like CPUs. I am not saying they will become just like CPUs, but they will become more programmable. The right model is somewhere in between the power-inefficient CPUs and the very specialized GPUs.
Even in a multi-core CPU, your units of work are going to be much larger than on a GPGPU. GPGPUs are appropriate for problems that scale very well, with each chunk of work being incredibly small. A GPGPU has much higher latency because you have to move data to the GPU's memory system before it can be accessed. However, once the data is there, your throughput, if the problem is appropriately scalable, will be much higher with a GPGPU. In my experience, the problem with GPGPU programming is the latency in getting data from normal memory to the GPGPU.
Also, GPGPUs are horrible at communicating across worker processes if the worker processes don't have a sphere of locality routing. If you're trying to communicate all the way across the GPGPU, you're going to be in a lot of pain. For this reason, standard MPI libraries are a poor fit for GPGPU programming.
Not all processors are designed like GPUs, because GPUs are fantastic at high-latency, high-throughput calculations that are inherently parallel and can be broken down easily. Most of what a CPU is doing is not inherently parallel and does not scale to thousands or millions of simultaneous workers very efficiently. Luckily, graphics programming does, and that's why all this started in GPUs. People have increasingly been finding problems that they can make look like graphics problems, which has led to the rise of GPGPU programming. However, GPGPU programming is only really worth your time if it is appropriate to your problem domain.

Is there a point to multithreading?

I don’t want to make this subjective...
If I/O and other input/output-related bottlenecks are not a concern, do we need to write multithreaded code? Theoretically, single-threaded code will fare better since it gets all the CPU cycles. Right?
Would JavaScript or ActionScript have fared any better, had they been multithreaded?
I am just trying to understand the real need for multithreading.
I don't know if you have paid any attention to trends in hardware lately (the last 5 years), but we are heading into a multicore world.
A general wake-up call was this "The free lunch is over" article.
On a dual-core PC, a single-threaded app will only get half the CPU cycles. And CPUs are not getting faster anymore; that part of Moore's law has died.
In the words of Herb Sutter, the free lunch is over, i.e. the future performance path for computing will be in terms of more cores, not higher clock speeds. The thing is that adding more cores typically does not scale the performance of software that is not multithreaded, and even then it depends entirely on the correct use of multithreaded programming techniques; hence multithreading is a big deal.
Another obvious reason is maintaining a responsive GUI, when e.g. a click of a button initiates substantial computations, or I/O operations that may take a while, as you point out yourself.
The primary reason I use multithreading these days is to keep the UI responsive while the program does something time-consuming. Sure, it's not high-tech, but it keeps the users happy :-)
Most CPUs these days are multi-core. Put simply, that means they have several processors on the same chip.
If you only have a single thread, you can only use one of the cores; the other cores will either idle or be used for other tasks that are running. If you have multiple threads, each can run on its own core. You can divide your problem into X parts, and, assuming each part can run independently, you can finish the calculations in close to 1/Xth of the time it would normally take.
By definition, the fastest algorithm running in parallel will spend at least as much CPU time as the fastest sequential algorithm; that is, parallelizing does not decrease the amount of work required. But the work is distributed across several independent units, leading to a decrease in the real time spent solving the problem. That means the user doesn't have to wait as long for the answer, and they can move on quicker.
Ten years ago, when multi-core was unheard of, it was true: you'd gain nothing if we disregard I/O delays, because there was only one unit doing the execution. However, the race to increase clock speeds has stopped; instead we're looking at multi-core to increase the amount of computing power available. With companies like Intel looking at 80-core CPUs, it becomes more and more important that you look at parallelization to reduce the time spent solving a problem. If you only have a single thread, you can only use that one core, and the other 79 cores will be doing something else instead of helping you finish sooner.
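As an illustration of "divide your problem into X parts" from above, here is a minimal C++ sketch (the names are my own, not from the thread): a big array sum split into one chunk per hardware thread.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(10000000, 1.0);              // the work to be split
    unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> partial(n, 0.0);                   // one result per thread
    std::vector<std::thread> workers;

    std::size_t chunk = data.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == n) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            // Each part runs independently, ideally on its own core.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::printf("sum = %.0f using %u threads\n", total, n);
    return 0;
}
```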
Much of the multithreading that's done is just to make the programming model easier when doing blocking operations while maintaining concurrency in the program; sometimes languages/libraries/APIs give you little other choice, or the alternatives make the programming model too hard and error-prone.
Other than that, the main benefit of multithreading is to take advantage of multiple CPUs/cores; one thread can only run on one processor/core at a time.
No. You can't keep gaining new CPU cycles, because they exist on a different core, and the core your single-threaded app runs on is not going to get any faster. A multi-threaded app, on the other hand, will benefit from another core. Well-written parallel code can run up to about 95% faster on a dual core, which is what virtually all new CPUs of the last five years are, and roughly double that again on a quad core. So while your single-threaded app isn't getting any more cycles than it did five years ago, my quad-threaded app has four times as many and is vastly outstripping yours in terms of response time and performance.
Your question would be valid if we only had single cores. The thing is, though, we mostly have multicore CPUs these days. If you have a quad core and write a single-threaded program, there will be three cores that your program doesn't use.
So actually you will have at most 25% of the CPU cycles, not 100%. Since the trend today is to add more cores rather than higher clock speeds, threading will become more and more crucial for performance.
That's kind of like asking whether a screwdriver is necessary if I only need to drive this nail. Multithreading is another tool in your toolbox to be used in situations that can benefit from it. It isn't necessarily appropriate in every programming situation.
Here are some answers:
You write "If input/output related problems are not bottlenecks...". That's a big "if". Many programs do have issues like that (remember that networking counts as I/O), and in those cases multithreading is clearly worthwhile. If you are writing one of those rare apps that does no I/O and no communication, then multithreading might not be an issue.
"The single-threaded code will get all the CPU cycles." Not necessarily. Multi-threaded code might well get more cycles than a single-threaded app; these days an app is hardly ever the only one running on a system.
Multithreading allows you to take advantage of multicore systems, which are becoming almost universal these days.
Multithreading allows you to keep a GUI responsive while some action is taking place. Even if you don't want two user-initiated actions to be taking place simultaneously you might want the GUI to be able to repaint and respond to other events while a calculation is taking place.
So in short, yes there are applications that don't need multithreading, but they are fairly rare and becoming rarer.
First, modern processors have multiple cores, so a single thread will never get all the CPU cycles.
On a dual-core system, a single thread will utilize only half the CPU. On an 8-core CPU, it'll use only 1/8th.
So from a plain performance point of view, you need multiple threads to utilize the CPU.
Beyond that, some tasks are also easier to express using multithreading.
Some tasks are conceptually independent, and so it is more natural to code them as separate threads running in parallel than to write a single-threaded application which interleaves the two tasks and switches between them as necessary.
For example, you typically want the GUI of your application to stay responsive, even if pressing a button starts some CPU-heavy work process that might go for several minutes. In that time, you still want the GUI to work. The natural way to express this is to put the two tasks in separate threads.
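Here is a toolkit-agnostic sketch of that pattern (illustrative only; real GUI frameworks supply their own event loop and a mechanism for posting results back to the UI thread): the heavy work runs on a worker thread via std::async, while the "UI loop" keeps going.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

long heavy_computation() {                       // stands in for the long job
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return 42;
}

int main() {
    // Launch the CPU-heavy work on a separate thread.
    auto result = std::async(std::launch::async, heavy_computation);

    // The "GUI loop": keeps running (repainting, handling clicks)
    // while the worker is still busy.
    while (result.wait_for(std::chrono::milliseconds(200)) !=
           std::future_status::ready) {
        std::printf("UI still responsive...\n");
    }

    std::printf("done, result = %ld\n", result.get());
    return 0;
}
```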
Most of the answers here make the conclusion multicore => multithreading look inevitable. However, there is another way of utilizing multiple processors: multi-processing. On Linux especially, where (AFAIK) threads are implemented as just processes, perhaps with some restrictions, and processes are cheap compared to Windows, there are good reasons to avoid multithreading. So there are software architecture issues here that should not be neglected.
Of course, if the concurrent lines of execution (either threads or processes) need to operate on common data, threads have an advantage. But this is also the main source of headaches with threads. Can such a program be designed so that the pieces are as autonomous and independent as possible, letting us use processes? Again, a software architecture issue.
I'd speculate that multi-threading today is what memory management was in the days of C:
It's quite hard to do right, and quite easy to mess up.
Thread-safety bugs, like memory leaks, are nasty and hard to find.
Finally, you may find this article interesting (follow the first link on the page). I admit that I've only read the abstract, though.

Resources