Plot speedup curve vs number of OpenMP threads - scalability?

I am working on a C++ code which uses OpenMP threads. I have plotted the speedup curve versus the number of OpenMP threads, together with the theoretical curve (for the case where the code could be fully parallelized).
Here is the plot (speedup vs. number of threads, compared with the ideal linear curve):
From this plot, can we say this code is not scalable (from a parallelization point of view)? i.e. the code does not run twice as fast with 2 OpenMP threads, four times as fast with 4 threads, etc.?
Thanks

For code that barely achieves a 2.5x speedup on 16 threads, it is fair to say that it does not scale. However, "is not scalable" is often considered a stronger statement. The difference, as I understand it, is that "does not scale" typically refers to a particular implementation and does not imply an inherent inability to scale; in other words, maybe you can make it scale if the bottlenecks are eliminated. On the other hand, "is not scalable" usually means "you cannot make it scale, at least not without changing the core algorithm". Assuming that meaning, one cannot say "a problem/code/algorithm is not scalable" just by looking at a chart.
On an additional note, it's not always reasonable to expect perfect scaling (2x with 2 threads, 4x with 4 threads, etc.). A curve that is "close enough" to ideal scaling might still be considered to show good scalability, and what "close enough" means may depend on a number of factors. It can be useful to think in terms of parallel efficiency, rather than speedup, when scalability is in question. For example, if parallel efficiency is 0.8 (or 80%) and does not drop as the number of threads increases, that could be considered good scalability. Also, it's possible for a program to scale well up to a certain number of threads, but to remain flat or even slow down when more resources are added.
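For illustration, here is a minimal sketch (the work() kernel is just a placeholder, and the thread counts are arbitrary) of how one could measure the numbers behind such a plot and report parallel efficiency alongside speedup:

#include <cstdio>
#include <omp.h>

double work() {                          // placeholder for the real parallel workload
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; ++i)
        sum += 1.0 / (1.0 + i);
    return sum;
}

int main() {
    double t1 = 0.0;
    for (int n = 1; n <= 16; n *= 2) {
        omp_set_num_threads(n);
        double start = omp_get_wtime();
        work();
        double elapsed = omp_get_wtime() - start;
        if (n == 1) t1 = elapsed;
        double speedup    = t1 / elapsed;           // ideal: equal to n
        double efficiency = speedup / n;            // ideal: 1.0 (100%)
        std::printf("%2d threads: speedup %.2fx, efficiency %.0f%%\n",
                    n, speedup, efficiency * 100.0);
    }
}

If efficiency stays roughly constant as n grows, the code scales well over that range; if it collapses, the curve flattens out like the one in the question.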

Why emulate for certain number of cycles?

I have seen, in more than one place, the following way of emulating, i.e. a cycle count is passed into the emulate function:
int CPU_execute(int cycles) {
    int cycle_count = cycles;
    do {
        /* OPCODE execution here: each executed opcode
           subtracts its cycle cost from cycle_count */
    } while (cycle_count > 0);
    return cycles - cycle_count;  /* cycles actually executed (may exceed the request) */
}
I am having a hard time understanding why you would take this approach to emulation, i.e. why would you emulate for a certain number of cycles? Can you give some scenarios where this approach is useful.
Any help is heartily appreciated !!!
Emulators tend to be interested in fooling the software written for multi-chip devices; in terms of the Z80 and the best-selling devices built around it, you're probably talking about at least a graphics chip and a sound chip in addition to the CPU.
In the real world those chips all act concurrently. There'll be some bus logic to allow them all to communicate but they're otherwise in worlds of their own.
You don't normally run emulation of the different chips as concurrent processes because the cost of enforcing synchronisation events is too great, especially in the common arrangement where something like the same block of RAM is shared between several of the chips.
So instead the most basic approach is to cooperatively multitask the different chips — run the Z80 for a few cycles, then run the graphics chip for the same amount of time, etc, ad infinitum. That's where the approach of running for n cycles and returning comes from.
It's usually not an accurate way of reproducing the behaviour of a real computer bus but it's easy to implement and often you can fool most software.
In the specific code you've posted the author has further decided that the emulation will round the number of cycles up to the end of the next whole instruction. Again that's about simplicity of implementation rather than being anything to do with the actual internals of a real machine. The number of cycles actually run for is returned so that other subsystems can attempt to adapt.
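For concreteness, here is a minimal sketch of that cooperative arrangement (the Chip interface and the slice size are made up, not taken from any particular emulator):

struct Chip {
    virtual int run(int cycles) = 0;   // run for roughly this many cycles,
                                       // return the number actually consumed
    virtual ~Chip() = default;
};

void emulate_frame(Chip& cpu, Chip& video, Chip& audio, int cycles_per_frame) {
    const int slice = 224;             // arbitrary granularity, e.g. one scanline's worth
    int elapsed = 0;
    while (elapsed < cycles_per_frame) {
        int ran = cpu.run(slice);      // may overshoot to finish a whole opcode
        video.run(ran);                // let the other chips catch up by the same amount
        audio.run(ran);
        elapsed += ran;
    }
}

The smaller the slice, the closer the emulation gets to real bus behaviour, at the cost of more scheduling overhead.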
Since you mentioned the Z80, I happen to know just the perfect example of a platform where this kind of precise emulation is sometimes necessary: the ZX Spectrum. The standard graphics output area on the ZX Spectrum was a box of 256 x 192 pixels situated in the centre of the screen, surrounded by a fairly wide "border" area filled with a solid color. The color of the border was controlled by outputting a value to a special output port. The designers' idea was that one would simply choose the border color most appropriate to what was happening on the main screen.
The ZX Spectrum did not have a precision timer. But programmers quickly realised that the "rigid" (by modern standards) timings of the Z80 allowed one to do drawing that was synchronised with the movement of the monitor's beam. On the ZX Spectrum one could wait for the interrupt produced at the beginning of each frame and then literally count the precise number of cycles necessary to achieve various effects. For example, a single full scanline on the ZX Spectrum was scanned in 224 cycles. Thus, one could change the border color every 224 cycles and generate pixel-thick lines on the border.
The graphics capability of the ZX Spectrum was limited in the sense that the screen was divided into 8x8 blocks of pixels, each of which could use only two colors at any given time. Programmers overcame this limitation by changing those two colors every 224 cycles, effectively increasing the color resolution 8-fold.
I can see that the discussion under another answer focuses on whether one scanline is a sufficiently accurate resolution for an emulator. Well, some of the border scroller effects I've seen on the ZX Spectrum are, basically, timed to a single Z80 cycle. An emulator that wants to reproduce the correct output of such code would also have to be precise to a single machine cycle.
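To make that concrete, here is a minimal sketch (my own simplification, not taken from a real ZX Spectrum emulator) of how the CPU emulation could stamp border-port writes with a cycle count so that the video code can reproduce such effects per scanline, or even per cycle:

#include <cstdint>
#include <vector>

struct BorderEvent {
    int     frame_cycle;   // T-state within the current frame at which the write happened
    uint8_t color;         // low 3 bits of the value written to the border port
};

struct BorderTracker {
    static const int CYCLES_PER_SCANLINE = 224;
    std::vector<BorderEvent> events;

    void on_port_write(int frame_cycle, uint8_t value) {
        events.push_back({frame_cycle, static_cast<uint8_t>(value & 0x07)});
    }
    static int scanline_of(const BorderEvent& e) {   // which border line the change lands on
        return e.frame_cycle / CYCLES_PER_SCANLINE;
    }
};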
If you want to sync your processor with other hardware, it can be useful to do it like that. For instance, if you want to sync it with a timer, you will want to control how many cycles can pass before the timer interrupts the CPU.
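As a minimal sketch of that idea (the Timer struct and function names are made up; CPU_execute is the function from the question), the loop never lets the CPU run past the timer's next deadline:

int CPU_execute(int cycles);           // the emulate function from the question

struct Timer {
    int period;                        // cycles between interrupts
    int countdown;                     // cycles until the next interrupt
    bool tick(int cycles_ran) {        // advance; returns true when the timer fires
        countdown -= cycles_ran;
        if (countdown <= 0) { countdown += period; return true; }
        return false;
    }
};

void run_with_timer(Timer& timer, long long total_cycles) {
    while (total_cycles > 0) {
        int ran = CPU_execute(timer.countdown);   // budget: cycles until the deadline
        if (timer.tick(ran)) {
            /* deliver the timer interrupt to the emulated CPU here */
        }
        total_cycles -= ran;
    }
}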

Quickest and easiest algorithm for comparing the frequency content of two sounds

I want to take two sounds that each contain a dominant frequency and say 'this one is higher than that one'. I could do an FFT, find the frequency with the greatest amplitude in each, and compare them. I'm wondering whether, since I have such a specific task, there might be a simpler algorithm.
The sounds are quite dirty with many frequencies, but contain a clear dominant pitch. They aren't perfectly produced sine waves.
Given that the sounds are quite dirty, I would suggest starting to develop the algorithm with the output of an FFT as it'll be much simpler to diagnose any problems. Then when you're happy that it's working you can think about optimising/simplifying.
As a rule of thumb when developing this kind of numeric algorithm, I always try to operate in the most relevant domain first (in this case you're interested in frequencies, so analyse in frequency space), and once everything is behaving itself, consider shortcuts/optimisations. That way you can test the latter solution against the best-performing former.
In the general case, decent pitch detection/estimation requires a more sophisticated algorithm than looking at FFT peaks, not a simpler one.
There are a variety of pitch detection methods, ranging in sophistication from counting zero-crossings (which obviously won't work in your case) to extremely complex algorithms.
While the frequency-domain methods seem most appropriate, it's not as simple as "taking the FFT". If your data is very noisy, you may have spurious peaks that are higher than what you would consider the dominant frequency. One solution is to window overlapping segments of your signal, take STFTs, and average the results. But this raises more questions: how big should the windows be? In this case, it depends on how far apart you expect the dominant peaks to be, how long your recordings are, etc. (Note: FFT methods can resolve frequency to better than one bin size by taking phase information into account; in that case, you would have to do something more complex than averaging all your FFT windows together.)
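To illustrate that approach, here is a minimal sketch (assuming mono samples at a known sample rate; a naive DFT is used for clarity where a real implementation would call an FFT library) of averaging windowed magnitude spectra and picking the strongest bin:

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Average the magnitude spectrum over overlapping windows, then return the
// frequency (in Hz) of the strongest bin.
double dominantFrequency(const std::vector<double>& samples,
                         double sampleRate, std::size_t windowSize = 1024) {
    const double PI = 3.141592653589793;
    std::vector<double> avgMag(windowSize / 2, 0.0);
    std::size_t windows = 0;
    for (std::size_t start = 0; start + windowSize <= samples.size();
         start += windowSize / 2) {                        // 50% overlap
        for (std::size_t k = 1; k < windowSize / 2; ++k) { // skip the DC bin
            std::complex<double> acc(0.0, 0.0);
            for (std::size_t n = 0; n < windowSize; ++n) {
                double angle = -2.0 * PI * double(k * n) / double(windowSize);
                acc += samples[start + n] * std::polar(1.0, angle);
            }
            avgMag[k] += std::abs(acc);
        }
        ++windows;
    }
    if (windows > 0)
        for (double& m : avgMag) m /= double(windows);
    std::size_t peak = 1;
    for (std::size_t k = 2; k < avgMag.size(); ++k)
        if (avgMag[k] > avgMag[peak]) peak = k;
    return double(peak) * sampleRate / double(windowSize);
}

// Usage: which of two recordings has the higher dominant pitch?
// bool aIsHigher = dominantFrequency(a, 44100.0) > dominantFrequency(b, 44100.0);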
Another approach would be a time-domain method, such as YIN:
http://recherche.ircam.fr/equipes/pcm/cheveign/pss/2002_JASA_YIN.pdf
Wikipedia discusses some more methods:
http://en.wikipedia.org/wiki/Pitch_detection_algorithm
You can also explore some more methods in chapter 9 of this book:
http://www.amazon.com/DAFX-Digital-Udo-ouml-lzer/dp/0471490784
You can get MATLAB source code for YIN from chapter 9 of that book here:
http://www2.hsu-hh.de/ant/dafx2002/DAFX_Book_Page_2nd_edition/matlab.html

Could a Cray XK6 run a real-time raytracer?

I heard about Cray's new supercomputer, the XK6, today, but I am a little confused about where the bottlenecks are. Is it the interconnect? Could an XK6 configured with, say, 500,000 16-core processors achieve graphical fidelity comparable to Toy Story 3 in real time? By "real time", I mean 60 fps, or around 16.7 milliseconds per frame.
No. Pure computation is surprisingly little of what it takes to render a film frame from Toy Story 3 or a similar modern animated (or VFX) film. Those scenes may reference many hundreds of GB of texture, and even if you could know exactly which subset of that texture will be needed for a frame, it may be tens of GB, which still needs to be read from disk and/or transferred over a network. GPUs or massively parallel distributed computation don't speed that up. Furthermore, rendering is only the very last step; preparing the geometric input for a frame (simulating the fluids, cloth and hair, tessellating the geometry, reading and interpreting large scenes from disk) can itself be substantial.
So, just pulling numbers out of the air (but these are moderately realistic), say it takes 30 minutes to prepare the scene (load the models, tessellate them, run some minor sims, etc.) and 4.5 hours to render, of which, say, 30 minutes is reading textures and other resources from disk, leaving 4 hours of "ray tracing" and other computation. If the XK6 made the ray tracing infinitely fast, it would only speed the total process up by 5x (1 hour of hard-to-parallelize prep and I/O remains). That's Amdahl's Law for you.
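The arithmetic behind that 5x, written out as a tiny Amdahl's-law check (using the made-up numbers above):

#include <cstdio>

int main() {
    const double total_hours  = 0.5 + 4.5;           // prep + render
    const double serial_hours = total_hours - 4.0;    // prep, I/O, texture reads
    // Amdahl's law with the parallel part (ray tracing) made infinitely fast:
    const double best_speedup = total_hours / serial_hours;
    std::printf("Best possible speedup: %.1fx\n", best_speedup);  // prints 5.0x
}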
Now, you're probably asking yourself, "how do games go so fast?" They do it in two ways: (1) they drastically reduce the data set (texture size, geometric resolution, etc.) to make it all fit on the GPU and be reasonably fast to load levels (which, curiously, you the user are not counting when you think of the rendering as happening in "real time"); (2) they spare no expense in computation, tricks, and human labor to optimize the scenes and algorithms before they ship the disks, so that when it's in front of the player it can render quickly.
So, in summary, if you are asking if the total computational power of the XK6 is enough to compute in real time all the pure math required to render a film frame, then yes, it probably is. But if you are asking if an XK6 could actually render the movie in realtime given the kinds of inputs the renderer needs, then no, it couldn't. Would an XK6 be of any use to people rendering those movie frames? No, it probably wouldn't be worth the trouble of reprogramming all the software (hundreds of man years) from the ground up.
Looking at it from another viewpoint, users generally render only one scene at a time, then make small changes and re-render over and over again. To render only a single scene in realtime may still require several GB of textures loaded onto the GPU's RAM.
Could a supercomputer like those from Cray, using a "sea of cores" or a vast array of modern CPUs, perform the job in real time? Yes, for simple enough scenes. More complex scenes that require 100+ rays per pixel at 8 MP (4K x 2K for movies; 2 MP for DSLR/indie-type movies), along with lots of objects, shadows, haze, refraction, diffuse lighting sources, etc., would probably require too many computations, even at 24 fps.
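A quick back-of-the-envelope calculation with those numbers (my arithmetic, not a benchmark) gives a sense of the ray budget alone:

#include <cstdio>

int main() {
    const double pixels         = 8.0e6;    // roughly 4K x 2K
    const double rays_per_pixel = 100.0;
    const double frames_per_sec = 24.0;
    // ~1.9e10 rays per second, before shading, texture access or scene traversal.
    std::printf("Rays per second: %.2e\n", pixels * rays_per_pixel * frames_per_sec);
}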

Massively parallel application: what about many 8-bit cores for non-vector AI applications?

I was thinking (oh god, it starts badly) about neural networks and how it is not possible to simulate them, because they require many atomic operations at the same time (here meaning simultaneously), since that is exactly what makes neurons fast: there are a great many of them computing in parallel.
Our processors are 32-bit, so they can work on much wider operands (meaning a much larger range of atomic values, whether floating point or integer); meanwhile the frequency race is over, and manufacturers have started shipping multicore processors, requiring developers to implement multithreading in their applications.
I was also thinking about the most important difference between computers and brains: brains use a huge number of neurons, while computers rely on precision at a high frequency. That is why it seems harder, or impossible, to simulate a real-time AI with the current processor model.
Since 32-bit/64-bit chips also use a great number of transistors, and since AI doesn't require vector/floating-point precision, would it be a good idea to have many more 8-bit cores on a single processor, like 100 or 1000 of them, since they take much less room (I don't work at Intel or AMD, so I don't know how they design their processors; it's just a wild guess), to prepare for this kind of AI simulation?
I don't think it would serve only AI research, though: I don't know how web servers can really take advantage of 64-bit processors (strings use 8 bits), and Xeon processors differ only in their cache size.
What you describe is already available by means of multimedia instruction sets. It turns out that computer graphics also needs many parallel operations on bytes or even half-bytes, so CPUs started growing vector operations (MMX, SSE, etc.); more recently, graphics processors have opened up to general-purpose computing (GPGPU).
I think you are mistaken in assuming that neuronal processing is not a vector operation: many AI neural networks rely heavily on vector and matrix operations.
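As a minimal sketch of why (the layer sizes and tanh activation are arbitrary choices), a fully connected layer is just a matrix-vector product plus a pointwise nonlinearity, which is exactly the kind of loop that SSE/AVX units and GPUs are built to run:

#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> denseLayer(const std::vector<float>& weights,  // out*in values, row-major
                              const std::vector<float>& bias,     // out values
                              const std::vector<float>& input) {  // in values
    const std::size_t out = bias.size();
    const std::size_t in  = input.size();
    std::vector<float> activation(out);
    for (std::size_t i = 0; i < out; ++i) {
        float sum = bias[i];
        for (std::size_t j = 0; j < in; ++j)    // dot product: trivially vectorizable
            sum += weights[i * in + j] * input[j];
        activation[i] = std::tanh(sum);         // pointwise nonlinearity
    }
    return activation;
}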

When using Direct3D, how much math is being done on the CPU?

Context: I'm just starting out. I'm not even touching the Direct3D 11 API, and instead looking at understanding the pipeline, etc.
From looking at documentation and information floating around the web, it seems like some calculations are being handled by the application. That is, instead of simply handing the GPU matrices to multiply, the calculations are being done by a math library that operates on the CPU. I don't have any particular resources to point to, although I guess I can point to the XNA Math Library or the samples shipped in the February DX SDK. When you see code like mViewProj = mView * mProj;, that product is being calculated on the CPU. Or am I wrong?
If you were writing a program where you can have 10 cubes on the screen, and you can move or rotate the cubes as well as the viewpoint, what calculations would you do on the CPU? I think I would store the geometry for a single cube, plus transform matrices representing the actual instances. Then it seems I would use the XNA math library, or another of my choosing, to transform each cube from model space, get the coordinates in world space, and push the information to the GPU.
That's quite a bit of calculation on the CPU. Am I wrong?
Am I reaching conclusions based on too little information and understanding?
What terms should I Google for, if the answer is STFW?
Or if I am right, why aren't these calculations being pushed to the GPU as well?
EDIT: By the way, I am not using XNA, but the documentation notes that the XNA Math Library replaces the previous DX math library. (I see the XNA Math Library in the SDK as essentially a header-only template library.)
"Am I reaching conclusions based on too little information and understanding?"
Not as a bad thing, as we all do it, but in a word: Yes.
What is being done by the GPU is, generally, dependent on the GPU driver and your method of access. Most of the time you really don't care or need to know (other than curiosity and general understanding).
For mViewProj = mView * mProj;, this most likely happens on the CPU. But it is not much of a burden (hundreds of cycles at the most). The real work is applying the new view matrix to the "world": every vertex needs to be transformed, more or less, along with shading, texturing, lighting, etc. All of this work is done on the GPU (if done on the CPU, things slow down really fast).
Generally you make high-level changes to the world, maybe 20 CPU-bound calculations, and the GPU takes care of the millions or billions of calculations needed to render the world based on those changes.
In your 10-cube example: you supply a transform for each cube; any math needed to create that transform is CPU-bound (with exceptions). You also supply a transform for the view; again, creating that transform matrix might be CPU-bound. Once you have your 11 new matrices, you apply them to the world. From a hardware point of view, the 11 matrices need to be copied to the GPU... that happens very, very fast... once copied, the CPU is done, and the GPU recalculates the world based on the new data, renders it to a buffer, and pops it onto the screen. So for your 10 cubes the CPU-bound calculations are trivial.
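As a minimal sketch of the 10-cube case (using DirectXMath, the successor to the XNA Math headers; the XM* calls are real, but the cube data, angles and layout here are made up), the CPU-side math is just a handful of small matrix products per frame:

#include <DirectXMath.h>
using namespace DirectX;

void BuildCubeTransforms(const float cubeAngles[10],        // per-cube rotation (radians)
                         const XMFLOAT3 cubePositions[10],  // per-cube position
                         const XMMATRIX& view,
                         const XMMATRIX& proj,
                         XMFLOAT4X4 outWorldViewProj[10]) {
    const XMMATRIX viewProj = XMMatrixMultiply(view, proj); // mView * mProj, on the CPU
    for (int i = 0; i < 10; ++i) {
        const XMMATRIX world =
            XMMatrixMultiply(XMMatrixRotationY(cubeAngles[i]),
                             XMMatrixTranslation(cubePositions[i].x,
                                                 cubePositions[i].y,
                                                 cubePositions[i].z));
        // This combined matrix is what would be copied to the GPU, e.g. into a constant buffer.
        XMStoreFloat4x4(&outWorldViewProj[i], XMMatrixMultiply(world, viewProj));
    }
}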
Look at some reflected code for an XNA project and you will see where your calculations end and XNA's begin (XNA will do everything it possibly can on the GPU).
