What are the calculations or assumptions behind the core frequency used in supercomputers? - linux

What calculations tell us that a particular core frequency should be used for a given job, such as weather forecasting or solving critical equations, i.e. the kind of work supercomputers do?

Core frequency is just one aspect that governs the speed of a computer; others include cache sizes and speeds, inter-core and inter-module communication speeds, etc.
Today's supercomputers use regular CPUs, such as Xeon processors. The difference between a supercomputer and a regular desktop is the number of CPUs and the interconnections between the different CPUs and memory areas.
Modern CPUs have a lot of caching and branch prediction, which makes it hard to calculate the number of clock cycles required for a certain algorithm.

Related

Propagational delay in circuits

Which is better for an accurate propagation delay: SPICE simulation, or calculation using the Elmore delay (RC delay modeling)?
SPICE simulation is more accurate than Elmore delay modelling. This is discussed in the book CMOS VLSI Design by Weste and Harris on page 93, Section 2.6:
Blindly trusting one’s models
Models should be viewed as only approximations to reality, not reality itself, and used within
their limitations. In particular, simple models like the Shockley or RC models aren’t even close
to accurate fits for the I-V characteristics of a modern transistor. They are valuable for the
insight they give on trends (i.e., making a transistor wider increases its gate capacitance and
decreases its ON resistance), not for the absolute values they predict. Cutting-edge projects
often target processes that are still under development, so these models should only be
viewed as speculative. Finally, processes may not be fully characterized over all operating regimes;
for example, don’t assume that your models are accurate in the subthreshold region
unless your vendor tells you so. Having said this, modern SPICE models do an extremely good
job of predicting performance well into the GHz range for well-characterized processes and
models when using proper design practices (such as accounting for temperature, voltage, and
process variation).
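For comparison with a SPICE run, here is a minimal sketch of what the Elmore estimate computes for a simple RC ladder (series resistors with a capacitor to ground at each node). The function name and array layout are illustrative assumptions, not taken from the book. In line with the quote above, treat the result as a first-order time constant that is useful for trends, not as an accurate absolute delay.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Elmore delay of an RC ladder, as a first-order estimate.
 * r[i] is the resistance between node i-1 and node i (ohms),
 * c[i] is the capacitance from node i to ground (farads).
 * Each capacitor is charged through all the resistance upstream of it,
 * so tau = sum over i of (r[0] + ... + r[i]) * c[i].
 */
double elmore_delay(const double *r, const double *c, size_t n)
{
    assert(r != NULL && c != NULL);
    double upstream_r = 0.0;  /* resistance from the driver to node i */
    double tau = 0.0;         /* accumulated time constant, seconds */
    for (size_t i = 0; i < n; i++) {
        upstream_r += r[i];
        tau += upstream_r * c[i];
    }
    return tau;
}
```

With three 1 kOhm segments and 1 pF per node, this gives (1 + 2 + 3) kOhm x 1 pF = 6 ns.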

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU only carries audio output and can't generate it? Would this mean a large increase in latency, if I were to read the data back on the CPU and output it to the soundcard? I'm looking for a latency between 10 and 20ms in terms of the time between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could take up their own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious, for example, that computing sine values for 128 samples of output from a sine source could be done completely in parallel. Working in blocks of that size would permit a latency of only about 3ms, which is acceptable in most digital audio applications. Similarly, many other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
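To make the "each sample is independent" point concrete, here is a rough CPU-side sketch in plain C. It assumes a 44.1 kHz sample rate (not stated above) and renders a naive sawtooth so that it needs no math library; a sine source would apply sin(2*pi*phase) to the same per-sample phase. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Latency contributed by one block of audio, in milliseconds. */
double block_latency_ms(size_t block_size, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * (double)block_size / sample_rate_hz;
}

/*
 * Fill out[0..n-1] with a naive sawtooth in [-1, 1).
 * Sample i is computed from its index alone, so every loop iteration
 * is independent of the others; on a GPU each one could be handed to
 * its own work-item.
 */
void render_saw_block(float *out, size_t n, size_t first_sample,
                      double freq_hz, double sample_rate_hz)
{
    assert(out != NULL && sample_rate_hz > 0.0 && freq_hz >= 0.0);
    for (size_t i = 0; i < n; i++) {
        /* number of oscillator cycles elapsed at this sample */
        double cycles = freq_hz * (double)(first_sample + i) / sample_rate_hz;
        /* fractional part; valid because cycles is never negative */
        double phase = cycles - (double)(long long)cycles;
        out[i] = (float)(2.0 * phase - 1.0);
    }
}
```

Under the 44.1 kHz assumption, block_latency_ms(128, 44100.0) comes out at roughly 2.9 ms, which matches the "about 3ms" figure.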
In addition to oscillators, FIR filters are simple to parallelize, and a google search turned up some promising looking research papers (which I didn't take the trouble to read) that suggest that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing and many useful audio operations can be understood as such filters.
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on, which are fundamentally stream-based operations. These are the type of operations where a GPU might also excel.
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like, and furthermore, the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of a GPU.
The effort involved in using a GPU, particularly getting data in and out, is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers, and also tablets and mobile devices, are more than enough for many digital audio applications. AMD seem to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that only targets a limited sub-set of hardware.
Typical transfer times for a few MB to/from the GPU are around 50us.
Delay is not your problem; however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the copying of data.
If you are going to run multiple synthesizers at once, I would recommend running each synthesizer in its own work-group and parallelizing the synthesis process across the work-items available. It will not be worthwhile to have each synthesizer in one work-item, since it is unlikely you will have thousands of them.
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP for its lower initialization times.
You could check out the NESS project, which is all about physical modelling synthesis. They are using GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound, and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis because it is so demanding of processing.

Why emulate for certain number of cycles?

I have seen, in more than one place, the following way of emulating,
i.e. the number of cycles is passed into the emulate function:
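int CPU_execute(int cycles) {
    int cycle_count = cycles;   /* cycles left in this time slice */
    do {
        /* OPCODE execution here; each opcode is expected to subtract
           its own cost from cycle_count, otherwise the loop never ends */
    } while (cycle_count > 0);
    /* cycle_count can end up negative, so the result may exceed cycles */
    return cycles - cycle_count;
}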
I am having a hard time understanding why you would take this approach to emulation, i.e. why would you emulate for a certain number of cycles? Can you give some scenarios where this approach is useful?
Any help is heartily appreciated!
Emulators tend to be interested in fooling the software written for multiple chip devices — in terms of the Z80 and the best selling devices you're probably talking about at least a graphics chip and a sound chip in addition to the CPU.
In the real world those chips all act concurrently. There'll be some bus logic to allow them all to communicate but they're otherwise in worlds of their own.
You don't normally run emulation of the different chips as concurrent processes because the cost of enforcing synchronisation events is too great, especially in the common arrangement where something like the same block of RAM is shared between several of the chips.
So instead the most basic approach is to cooperatively multitask the different chips — run the Z80 for a few cycles, then run the graphics chip for the same amount of time, etc, ad infinitum. That's where the approach of running for n cycles and returning comes from.
It's usually not an accurate way of reproducing the behaviour of a real computer bus but it's easy to implement and often you can fool most software.
In the specific code you've posted the author has further decided that the emulation will round the number of cycles up to the end of the next whole instruction. Again that's about simplicity of implementation rather than being anything to do with the actual internals of a real machine. The number of cycles actually run for is returned so that other subsystems can attempt to adapt.
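As a rough sketch of that cooperative arrangement (not how any particular emulator does it): the Cpu and VideoChip types, the fixed 4-cycle opcode cost and the function names below are all hypothetical stand-ins. The point is only that the CPU reports how many cycles it really consumed and the other chip is then advanced by that same amount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two emulated chips. */
typedef struct { long total_cycles; } Cpu;
typedef struct { long total_cycles; } VideoChip;

enum { CYCLES_PER_OPCODE = 4 };  /* assumed fixed cost, for illustration only */

/* Run whole instructions until at least `cycles` have elapsed;
   return the number actually consumed (which may overshoot). */
int cpu_execute(Cpu *cpu, int cycles)
{
    assert(cpu != NULL && cycles > 0);
    int cycle_count = cycles;
    do {
        /* a real emulator would fetch, decode and execute an opcode here */
        cycle_count -= CYCLES_PER_OPCODE;
    } while (cycle_count > 0);
    cpu->total_cycles += cycles - cycle_count;
    return cycles - cycle_count;
}

/* Advance the video chip by exactly the given number of cycles. */
void video_run(VideoChip *video, int cycles)
{
    assert(video != NULL);
    video->total_cycles += cycles;
}

/* One cooperative time slice: the CPU goes first, then the other chip
   catches up by however many cycles the CPU really used. */
void run_slice(Cpu *cpu, VideoChip *video, int slice_cycles)
{
    int used = cpu_execute(cpu, slice_cycles);
    video_run(video, used);
}
```

Asking for a 10-cycle slice here runs three 4-cycle opcodes, so 12 cycles are reported and the video chip is advanced by 12 rather than 10. That is the "round up to the next whole instruction" behaviour described above.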
Since you mentioned the Z80, I happen to know just the perfect example of a platform where this kind of precise emulation is sometimes necessary: the ZX Spectrum. The standard graphics output area on the ZX Spectrum was a box of 256 x 192 pixels situated in the centre of the screen, surrounded by a fairly wide "border" area filled with a solid color. The color of the border was controlled by outputting a value to a special output port. The computer designer's idea was that one would simply choose the border color most appropriate to what is happening on the main screen.
ZX Spectrum did not have a precision timer. But programmers quickly realised that the "rigid" (by modern standards) timings of z80 allowed one to do drawing that was synchronised with the movement of the monitor's beam. On ZX Spectrum one could wait for the interrupt produced at the beginning of each frame and then literally count the precise number of cycles necessary to achieve various effects. For example, a single full scanline on ZX Spectrum was scanned in 224 cycles. Thus, one could change the border color every 224 cycles and generate pixel-thick lines on the border.
The graphics capacity of the ZX Spectrum was limited in the sense that the screen was divided into 8x8 blocks of pixels, each of which could only use two colors at any given time. Programmers overcame this limitation by changing these two colors every 224 cycles, hence, effectively, increasing the color resolution 8-fold.
I can see that the discussion under another answer focuses on whether one scanline may be a sufficiently accurate resolution for an emulator. Well, some of the border scroller effects I've seen on the ZX Spectrum are, basically, timed to a single Z80 cycle. An emulator that wants to reproduce the correct output of such code would also have to be precise to a single machine cycle.
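To illustrate why the cycle count matters here, the sketch below converts "cycles elapsed since the frame interrupt" into a beam position using only the 224-cycles-per-scanline figure quoted above. The type and function names are made up, and the sketch deliberately ignores where the visible area starts within the frame and any other machine-specific timing details.

```c
#include <assert.h>

enum { CYCLES_PER_SCANLINE = 224 };  /* figure quoted above for the ZX Spectrum */

typedef struct {
    long scanline;  /* scanlines completed since the frame interrupt */
    int  offset;    /* cycles into the current scanline, 0..223 */
} BeamPosition;

/* Map a cycle count since the frame interrupt to a beam position. */
BeamPosition beam_position(long cycles_since_interrupt)
{
    assert(cycles_since_interrupt >= 0);
    BeamPosition pos;
    pos.scanline = cycles_since_interrupt / CYCLES_PER_SCANLINE;
    pos.offset   = (int)(cycles_since_interrupt % CYCLES_PER_SCANLINE);
    return pos;
}
```

An emulator that records the cycle count at each write to the border port could use a mapping like this to decide where on the line the color change becomes visible.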
If you want to sync your processor with other hardware it could be useful to do it like that. For instance, if you want to sync it with a timer you would like to control how many cycles can pass before the timer interrupts the CPU.

Could a Cray XK6 run a real-time raytracer?

I heard about Cray's new supercomputer -- the XK6 -- today, but I am a little confused about where the bottlenecks are. Is it in the interconnect? Can an XK6 configured with, say, 500,000 16-core processors achieve a graphic fidelity comparable to Toy Story 3 in real-time? By "real-time", I mean 60fps, or around 16.7 milliseconds per frame.
No. Pure computation is surprisingly little of what it takes to render a film frame from Toy Story 3 or a similar modern animated (or VFX) film. Those scenes may reference many hundreds of GB of texture, and even if you could know exactly which subset of that texture will be needed for a frame, it may be tens of GB, which still needs to be read from disk and/or transferred over a network. GPUs or massively parallel distributed computation doesn't speed that up. Furthermore, rendering is only the very last step... preparing the geometric input for a frame (simulating the fluids, cloth and hair, tessellating the geometry, reading and interpreting large scenes from disk) can be substantial.
So, just pulling numbers out of the air (but these are moderately realistic), say it takes 30 minutes to prepare the scene (load the models, tessellate it, some minor sims, etc.), and 4.5 hours to render (of which, say, 30 minutes is reading texture and other resources from disk, leaving 4 hours of "ray tracing" and other computation). If the XK6 made the ray tracing infinitely fast, it would only speed the total process up by 5x (the remaining 1 hour is hard-to-parallelize prep and I/O, out of 5 hours total). That's Amdahl's Law for you.
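The 5x figure can be checked with a small sketch of Amdahl's Law. The function names are illustrative; p is the fraction of the run time that gets accelerated (4 hours out of 5, so 0.8 with the numbers above) and s is the speedup applied to that fraction.

```c
#include <assert.h>

/*
 * Amdahl's Law: overall speedup when a fraction p of the run time
 * is accelerated by a factor s and the rest is left unchanged.
 */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound as s grows without limit: only the unaccelerated part remains. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.8 the limit is 1 / 0.2 = 5, and even a 10x faster ray tracer would only give about 3.6x overall.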
Now, you're probably asking yourself, "how do games go so fast?" They do it in two ways: (1) they drastically reduce the data set (texture size, geometric resolution, etc.) to make it all fit on the GPU and be reasonably fast to load levels (which, curiously, you the user are not counting when you think of the rendering as happening in "real time"); (2) they spare no expense in computation, tricks, and human labor to optimize the scenes and algorithms before they ship the disks, so that when it's in front of the player it can render quickly.
So, in summary, if you are asking if the total computational power of the XK6 is enough to compute in real time all the pure math required to render a film frame, then yes, it probably is. But if you are asking if an XK6 could actually render the movie in realtime given the kinds of inputs the renderer needs, then no, it couldn't. Would an XK6 be of any use to people rendering those movie frames? No, it probably wouldn't be worth the trouble of reprogramming all the software (hundreds of man years) from the ground up.
Looking at it from another viewpoint, users generally render only one scene at a time, then make small changes and re-render over and over again. To render only a single scene in realtime may still require several GB of textures loaded onto the GPU's RAM.
Could a supercomputer like those from Cray, using a "sea of cores" or a vast array of modern CPUs, perform the job in realtime? Yes, for simple enough scenes. More complex scenes that require 100+ rays per pixel at 8MP (4K x 2K for movies, 2MP for DSLR/indie type movies), along with lots of objects, shadows, haze, refraction, diffuse lighting sources, etc., would probably require too many computations, even at 24fps.
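As a back-of-the-envelope check on those numbers, the sketch below just multiplies them out. It counts primary rays only and says nothing about the cost of each ray, so read it as a lower bound on the work, not as a performance model. The function name is illustrative.

```c
#include <assert.h>

/* Rays that must be traced per second to sustain a given frame rate. */
long long rays_per_second(long long pixels, long long rays_per_pixel,
                          long long fps)
{
    assert(pixels > 0 && rays_per_pixel > 0 && fps > 0);
    return pixels * rays_per_pixel * fps;
}
```

For 8MP, 100 rays per pixel and 24fps this comes to 19.2 billion rays per second; at 60fps it would be 48 billion.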

Massively parallel application: what about many 8-bit cores for non-vector AI applications?

I was thinking (oh god, it starts badly) about neural networks and how it is not possible to simulate them because they require many atomic operations at the same time (here meaning simultaneously), because that's how neurons are faster: there are many of them computing at once.
Our processors are 32-bit, so they can compute over a significantly larger range (meaning a lot of different atomic numbers, whether floating point or integer). The frequency race is also over, and manufacturers have started shipping multicore processors, requiring developers to implement multithreading in their applications.
I was also thinking about the most important difference between computers and brains: brains use a lot of neurons, while computers use precision at a high frequency. That's why it seems harder or impossible to simulate a real-time AI with the current processor model.
Since 32-bit/64-bit chips also take a great many transistors, and since AI doesn't require vector/floating point precision, would it be a good idea to have many more 8-bit cores on a single processor, like 100 or 1000 for example, since they take much less room (I don't work at Intel or AMD so I don't know how they design their processors, it's just a wild guess), to plan for that kind of AI simulation?
I don't think it would only serve AI research though, because I don't know how web servers can really take advantage of 64-bit processors (strings use 8 bits); Xeon processors differ only in their cache size.
What you describe is already available by means of multimedia instruction sets. It turns out that computer graphics also needs many parallel operations on bytes or even half-bytes. So CPUs started growing vector operations (SSE, MMX, etc.); more recently, graphics processors have opened up to general purpose computing (GPGPU).
I think you are mistaken in assuming that neural processing is not a vector operation: many AI neural networks rely heavily on vector and matrix operations.
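To show why a layer of neurons is naturally a matrix-vector operation even with 8-bit values, here is a minimal scalar sketch in C. The function name, the row-major weight layout and the 32-bit accumulator are my own assumptions for illustration; the inner loop is the kind of multiply-accumulate that vector units and GPUs are built to run many lanes at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One fully connected layer with 8-bit weights and inputs:
 *   out[j] = bias[j] + sum over i of weights[j * n_in + i] * in[i]
 * Products are accumulated in 32 bits because a sum of 8-bit
 * products does not fit in 8 bits.
 */
void dense_layer_i8(const int8_t *weights, const int8_t *in,
                    const int32_t *bias, int32_t *out,
                    size_t n_in, size_t n_out)
{
    assert(weights != NULL && in != NULL && bias != NULL && out != NULL);
    for (size_t j = 0; j < n_out; j++) {        /* one output neuron per row */
        int32_t acc = bias[j];
        for (size_t i = 0; i < n_in; i++) {     /* dot product of row j with the input */
            acc += (int32_t)weights[j * n_in + i] * (int32_t)in[i];
        }
        out[j] = acc;
    }
}
```

Every output neuron is an independent dot product, so the outer loop parallelizes across neurons and the inner loop across vector lanes.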
