Massively parallel applications: what about many 8-bit cores for non-vector AI applications? - multithreading

I was thinking (oh god, it starts badly) about neural networks and how hard it is to simulate them, because they require many atomic operations at the same time (meaning simultaneously); that's what makes neurons fast: there are many of them computing things in parallel.
Our processors are 32-bit, so they can handle a significantly wider range of values (many different atomic numbers, whether floating point or integer). Meanwhile, the frequency race is over and manufacturers have started shipping multicore processors, requiring developers to implement multithreading in their applications.
I was also thinking about the most important difference between computers and brains: brains use a huge number of neurons, while computers rely on precision at high frequency. That's why it seems hard or impossible to simulate a real-time AI with the current processor model.
Since 32-bit/64-bit chips take up a great deal of transistors, and since AI doesn't require vector/floating-point precision, would it be a good idea to put many more 8-bit cores on a single processor, say 100 or 1000, since they take much less room (I don't work at Intel or AMD, so I don't know how they design their processors; it's just a wild guess), to prepare for those kinds of AI simulations?
I don't think it would only serve AI research, though: I don't see how web servers really take advantage of 64-bit processors (strings use 8-bit characters), and Xeon processors mostly differ only in their cache size.

What you describe is already available through multimedia instruction sets. It turns out that computer graphics also needs many parallel operations on bytes or even half-bytes, so CPUs grew vector instruction sets (MMX, SSE, etc.); more recently, graphics processors have opened up to general-purpose computing (GPGPU).
I think you are mistaken in assuming that neural processing is not a vector operation: artificial neural networks rely heavily on vector and matrix operations.
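To make that concrete, here is a minimal C++ sketch (function name and layout are just illustrative) of a fully connected neuron layer expressed as a matrix-vector product, which is exactly the kind of loop that SIMD units and GPUs execute efficiently:

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch: one fully connected layer computed as a matrix-vector
// product. Each output neuron is a dot product of a weight row with the
// input vector -- the kind of regular loop that vector units handle well.
std::vector<float> layer_forward(const std::vector<float>& weights, // rows x cols, row-major
                                 const std::vector<float>& input,   // cols
                                 std::size_t rows, std::size_t cols)
{
    std::vector<float> output(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)       // one output neuron per row
        for (std::size_t c = 0; c < cols; ++c)   // inner loop vectorizes easily
            output[r] += weights[r * cols + c] * input[c];
    return output;
}
```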

Related

Interleaved vs non-interleaved vertex buffers

This seems like a question which has been answered throughout time for one IHV or another, but recently I have been trying to come to a consensus about vertex layouts and the best practices for a modern renderer across all IHVs and architectures. Before someone says benchmark, I can't easily do that, as I don't have access to a card from every IHV and every architecture from the last 5 years. Therefore, I am looking for some best practices that will work decently well across all platforms.
First, the obvious:
Separating position from other attributes is good for:
Shadow and depth pre-passes
Per-triangle culling
Tile-based deferred renderers (such as the Apple M1)
Interleaved is more logical on the CPU; you can have a Vertex class.
Non-interleaved can make some CPU calculations faster because it can take advantage of SIMD (see the sketch after this list).
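For reference, a minimal C++ sketch of the two layouts under discussion (struct and field names are hypothetical, not tied to any particular API):

```cpp
#include <vector>

// Interleaved ("array of structures"): one struct per vertex, all
// attributes adjacent in memory.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};
using InterleavedBuffer = std::vector<Vertex>;

// Non-interleaved ("structure of arrays"): one tightly packed stream per
// attribute, so a position-only pass (shadow/depth) touches only positions.
struct DeinterleavedBuffers {
    std::vector<float> positions; // 3 floats per vertex
    std::vector<float> normals;   // 3 floats per vertex
    std::vector<float> uvs;       // 2 floats per vertex
};
```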
Now onto the less obvious.
Many people quote NVIDIA as saying that you should always interleave, and moreover that you should align to 32 or 64 bytes. I have not found the source of this, but I did find a document by NVIDIA about vertex shader performance. It is quite old (2013) and concerns the Tegra GPU, which is mobile, not desktop. In particular it says:
Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
Fast forward three years to GDC 2016, where EA gave a presentation mentioning several reasons why you should de-interleave vertex buffers. However, this recommendation seems to be tied to AMD architectures, in particular GCN. While they make a cross-platform case for separating the position, they propose de-interleaving everything, stating that it will allow the GPU to:
Evict cache lines as quickly as possible
And that it is optimal for GCN (AMD) architectures.
This seems to be in conflict with what I have heard elsewhere, which says to interleave in order to make the most of a cache line. But again, that was not in regard to AMD.
With many different IHVs (Intel, NVIDIA, AMD, and now Apple with the M1 GPU), each with many different architectures, I am left completely uncertain about what one should do today (without the budget to test on dozens of GPUs) to best optimize performance across all architectures without resulting in
a massive resultant drop in performance
on some architectures. In particular, is de-interleaved still best on AMD? Is it no longer a problem on NVIDIA, or was it never a problem on desktop NVIDIA GPUs? What about the other IHVs?
NOTE: I am not interested in mobile, only desktop GPUs from the past 5 years or so.

Propagation delay in circuits

Which is better for accurately estimating propagation delay: SPICE simulation, or calculation using the Elmore delay (RC delay modeling)?
SPICE simulation is more accurate than Elmore delay modeling. This is discussed in the book CMOS VLSI Design by Weste and Harris, page 93, Section 2.6:
Blindly trusting one's models
Models should be viewed as only approximations to reality, not reality itself, and used within their limitations. In particular, simple models like the Shockley or RC models aren't even close to accurate fits for the I-V characteristics of a modern transistor. They are valuable for the insight they give on trends (i.e., making a transistor wider increases its gate capacitance and decreases its ON resistance), not for the absolute values they predict. Cutting-edge projects often target processes that are still under development, so these models should only be viewed as speculative. Finally, processes may not be fully characterized over all operating regimes; for example, don't assume that your models are accurate in the subthreshold region unless your vendor tells you so. Having said this, modern SPICE models do an extremely good job of predicting performance well into the GHz range for well-characterized processes and models when using proper design practices (such as accounting for temperature, voltage, and process variation).
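For context, the Elmore (RC) delay mentioned in the question is a first-order estimate: for a simple RC ladder, each node capacitance is multiplied by the total resistance between that node and the source, and the contributions are summed. A minimal C++ sketch of that calculation (the ladder values and layout are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the Elmore delay for a simple RC ladder driven from one end:
// node i contributes C[i] times the sum of the resistances on the path
// from the source to node i (tau = sum_i C_i * sum_{j<=i} R_j).
double elmore_delay(const std::vector<double>& R,  // series resistances, source to load
                    const std::vector<double>& C)  // node capacitances, same order
{
    double delay = 0.0;
    double path_resistance = 0.0;
    for (std::size_t i = 0; i < R.size() && i < C.size(); ++i) {
        path_resistance += R[i];          // resistance shared with the source
        delay += path_resistance * C[i];  // accumulate C_i * sum_{j<=i} R_j
    }
    return delay; // first-order delay estimate, in seconds for ohms/farads
}
```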

plot speedup curve vs number of OpenMP threads - scalability?

I am working on a C++ code which uses OpenMP threads. I have plotted the speedup curve versus the number of OpenMP threads, together with the theoretical curve (if the code were fully parallelizable).
Here is the plot:
From this picture, can we say this code is not scalable (from a parallelization point of view)? I.e., that the code does not run twice as fast with 2 OpenMP threads, four times as fast with 4 threads, and so on?
Thanks
For the code that barely achieves 2.5x speedup on 16 threads, it is fair to say that it does not scale. However "is not scalable" is often considered a stronger statement. The difference, as I understand it, is that "does not scale" typically refers to a particular implementation and does not imply inherent inability to scale; in other words, maybe you can make it scale if bottlenecks are eliminated. On the other hand, "is not scalable" usually means "you cannot make it scale, at least not without changing the core algorithm". Assuming such meaning, one cannot say "a problem/code/algorithm is not scalable" only looking at a chart.
On an additional note, it's not always reasonable to expect perfect scaling (2x with 2 threads, 4x with 4 threads, etc.). A curve that is "close enough" to the ideal might still be considered good scalability, and what "close enough" means may depend on a number of factors. It can be useful to think in terms of parallel efficiency, rather than speedup, when scalability is the question. For example, if parallel efficiency is 0.8 (or 80%) and does not drop as the number of threads increases, that could be considered good scalability. Also, it's possible that a program scales well up to a certain number of threads, but then plateaus or even slows down when more resources are added.
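As an illustration (dummy workload, illustrative problem size), here is how one might measure speedup and parallel efficiency directly in a C++/OpenMP program rather than reading them off a chart:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>
#include <omp.h>

// Time a trivially parallel loop with 1, 2, 4, ... threads and report
// speedup and parallel efficiency relative to the single-thread run.
// The printed numbers depend entirely on the machine this runs on.
int main()
{
    const std::size_t n = 10000000;
    std::vector<double> data(n, 1.0);

    double baseline = 0.0;
    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        double sum = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum) num_threads(threads)
        for (long long i = 0; i < static_cast<long long>(n); ++i)
            sum += data[i];
        double elapsed = omp_get_wtime() - t0;

        if (threads == 1) baseline = elapsed;
        double speedup    = baseline / elapsed;
        double efficiency = speedup / threads;  // 1.0 would be perfect scaling
        std::printf("%2d threads: speedup %.2fx, efficiency %.0f%% (sum=%.0f)\n",
                    threads, speedup, efficiency * 100.0, sum);
    }
    return 0;
}
```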

What are the calculations or assumptions for the frequency of the cores used in supercomputers?

What are the calculations that tell us that such-and-such a frequency should be used for the job, which may include weather forecasting or solving critical equations, that is, all the things supercomputers do?
Core frequency is just one aspect that governs the speed of a computer; others include cache sizes and speeds, inter-core and inter-module communication speeds, and so on.
Supercomputers today use regular CPUs, like Xeon processors. The difference between a supercomputer and a regular desktop is the number of CPUs and the interconnections between the different CPUs and memory areas.
Modern CPUs have a lot of caching and branch prediction, which makes it hard to calculate the number of clock cycles required for a given algorithm.

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU can only carry audio output and can't generate it? Would that mean a large increase in latency if I had to read the data back on the CPU and output it to the sound card? I'm looking for a latency between 10 and 20 ms between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could run as its own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious, for example, that computing the sine values for 128 samples of output from a sine source could be done completely in parallel. Working in blocks of that size would permit a latency of only about 3 ms (at a 44.1 kHz sample rate), which is acceptable in most digital audio applications. Similarly, many other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
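As a minimal sketch (illustrative parameters, plain C++ rather than an actual GPU kernel), this is the per-sample independence being described; the same pattern is what a GPU work-item per sample would exploit:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Each sample of the block depends only on its own index, so every sample
// can be computed by a separate thread/lane/work-item.
std::vector<float> sine_block(double frequency_hz, double sample_rate_hz,
                              std::size_t start_sample, std::size_t block_size = 128)
{
    std::vector<float> block(block_size);
    const double two_pi = 6.283185307179586;
    #pragma omp parallel for  // every iteration is independent
    for (long long i = 0; i < static_cast<long long>(block_size); ++i) {
        double t = static_cast<double>(start_sample + i) / sample_rate_hz;
        block[i] = static_cast<float>(std::sin(two_pi * frequency_hz * t));
    }
    return block;
}
```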
In addition to oscillators, FIR filters are simple to parallelize, and a Google search turned up some promising-looking research papers (which I didn't take the trouble to read) suggesting that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing, and many useful audio operations can be understood as such filters.
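Likewise, a rough C++ sketch of why an FIR filter parallelizes well: each output sample is an independent dot product of the coefficients with a window of the input, so outputs can be computed concurrently (names and the OpenMP pragma are just illustrative of the pattern):

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum_k h[k] * x[n-k]. Each output sample only
// reads the input, so all outputs can be computed in parallel.
std::vector<float> fir_filter(const std::vector<float>& input,
                              const std::vector<float>& coeffs)
{
    std::vector<float> output(input.size(), 0.0f);
    #pragma omp parallel for  // each output sample is independent
    for (long long n = 0; n < static_cast<long long>(input.size()); ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < coeffs.size() && k <= static_cast<std::size_t>(n); ++k)
            acc += coeffs[k] * input[n - k];
        output[n] = acc;
    }
    return output;
}
```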
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on, which are fundamentally stream-based operations. These are the types of operations where a GPU might also excel.
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like, and furthermore, the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of a GPU.
The effort involved in using a GPU - particularly getting data in and out - is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers - and also tablets and mobile devices - are more than enough for many digital audio applications. AMD seems to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that only targets a limited subset of hardware.
Typical transfer times for a few MB to/from the GPU are around 50 us.
Delay is not your problem; however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the data copy.
If you are going to run multiple synthesizers at once, I would recommend running each synthesizer in a work-group and parallelizing the synthesis process across the available work-items. It will not be worthwhile to put each synthesizer in a single work-item, since it is unlikely you will have thousands of them.
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP for its lower initialization times.
You could check out the NESS project, which is all about physical modelling synthesis. They use GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis, because it is so demanding of processing.
