GPU performance indicators - graphics

Is there a way to find an Intel HD Graphics equivalent to my current Nvidia card just by comparing specifications, one that would give me approximately similar performance?

Check the GPUs' specifications for GFLOPS. While that figure is not absolutely precise, it can be a good enough performance indicator for computation tasks.
Another option is to search for GPU benchmarks.
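If you want to turn the spec sheets into a rough number yourself, the usual rule of thumb for theoretical peak single-precision throughput is shader (ALU) count times core clock times two, since most GPUs can issue one fused multiply-add (two FLOPs) per ALU per cycle. A minimal sketch in C, with made-up placeholder numbers rather than real card specs:

#include <stdio.h>

/* Theoretical peak single-precision GFLOPS: ALUs x clock (GHz) x 2 FLOPs per FMA. */
static double peak_gflops(int shader_units, double clock_ghz)
{
    return shader_units * clock_ghz * 2.0;
}

int main(void)
{
    /* Hypothetical discrete card vs. hypothetical integrated GPU. */
    printf("discrete card:  %.0f GFLOPS\n", peak_gflops(384, 1.05));
    printf("integrated GPU: %.0f GFLOPS\n", peak_gflops(192, 1.10));
    return 0;
}

Real-world performance also depends heavily on memory bandwidth and drivers, so treat the result as a ceiling rather than a prediction, and cross-check against benchmarks as suggested above.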

Related

Interleaved vs non-interleaved vertex buffers

This seems like a question which has been answered throughout time for one IHV or another, but recently I have been trying to reach a consensus about vertex layouts and the best practices for a modern renderer across all IHVs and architectures. Before someone says "benchmark": I can't easily do that, as I don't have access to a card from every IHV and every architecture from the last 5 years. Therefore, I am looking for some best practices that will work decently well across all platforms.
First, the obvious:
Separating position from other attributes is good for:
Shadow and depth pre-passes
Per-triangle culling
Tile-based deferred renderers (such as the Apple M1)
Interleaved is more logical on the CPU side, since you can work with a single Vertex struct or class.
Non-interleaved can make some CPU calculations faster, since it lets you take advantage of SIMD (see the layout sketch below).
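To make the options concrete, here is a minimal sketch of the three layouts under discussion in plain C; the struct names are mine and not from any particular API or engine:

/* Interleaved ("array of structures"): one stream, one stride. */
typedef struct {
    float position[3];
    float normal[3];
    float uv[2];
} VertexInterleaved;            /* 32 bytes per vertex */

/* Fully de-interleaved ("structure of arrays"): one stream per attribute. */
typedef struct {
    float *positions;           /* 3 floats per vertex */
    float *normals;             /* 3 floats per vertex */
    float *uvs;                 /* 2 floats per vertex */
} VertexStreamsDeinterleaved;

/* Split-position hybrid: positions alone in one tightly packed stream,
 * everything else interleaved in a second stream. */
typedef struct {
    float normal[3];
    float uv[2];
} VertexNonPosition;

typedef struct {
    float *positions;           /* stream 0: 12-byte stride */
    VertexNonPosition *rest;    /* stream 1: 20-byte stride */
} VertexStreamsSplitPosition;

The split-position variant is the hybrid usually paired with depth-only and shadow passes: those passes read only the tightly packed 12-byte position stream, while the main pass binds both streams.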
Now onto the less obvious.
Many people quote NVIDIA as saying that you should always interleave, and moreover that you should align to 32 or 64 bytes. I have not found the source of this; what I did find is an NVIDIA document about vertex shader performance, but it is quite old (2013) and concerns the Tegra GPU, which is mobile rather than desktop. In particular it says:
Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
Fast forward three years to GDC 2016, where EA gave a presentation mentioning several reasons why you should de-interleave vertex buffers. However, this recommendation seems to be tied to AMD architectures, in particular GCN. While they make a cross-platform case for separating the position, they propose de-interleaving everything, with the statement that it will allow the GPU to:
Evict cache lines as quickly as possible
and that this is optimal for GCN (AMD) architectures.
This seems to be in conflict with what I have heard elsewhere, which is to use interleaved layouts in order to make the most of each cache line. But again, that advice was not in regard to AMD.
With many different IHVs (Intel, NVIDIA, AMD, and now Apple with the M1 GPU), each with many different architectures, I am left completely uncertain about what one should do today (without the budget to test on dozens of GPUs) to best optimize performance across all architectures without causing
a massive resultant drop in performance
on some architectures. In particular, is de-interleaved still best on AMD? Is it no longer a problem on NVIDIA, or was it never a problem on desktop NVIDIA GPUs? What about the other IHVs?
NOTE: I am not interested in mobile, only all desktop GPUs in the past 5 years or so.

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason for this is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU can only carry audio output and can't generate it? Would this mean a large increase in latency if I had to read the data back to the CPU and send it to the sound card? I'm looking for a latency between 10 and 20 ms between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could take up its own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious, for example, that computing sine values for 128 samples of output from a sine source could be done completely in parallel. Working in blocks of that size would permit a latency of only about 3 ms, which is acceptable in most digital audio applications. Similarly, many of the other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
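As a hedged sketch of what that looks like in OpenCL C (kernel and parameter names are mine; a real oscillator would carry its phase between blocks rather than recompute it from an absolute sample index, which loses float precision over time):

/* One work-item per output sample: all 128 samples of the block are
 * computed in parallel.  At 44.1 kHz, a 128-sample block is roughly 2.9 ms. */
__kernel void sine_block(__global float *out,
                         const float freq_hz,
                         const float sample_rate,
                         const uint  block_start)   /* absolute index of the block's first sample */
{
    const uint i = (uint)get_global_id(0);          /* 0 .. block_size - 1 */
    const float t = (block_start + i) / sample_rate;
    out[i] = sin(6.2831853f * freq_hz * t);
}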
In addition to oscillators, FIR filters are simple to parallelize, and a Google search turned up some promising-looking research papers (which I didn't take the trouble to read) suggesting that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing, and many useful audio operations can be understood as such filters.
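For a sense of how directly a FIR maps onto the GPU model, here is a hedged OpenCL C sketch in which each work-item computes one output sample independently; the buffer layout (history samples padded in front of the block) and the names are my assumptions:

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].  One work-item per
 * output sample.  The input buffer is assumed to hold (taps - 1) history
 * samples followed by the current block, so every index stays in bounds. */
__kernel void fir_block(__global const float *x,   /* (taps - 1) history samples + block */
                        __global const float *h,   /* filter coefficients h[0 .. taps-1] */
                        __global float *y,         /* output block */
                        const int taps)
{
    const int n    = (int)get_global_id(0);        /* output index within the block */
    const int base = n + (taps - 1);               /* position of x[n] inside the padded input */
    float acc = 0.0f;
    for (int k = 0; k < taps; ++k)
        acc += h[k] * x[base - k];
    y[n] = acc;
}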
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on, which are fundamentally stream-based operations. These are the type of operations where a GPU might also excel.
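For illustration, a minimal example of the kind of SSE code being referred to, applying a gain to a block of samples four floats at a time (it assumes a 16-byte-aligned buffer whose length is a multiple of four):

#include <xmmintrin.h>

/* Multiply every sample in the block by `gain`, processing 4 floats per iteration. */
void apply_gain_sse(float *samples, int count, float gain)
{
    const __m128 g = _mm_set1_ps(gain);            /* broadcast gain into all 4 lanes */
    for (int i = 0; i < count; i += 4) {
        __m128 s = _mm_load_ps(samples + i);       /* load 4 samples (aligned) */
        _mm_store_ps(samples + i, _mm_mul_ps(s, g));
    }
}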
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like, and furthermore the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of a GPU.
The effort involved in using a GPU, particularly getting data in and out, is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers (and also tablets and mobile devices) are more than enough for many digital audio applications. AMD seem to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that targets only a limited subset of hardware.
Typical transfer times for a few MB to/from the GPU are around 50 µs.
Delay is not your problem; however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the data copies.
If you are going to run multiple synthesizers at once, I would recommend running each synthesizer in its own work-group and parallelizing the synthesis process across the work-items available. It will not be worthwhile to put each synthesizer in a single work-item, since it is unlikely you will have thousands of them.
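A hedged sketch of that mapping in OpenCL C, reusing the sine oscillator from the earlier sketch; one work-group renders one voice, and each work-item within it produces one sample (the host would enqueue it with a global size of voices * block_size and a local size of block_size):

/* One work-group per synthesizer voice, one work-item per sample of that
 * voice's output block.  Parameter layout and names are illustrative only. */
__kernel void render_voices(__global const float *freqs,   /* one frequency per voice */
                            __global float *out,           /* voices * block_size samples, voice-major */
                            const float sample_rate,
                            const uint  block_start,       /* absolute index of the block's first sample */
                            const uint  block_size)
{
    const uint voice  = (uint)get_group_id(0);   /* which synthesizer voice */
    const uint sample = (uint)get_local_id(0);   /* which sample within the block */
    const float t = (block_start + sample) / sample_rate;
    out[voice * block_size + sample] = sin(6.2831853f * freqs[voice] * t);
}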
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP for its lower initialization times.
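For comparison, the OpenMP version of the same idea is a single pragma over the voice loop; render_voice() here is a hypothetical per-voice synthesis function, not a real library call:

#include <omp.h>

/* Hypothetical per-voice synthesis routine, declared only for this sketch. */
void render_voice(int voice, float *out, int block_size);

/* One CPU thread per synthesizer voice, each rendering its own output block. */
void render_all_voices(int num_voices, float *out, int block_size)
{
    #pragma omp parallel for
    for (int v = 0; v < num_voices; ++v)
        render_voice(v, out + v * block_size, block_size);
}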
You could check out the NESS project, which is all about physical modelling synthesis. They are using GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis because it is so demanding of processing.

How to utilize 2d/3d Graphics Acceleration on a Single Board Computer

This may be a somewhat silly question, but if you are working with a single board computer that boasts that it has 2d/3d graphics acceleration, what does this actually mean?
If it supports DirectX or OpenGL, obviously I could just use that framework, but I am not familiar with working from this end of things. I do not know whether it means that those libraries can be included in the OS, or whether it just means that the board does certain kinds of math more quickly (either by default or through some other process).
Any clarification on what this means or locations of resources I could use on such would be greatly appreciated.
On embedded systems, 2D/3D graphics acceleration could mean a lot of things: for instance, that framebuffer operations are accelerated through DirectFB, or that OpenGL ES is supported.
The fact is that the manufacturer of the board usually provides these libraries since the acceleration of the graphics itself is deeply connected to the hardware.
It's best to get in touch with your manufacturer and ask which graphics libraries they support that are hardware accelerated.
There are two very important features of 2D/3D graphics cards:
Take a load away from the CPU
Process that load much faster than the CPU can, because the GPU has a special instruction set designed explicitly for calculations that are common in graphics (e.g. transformations)
Sometimes other jobs are passed on to the GPU because they require calculations that fit the GPU's instructions very well. For example, a physics library requires lots of matrix calculations, so a GPU can be used to do that; NVIDIA made PhysX to do exactly that. See this FAQ also.
The minimum a graphics display requires is the ability to set the state (colour) of individual pixels. This allows you to render any image within the resolution and colour depth of the display, but for complex drawing tasks and very high resolution displays this would be very slow.
Graphics acceleration refers to any graphics processing function off-loaded to hardware. At its simplest this may mean the drawing and filling of graphics primitives such as lines and polygons, and 'blitting', the moving of blocks of pixels from one location to another. Technically, graphics acceleration has been largely replaced by graphics processors (GPUs), though the effect is the same: faster graphics. GPUs are more flexible, since a hardware accelerator can accelerate only the set of operations it is hard-wired to perform, which may benefit some applications more than others.
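For readers unfamiliar with the term, a blit in its simplest software form is just a row-by-row copy of a rectangular block of pixels; a hardware blitter performs the same job (plus clipping, colour keying and so on) without the CPU touching each pixel. A plain C sketch:

#include <stdint.h>
#include <string.h>

/* Copy a width x height rectangle of 32-bit pixels from src to dst.
 * Pitch is given in pixels per row; regions are assumed not to overlap. */
void blit(const uint32_t *src, int src_pitch,
          uint32_t *dst, int dst_pitch,
          int src_x, int src_y, int dst_x, int dst_y,
          int width, int height)
{
    for (int row = 0; row < height; ++row) {
        memcpy(dst + (dst_y + row) * dst_pitch + dst_x,
               src + (src_y + row) * src_pitch + src_x,
               (size_t)width * sizeof(uint32_t));
    }
}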
Modern GPU hardware performs far higher-level graphics processing. It is also possible to use the GPU for more general-purpose matrix computation using interfaces such as Nvidia's CUDA, which can then accelerate computational tasks other than graphics that require the same kind of mathematical operations.
The Wikipedia "Graphics processing unit" article has a history of Graphics Accelerators and GPUs

Massively parallel application: what about several 8 bits cores for non-vector IA applications?

I was thinking (oh god, it starts badly) about neural networks and how it is not possible to simulate them because they require many atomic operations at the same time (here meaning simultaneously), because that is how neurons get their speed: there are many of them doing the computing.
Our processors are 32-bit, so they can compute over a significantly larger range (meaning a lot of different atomic numbers, whether floating-point or integer); meanwhile the frequency race is over, and manufacturers have started shipping multicore processors, requiring developers to implement multithreading in their applications.
I was also thinking about the most important difference between computers and brains: brains use a lot of neurons, while computers use precision at a high frequency. That is why it seems hard or impossible to simulate a real-time AI with the current processor model.
Since 32-bit/64-bit chips also take a great deal of transistors, and since AI doesn't require vector/floating-point precision, would it be a good idea to have many more 8-bit cores on a single processor, say 100 or 1000, since they take much less room (I don't work at Intel or AMD, so I don't know how they design their processors; it's just a wild guess), to plan for those kinds of AI simulations?
I don't think it would only serve AI research, though, because I don't see how web servers can really take advantage of 64-bit processors (strings use 8 bits per character); Xeon processors differ mainly in their cache size.
What you describe is already available by means of multimedia instruction sets. It turns out that computer graphics also needs many parallel operations on bytes or even half-bytes, so CPUs started growing vector instruction sets (MMX, SSE, etc.); more recently, graphics processors have opened up to general-purpose computing (GPGPU).
I think you are mistaken in assuming that neural processing is not a vector operation: many AI neural networks rely heavily on vector and matrix operations.
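To see why, consider that evaluating one fully connected layer of a neural network is nothing more than a matrix-vector product followed by an activation function, which is exactly the kind of work vector units and GPUs are built for. A plain C sketch:

#include <math.h>

/* One fully connected layer: out = activation(weights * in + bias).
 * weights is outputs x inputs, stored row-major. */
void dense_layer(const float *weights, const float *bias,
                 const float *in, float *out,
                 int inputs, int outputs)
{
    for (int o = 0; o < outputs; ++o) {
        float acc = bias[o];
        for (int i = 0; i < inputs; ++i)
            acc += weights[o * inputs + i] * in[i];   /* dot product of one weight row with the input */
        out[o] = tanhf(acc);                          /* activation */
    }
}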

GPU-based video cards to accelerate your program calculations, How?

I read in this article that a company has created software capable of using multiple GPU-based video cards in parallel to process hundreds of billions of fixed-point calculations per second.
The program seems to run in Windows. Is it possible from Windows to assign a thread to a GPU? Do they create their own driver and then interact with it? Any idea of how they do it?
I imagine that they are using a language like CUDA to program the critical sections of code on the GPUs to accelerate their computation.
The main function of the program (and its threads) would still run on the host CPU, but data are shipped off to the GPUs for processing by the advanced algorithms. CUDA is an extension of C syntax, so it is easier to program in than the older shader languages, like Cg, that were used for general-purpose calculations on a GPU.
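The answer above mentions CUDA; as a hedged illustration of the same offload pattern using the OpenCL host API instead (error handling omitted, kernel source elided, and the kernel name "process" is a placeholder), the shape is: allocate device buffers, copy the data across, launch a kernel over N work-items, and read the results back:

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

void run_on_gpu(const float *input, float *output, size_t n, const char *kernel_src)
{
    cl_platform_id platform;  cl_device_id device;  cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Ship the data off to the GPU... */
    cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), (void *)input, &err);
    cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

    /* ...build and launch a kernel over n work-items... */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "process", &err);   /* placeholder kernel name */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* ...and read the results back on the CPU. */
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n * sizeof(float), output, 0, NULL, NULL);

    clReleaseMemObject(in_buf);   clReleaseMemObject(out_buf);
    clReleaseKernel(kernel);      clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
}

CUDA's runtime API follows the same shape (allocate on the device, copy in, launch, copy out); the GPU driver exposes this through the vendor's runtime, so the application does not need to write its own driver.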
A good place to start - GPGPU
Also, for the record, I don't think there is such a thing as a non-GPU-based graphics card. GPU stands for graphics processing unit, which is by definition the heart of a graphics card.
