Did I test the ArrayFire performance incorrectly? - performance-testing

I cannot figure out what's wrong. I mean, the speed is way too fast, like 1 million items vs 10 million items basically have the same 0.0005 second computation on my machine. So fast, it looks like it wasn't doing anything. But the result of the data is actually correct.
It is mind boggling because if I make similar computation on sequential loop without storing the result in an array, it is not only number of cores slower, but, like 1000 times slower than ArrayFire.
So, maybe I wasn't using the timer correctly?
Do you think they didn't actually compute the data right away? Maybe it just sets up some kind of shadow marker? And when I call the myArray.host(), it will start doing all the actual computations?
From their website, it says there is some kind of JIT to bundle the computations.
ArrayFire uses Just In Time compilation to combine many light weight functions into a single kernel launch. This along with our easy-to-use API allows users to not only quickly prototype their algorithms, but also get the best out of the underlying hardware.
I start/stop my timer right before/after few ArrayFire computations. And it is just insanely fast. Maybe I test it wrong? What's the proper way to test ArrayFire performance?
Never mind, I found out what to do,
Based on the examples, I should be using af::timeit(function) instead of using the af::timer. Using af::timeit will be very slow, but, the result scale more reasonably when I increase the size 10x. It doens't actually compute right away, that's why using af::timer myself wouldn't work.
thank you

Related

What technique(s) could I use to determine lag time?

I am trying to determine the lag time between seeing a dip (or rise) in a predictor metric and when we would see the resulting dip (or rise) in a known response metric. I am not sure where to start, could someone put me on the right path?
For context, I would like to use R or Python and am familiar with statistics and machine learning. I am just searching for what method or modeling technique would be best to use and less about the code.

Unknown events in nodejs/v8 flamegraph using perf_events

I try to do some nodejs profiling using Linux perf_events as described by Brendan Gregg here.
Workflow is following:
run node >0.11.13 with --perf-basic-prof, which creates /tmp/perf-(PID).map file where JavaScript symbol mapping are written.
Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
Fold stacks using stackcollapse-perf.pl script from this repository
Generate svg flame graph using flamegraph.pl script
I get following result (which look really nice at the beginning):
Problem is that there are a lot of [unknown] elements, which I suppose should be my nodejs function calls. I assume that whole process fails somwhere at point 3, where perf data should be folded using mappings generated by node/v8 executed with --perf-basic-prof. /tmp/perf-PID.map file is created and some mapping are written to it during node execution.
How to solve this problem?
I am using CentOS 6.5 x64, and already tried this with node 0.11.13, 0.11.14 (both prebuild, and compiled as well) with no success.
FIrst of all, what "[unknown]" means is the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has a 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, the next most likely numbers are 1 and 3. (If you only get 1 you keep going until you get 2.) Here's the distribution:)
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.
Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges the answer to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, at least my interpretation of it is, why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames in to meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post processed.
The answer to the question is that although the prerequisite set up is correct: the correct node version, the correct argument was present to generate the function names (--perf-basic-prof), the generated perf map file must be owned by root for perf script to produce the expected output.
That's it!
Writing some new scripts today I hit apon this directing me to this SO question.
Here's a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html

Some questions regarding game loop, tick and real time programming

First I want to apologize for my approximate English, as I'm French. I'm currently making a real-time game in java, using LWJGL.
I have some questions regarding game loops:
I'm running the rendering routine in a thread. Is it a good idea? Usually, the rendering routine is fairly slow and should not slow down the world update (tick) routine, which is way more important. So I guess using a thread here seems like a good idea (minus the complications from using a thread).
In the world update routine, I'm updating a list of entities with the current time. Each entity can then compute their own deltaTime, corresponding to the last time they were updated. This differs from the usual update loop, which updates every entity in the list with the same deltaTime. This seemed appropriate because of the threaded rendering. Is it a good idea? Should I use the second method instead? If so, is the threaded rendering still needed? If so, do I have to add a maximum deltaTime?
In general, is it a good idea to have a maximum deltaTime?
Thanks for your time!
Is it a good idea? Separate threads are fairly advanced stuff, I see no reason to do multithreading to begin with. All the mobile games I have worked on so far have not needed multiple threads, even though they are 'real-time'. Hardcore PC and console games are where multithreading really starts to come into play. Here is a link to a recent talk on the subject if interested : http://archive.assembly.org/2011/seminars/adventures-in-multithreaded-gameplay-coding.
Sounds like this could cause some strange things if the physics are not handled in one go. Not sure about this. Colliding an object that has already been updated to another position with an object that comes another time, for example, correcting this sort of situation may become problematic? Fast moving collisions may need to be subdivided, which may be why you have the separate update thread, but why not have them all calculated as happening at the same time?
'Variable timestep' and 'Fixed timestep' are the options available for rendering. Most games at the moment seem to choose a 30 fps fixed timestep. The rendering has to be kept under the limits so no catching up should be needed.
One problem with variable timestep is you are forced to pass deltaTime to all time-dependent areas. Fixed timestep is handy as you can assume you are running at say 30 fps, and use that value everywhere. It is a preferred method at the moment as far as I know.
Though this question is a few years old…
AFAIK,
Rendering is usually done in separated processor — GPU, so they're already a separated thread. But, drawing command must be processed by graphics driver (which is running in CPU) before dispatched to GPU, and this processing may be saved by being multi-threaded. Anyway in this case, you're responsible to manage synchronization between logics and rendering thread.
Generally speaking, games are all about interactions between objects, and it's very hard to divide state-graph into fully separated divisions. As a result, whole game state usually becomes single graph, and this graph cannot be updated while being rendered. In this case, you have no benefit by being multi-threaded.
If you can keep a separated immutable data for rendering, than you may gain some benefit from rendering in separated thread. But otherwise, I don't recommend it.
In addition, you should consider GC if you truly want a realtime game. GC related performance issues usually the biggest obstacles to make realtime stuffs.

text-based viewer for profiling results

Do you know of a text-based application for viewing results of application profiling? The profiling results basically contain a list of C++ function call backtraces and how often these backtraces were encountered; now I'm looking for a console tool to analyze the raw data (which backtrace occurred most often; which function was called most often, independent of call trace...).
So far I've created callgrind-compatible files from the raw data and then used the excellent KCachegrind tool for analysis; but now I'm also looking for a tool that works without on text-based terminal. Any ideas?
Take a look at callgrind_annotate.
This command reads in the profile data, and prints a sorted lists of functions, optionally with source annotation.
I wrote such a viewer once. It focussed on a line of code, showing the percent of samples running through that line, and a butterfly view allowing transitions to superior or subordinate lines of code.
It made a nice demo, but did I really use it? Not for long.
(I'm assuming the stack samples have been taken during the interval that you wish to speed up, i.e. not during user-wait.)
The thing is, the program is probably doing something wasteful in that time. (If it is not, you can't speed it up.)
Whatever that wasteful thing is, it consists of some percent of time being spent for poor reasons, like 10%, 50%, 90%, or whatever. During that time, it is on the stack, so an examination of the stack samples will show it.
And, you don't have to look at very many of them. If something is taking 50% of time, 1000 samples will show it on about 500, and 10 samples will show it on about 5. The larger number of samples will estimate the percentage with an extra digit of precision. If your goal is to isolate the problem so you can fix it, you don't need that extra digit.
So, a tool that shows you, by line, the percent of stack samples going through that line is a very nice thing to have, because the wasteful code will appear on it, showing the percentage.
What it does not show you is the reason why the statement is being executed, which is how you know if it's wasteful. Looking at the statement's context on the stack does tell you that.
So even though I had the viewer, I just ended up examining the samples themselves, and only about 10 or 20 of them. The bigger the percentage is, the smaller the number of samples I need to look at before I find it. Here's an example.

When using Direct3D, how much math is being done on the CPU?

Context: I'm just starting out. I'm not even touching the Direct3D 11 API, and instead looking at understanding the pipeline, etc.
From looking at documentation and information floating around the web, it seems like some calculations are being handled by the application. That, is, instead of simply presenting matrices to multiply to the GPU, the calculations are being done by a math library that operates on the CPU. I don't have any particular resources to point to, although I guess I can point to the XNA Math Library or the samples shipped in the February DX SDK. When you see code like mViewProj = mView * mProj;, that projection is being calculated on the CPU. Or am I wrong?
If you were writing a program, where you can have 10 cubes on the screen, where you can move or rotate cubes, as well as viewpoint, what calculations would you do on the CPU? I think I would store the geometry for the a single cube, and then transform matrices representing the actual instances. And then it seems I would use the XNA math library, or another of my choosing, to transform each cube in model space. Then get the coordinates in world space. Then push the information to the GPU.
That's quite a bit of calculation on the CPU. Am I wrong?
Am I reaching conclusions based on too little information and understanding?
What terms should I Google for, if the answer is STFW?
Or if I am right, why aren't these calculations being pushed to the GPU as well?
EDIT: By the way, I am not using XNA, but documentation notes the XNA Math Library replaces the previous DX Math library. (i see the XNA Library in the SDK as a sheer template library).
"Am I reaching conclusions based on too little information and understanding?"
Not as a bad thing, as we all do it, but in a word: Yes.
What is being done by the GPU is, generally, dependent on the GPU driver and your method of access. Most of the time you really don't care or need to know (other than curiosity and general understanding).
For mViewProj = mView * mProj; this is most likely happening on the CPU. But it is not much burden (counted in 100's of cycles at the most). The real trick is the application of the new view matrix on the "world". Every vertex needs to be transformed, more or less, along with shading, textures, lighting, etc. All if this work will be done in the GPU (if done on the CPU things will slow down really fast).
Generally you make high level changes to the world, maybe 20 CPU bound calculations, and the GPU takes care of the millions or billions of calculations needed to render the world based on the changes.
In your 10 cube example: You supply a transform for each cube, any math needed for you to create the transform is CPU bound (with exceptions). You also supply a transform for the view, again creating the transform matrix might be CPU bound. Once you have your 11 new matrices you apply the to the world. From a hardware point of view the 11 matrices need to be copied to the GPU...that will happen very, very fast...once copied the CPU is done and the GPU recalculates the world based on the new data, renders it to a buffer and poops it on the screen. So for your 10 cubes the CPU bound calculations are trivial.
Look at some reflected code for an XNA project and you will see where your calculations end and XNA begins (XNA will do everything is possibly can in the GPU).

Resources