OpenCV and haartraining - how to reduce time for calculation? - multithreading

Is it possible to improve the performance of the haartraining application? As far as I can tell, it uses only one thread. Does the nature of its algorithm exclude multithreading? I'm looking for ways to speed up the computation of the classifier, so my question is: is the only way to decrease calculation time to use a faster processor, i.e. does the number of cores not matter?

Related

How to determine sample size given few parameters

How do I determine the sample size, given that there is a 20-percentage-point reduction [before change – after change = 20%] with a 95% confidence level and 90% power? Any pointers on how to solve this?
A good first step is always to think about what kind of test you plan to use. From the very little information you give, a paired t-test (or a one-sample t-test comparing the differences to zero) is a likely candidate.
You can now google for "statistical power of t test", to which you can add the name of any computer language or statistics software you plan to use. Except maybe for educational purposes, I'd advise computing statistics not by hand but via software.
Kind of an obvious option for statistics software on Stack Overflow is R. In R you'll find solutions to many sample size or power calculations in the package pwr. Here is a link to a getting-started text: https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html
The pwr.t.test function is a good fit for your problem. Google will readily point you to alternatives for Python, Julia, and SPSS, and I assume for C++, Java, and JavaScript as well.
However, you will have to make assumptions about the variance or the effect size. Will each value be reduced by almost exactly 20%, or will some be reduced a lot and some even increase? That is of utmost importance to the question. You will need only one observation if there is no variance, a small number of observations if there is little variance, and a large number of observations if there is a lot of variance.
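For a rough back-of-the-envelope version of what pwr.t.test computes (a normal approximation I am adding here, not part of the original answer), the required sample size for a two-sided test is

n ≈ ((z_{1-α/2} + z_{1-β}) · σ / δ)²

where δ is the mean reduction you want to detect (your 20 points) and σ is the standard deviation of the individual before–after differences. With 95% confidence (z_{0.975} ≈ 1.96) and 90% power (z_{0.90} ≈ 1.28) this gives n ≈ 10.5 · (σ/δ)²; the exact t-based answer needs a few more observations on top of that. You can see directly how the required n blows up as the variance σ² grows, which is the point made above.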

Is it possible to extract instruction specific energy consumption in a program? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 4 years ago.
What I mean is: given a source code file, is it possible to extract energy consumption levels for a particular code block or a single instruction, using a tool like perf?
Use jRAPL, which is a framework for profiling Java programs running on CPUs.
For example, the following code snippet measures the energy consumption of a code block as the difference between the counter readings taken before and after it:
double beginning = EnergyCheck.statCheck(); // energy counter reading before the workload
doWork();                                   // the code block being measured
double end = EnergyCheck.statCheck();       // reading after the workload
System.out.println(end - beginning);        // energy consumed by doWork()
The framework is described in detail in the paper "Data-Oriented Characterization of Application-Level Energy Optimization": http://gustavopinto.org/lost+found/fase2015.pdf
There are tools for measuring power consumption (see @jww's comment for links), but they don't even try to attribute consumption to specific instructions the way perf record can statistically sample event -> instruction correlations.
You can get an idea by running a whole block of the same instruction, like you'd do when trying to microbenchmark the throughput or latency of an instruction. Divide energy consumed by number of instructions executed.
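As a minimal sketch of that approach (my addition, not from the answer): on Linux with Intel RAPL exposed through the powercap sysfs interface, you can read the cumulative package energy counter around a long block of dependent ANDs and divide. The sysfs path, the x86-64 GCC/Clang inline asm, and the iteration count are assumptions; reading the counter may require root, and loop overhead is included in the estimate.

// Sketch: estimate energy per instruction for a chain of dependent ANDs.
// Assumes Linux + Intel RAPL via powercap sysfs, x86-64, GCC/Clang.
#include <cstdint>
#include <fstream>
#include <iostream>

static uint64_t read_energy_uj() {
    // Cumulative package energy in microjoules (the counter wraps eventually).
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    uint64_t uj = 0;
    f >> uj;
    return uj;
}

int main() {
    const uint64_t iters = 1000000000ULL;  // 4e9 ANDs total
    uint64_t x = ~0ULL;
    uint64_t before = read_energy_uj();
    for (uint64_t i = 0; i < iters; ++i) {
        // Four dependent ANDs per iteration: the chain keeps one ALU busy
        // while minimizing other activity (loop overhead is still included).
        asm volatile("and %1, %0\n\t"
                     "and %1, %0\n\t"
                     "and %1, %0\n\t"
                     "and %1, %0"
                     : "+r"(x)
                     : "r"(i));
    }
    uint64_t after = read_energy_uj();
    double joules = (after - before) * 1e-6;
    std::cout << joules / (4.0 * iters) * 1e9 << " nJ per AND (upper bound)\n";
    return 0;
}

Swapping the AND chain for independent NOPs (or independent ANDs) in the same harness gives the comparison discussed below.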
But a significant fraction of CPU power consumption happens outside of the execution units, especially for out-of-order CPUs running relatively cheap instructions (like scalar ADD / AND), or comes from different memory subsystem behaviour triggered by different access patterns (like hardware prefetching).
Different patterns of data dependencies and latencies might matter. (Or maybe not, maybe out-of-order schedulers tend to be constant power regardless of how many instructions are waiting for their inputs to be ready, and setting up bypass forwarding vs. reading from the register file might not be significant.)
So a power or energy-per-instruction number is not directly meaningful, mostly only relative to a long block of dependent AND instructions or something. (Should be one of the lowest-power instructions, probably fewer transistors flipping inside the ALU than with ADD.) That's a good baseline for power microbenchmarks that run 1 instruction or uop per clock, but maybe not a good baseline for power microbenches where the front-end is doing more or less work.
You might want to investigate how dependent AND vs. independent NOP or AND instructions affect energy per time or energy per instruction. (i.e. how does power outside the execution units scale with instructions-per-clock and/or register read / write-back.)

Did I test the ArrayFire performance incorrectly?

I cannot figure out what's wrong. I mean, the speed is way too fast: 1 million items and 10 million items both take basically the same 0.0005 seconds to compute on my machine. It is so fast that it looks like it isn't doing anything, yet the resulting data is actually correct.
It is mind-boggling, because if I do a similar computation in a sequential loop without storing the results in an array, it is not just slower by a factor of the number of cores, but something like 1000 times slower than ArrayFire.
So, maybe I wasn't using the timer correctly?
Do you think it doesn't actually compute the data right away? Maybe it just sets up some kind of shadow marker, and when I call myArray.host(), it starts doing all the actual computations?
Their website says there is some kind of JIT to bundle the computations:
ArrayFire uses Just In Time compilation to combine many light weight functions into a single kernel launch. This along with our easy-to-use API allows users to not only quickly prototype their algorithms, but also get the best out of the underlying hardware.
I start/stop my timer right before/after a few ArrayFire computations, and it is just insanely fast. Maybe I'm testing it wrong? What's the proper way to test ArrayFire performance?
Never mind, I found out what to do.
Based on the examples, I should be using af::timeit(function) instead of using af::timer myself. The times af::timeit reports are much slower, but the results scale much more reasonably when I increase the size 10x. ArrayFire doesn't actually compute right away, which is why timing it myself with af::timer didn't work.
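For reference, here is a minimal sketch of both timing approaches (my code, not from the original post; the array size and the arithmetic are arbitrary placeholders). The key points are af::eval() to force the JIT-fused kernel to actually launch and af::sync() to wait for the device before a manual timer is stopped:

// Sketch: timing ArrayFire correctly despite its lazy JIT evaluation.
#include <arrayfire.h>
#include <cstdio>

static void computation() {
    af::array a = af::randu(10000000);  // 10 million elements
    af::array b = a * 2.0f + 1.0f;      // elementwise ops, fused by the JIT
    af::eval(b);                        // force the fused kernel to launch
    af::sync();                         // block until the device is done
}

int main() {
    // af::timeit runs the function several times and returns the mean seconds.
    double mean = af::timeit(computation);
    printf("af::timeit: %g s\n", mean);

    // Manual timing with af::timer also works, as long as the work is forced
    // to run and the device is synced inside the timed region.
    af::timer t = af::timer::start();
    computation();
    printf("af::timer:  %g s\n", af::timer::stop(t));
    return 0;
}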
thank you

plot speed up curve vs number of OpenMP threads - scalability?

I am working on a C++ code which uses OpenMP threads. I have plotted the speedup curve versus the number of OpenMP threads, together with the theoretical curve (if the code were fully parallelizable).
Here is the plot (speedup vs. number of OpenMP threads):
From this picture, can we say the code is not scalable (from a parallelization point of view)? i.e., the code does not run twice as fast with 2 OpenMP threads, four times as fast with 4 threads, etc.?
Thanks
For the code that barely achieves 2.5x speedup on 16 threads, it is fair to say that it does not scale. However "is not scalable" is often considered a stronger statement. The difference, as I understand it, is that "does not scale" typically refers to a particular implementation and does not imply inherent inability to scale; in other words, maybe you can make it scale if bottlenecks are eliminated. On the other hand, "is not scalable" usually means "you cannot make it scale, at least not without changing the core algorithm". Assuming such meaning, one cannot say "a problem/code/algorithm is not scalable" only looking at a chart.
On an additional note, it's not always reasonable to expect perfect scaling (2x with 2 threads, 4x with 4 threads, etc.). A curve that is "close enough" to ideal scaling might still be considered to show good scalability, and what "close enough" means may depend on a number of factors. It can be useful to think in terms of parallel efficiency, rather than speedup, when scalability is the question. For example, if parallel efficiency is 0.8 (or 80%) and does not drop as the number of threads increases, that could be considered good scalability. Also, it's possible that a program scales well up to a certain number of threads, but stays flat or even slows down when more resources are added.
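As a minimal sketch of how such numbers can be collected (my code, not the poster's; the reduction loop is a stand-in for the real workload), parallel efficiency is simply speedup divided by thread count:

// Sketch: measure speedup and parallel efficiency for several thread counts.
#include <cstdio>
#include <omp.h>

int main() {
    const long long n = 200000000;  // stand-in workload size
    double t1 = 0.0;                // single-thread baseline time
    for (int threads : {1, 2, 4, 8, 16}) {
        omp_set_num_threads(threads);
        double start = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long long i = 1; i <= n; ++i)
            sum += 1.0 / (double)i;  // cheap, embarrassingly parallel work
        double elapsed = omp_get_wtime() - start;
        if (threads == 1) t1 = elapsed;
        printf("%2d threads: %6.3f s  speedup %5.2fx  efficiency %3.0f%%  (sum=%g)\n",
               threads, elapsed, t1 / elapsed,
               100.0 * t1 / elapsed / threads, sum);
    }
    return 0;
}

Printing the efficiency column directly makes the flattening of a curve like the one above obvious: ideal scaling keeps it near 100%, while a code that tops out at 2.5x on 16 threads sits around 16%.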

The best choice for random number generator

There are so many random number generators out there. Some standard ones are questionably slow. Some claim to be of high quality and speed. Some claim to be of even higher quality. Some claim to be faster still and of better quality. Some claim speed, but at the cost of quality.
One fact I know is that mwc-random is used by the Criterion benchmarking library, which speaks for itself, and its claims are very promising.
Since there are at least two qualities to every generator, speed and the quality of the generated numbers, I'll split the question of choosing the best generator into three categories:
The fastest
The one generating the most random numbers
The one with the optimal combination of both of these qualities
So which is which and why?
I can only speak about mwc-random.
Speed. It is fast: ~15 ns per Word32 on a Phenom II. If you want to measure how fast it is on your computer, it comes with a benchmark suite. Still, it is possible to trade period for speed: a xorshift RNG should be faster, but it has a much shorter period, 2^32 or 2^64 instead of 2^8222.
Randomness. mwc-random uses the MWC256 algorithm (another name: MWC8222), which is not cryptographically secure but fares well in randomness tests. In particular, mwc-random passes the dieharder randomness tests.
