How to determine sample size given few parameters - statistics

How do I determine the sample size, given a 20 percentage point reduction (before change - after change = 20%), a 95% confidence level and 90% power? Any pointers on how to solve this?

A good first step is always to think about what kind of test you plan to use. From the very little information you give, a paired t-test (or a one-sample t-test comparing the differences to zero) is a likely candidate.
You can now google for "statistical power of t test", adding the name of any programming language or statistics software you plan to use. Except maybe for educational purposes, I'd advise computing statistics not by hand but via software.
An obvious option for statistics software on Stack Overflow might be R. In R you'll find solutions to many sample size or power calculations in the package pwr. Here is the link to a getting-started text: https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html
The pwr.t.test function is good for your problem. Google will readily help you find alternatives for Python, Julia and SPSS, and I assume for C++, Java and JavaScript as well.
However, you will have to make assumptions about the variance or the effect size. Will each value be reduced by almost exactly 20%, or will some be reduced a lot and some increase? That is of utmost importance to the question. You will need only one observation if there is no variance, a small number of observations if there is little variance, and a large number of observations if there is a lot of variance.
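To make the dependence on variance concrete, here is a minimal Python sketch. It is not the pwr.t.test implementation: it uses the textbook normal approximation n = ((z_(1-alpha/2) + z_power) / d)^2 with d = mean difference / standard deviation of the differences. The function name and the standard deviations tried below are assumptions for illustration only; an exact t-based calculation typically gives a somewhat larger n, especially when n is small.

```python
import math
from statistics import NormalDist


def paired_sample_size(mean_diff, sd_diff, alpha=0.05, power=0.90):
    """Approximate number of pairs for a two-sided paired t-test.

    Normal approximation: n = ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d = |mean_diff| / sd_diff is the standardized effect size.
    """
    d = abs(mean_diff) / sd_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided 95% confidence
    z_power = NormalDist().inv_cdf(power)          # 90% power
    return math.ceil(((z_alpha + z_power) / d) ** 2)


# The 20-point reduction comes from the question; the standard deviations
# of the before/after differences are assumed values, not given there.
for sd in (10, 20, 40):
    print(f"sd of differences = {sd}: about {paired_sample_size(20, sd)} pairs")
```

Doubling the assumed standard deviation roughly quadruples the required number of pairs, which is why the variance assumption matters so much.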

Related

Is SRM in Google Optimize (Bayesian Model) a thing

So checking for Sample Ratio Mismatch (SRM) is good for data quality.
But in Google Optimize I can't influence the sample size or do anything about it.
My problem is that out of 15 A/B tests, only 2 experiments had no SRM.
(I used this tool: https://www.lukasvermeer.nl/srm/microsite/)
On the other hand, the Bayesian model deals with things like different sample sizes, so I shouldn't need to worry about it, but opinions on this topic differ.
Is SRM really a problem in Google Optimize or can I ignore it?
SRM affects Bayesian experiments just as much as it affects frequentist ones. SRM happens when you expect a certain traffic split, but end up with a different one. Google Optimize is a black box, so it's impossible to tell whether the uneven sample sizes you are experiencing are intentional or not.
Lots of things can cause an SRM. For example, if your variation's JavaScript code has a bug in some browsers, those users may not be tracked properly. Another common cause is a variation that increases page load times: more people will abandon the page and you'll see a smaller sample size than expected.
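For reference, an SRM check is usually a chi-square goodness-of-fit test of the observed visitor counts against the expected split. The following Python sketch assumes two variations and an intended 50/50 split; the function name and the example counts are made up, and this is not how Google Optimize or any particular tool implements it.

```python
import math


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """P-value of a chi-square goodness-of-fit test with 1 degree of freedom."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical visitor counts for two experiments.
print(srm_p_value(5000, 5050))  # small imbalance
print(srm_p_value(5000, 5300))  # larger imbalance
```

A very small p-value (the cut-off is a judgment call, and strict thresholds are common) suggests that the split differs from what was configured, regardless of which stats engine analyses the conversions afterwards.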
That lack of statistical rigor and transparency is one of the reasons I built Growth Book, which is an open source A/B testing platform with a Bayesian stats engine and automatic SRM checks for every experiment.

Unknown events in nodejs/v8 flamegraph using perf_events

I am trying to do some Node.js profiling using Linux perf_events as described by Brendan Gregg here.
The workflow is as follows:
1. Run node >0.11.13 with --perf-basic-prof, which creates a /tmp/perf-(PID).map file where the JavaScript symbol mappings are written.
2. Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
3. Fold stacks using the stackcollapse-perf.pl script from this repository
4. Generate an SVG flame graph using the flamegraph.pl script
I get the following result (which looks really nice at first):
The problem is that there are a lot of [unknown] elements, which I suppose should be my Node.js function calls. I assume that the whole process fails somewhere at point 3, where the perf data should be folded using the mappings generated by node/v8 executed with --perf-basic-prof. The /tmp/perf-PID.map file is created and some mappings are written to it during node execution.
How can I solve this problem?
I am using CentOS 6.5 x64, and have already tried this with node 0.11.13 and 0.11.14 (both prebuilt and compiled myself) with no success.
First of all, what "[unknown]" means is that the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
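The "collect all those rectangles" step can be sketched in Python on folded stacks, the text format that stackcollapse-perf.pl emits (one line per unique stack: frames separated by semicolons, then a space and the sample count). The function name and the sample data below are invented for illustration.

```python
def inclusive_share(folded_lines, func):
    """Fraction of all samples whose stack contains `func` anywhere."""
    total = 0
    hits = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        if func in stack.split(";"):
            hits += n
    return hits / total if total else 0.0


# Made-up folded stacks: Foo is reached from three different call sites.
folded = [
    "main;handler;Foo 4",
    "main;parse;helper;Foo 3",
    "main;render;Foo;inner 4",
    "main;render;other 89",
]
print(inclusive_share(folded, "Foo"))
```

In this made-up data no single Foo rectangle is wider than 4% of the graph, yet together they account for 11 of 100 samples.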
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has an 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, the next most likely numbers are 1 and 3. (If you only get 1 you keep going until you get 2.) Here's the distribution:)
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
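The numbers in the two asides above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic (binomial probabilities for 20 samples at 11%, and the expected number of samples needed to see Foo twice); the variable names are arbitrary.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n independent samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.11
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Hit counts ordered from most to least likely.
ranked = sorted(range(n + 1), key=lambda k: pmf[k], reverse=True)
for k in ranked[:3]:
    print(f"{k} samples through Foo: probability {pmf[k]:.3f}")

# Expected number of samples until the second hit (negative binomial mean).
expected_samples_for_two = 2 / p
print(f"expected samples to see Foo twice: {expected_samples_for_two:.1f}")
```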
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.
Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges answerers to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, at least in my interpretation of it: why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames into meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post processed.
The answer to the question is that, although the prerequisite setup is correct (the correct node version, and the correct argument, --perf-basic-prof, to generate the function names), the generated perf map file must be owned by root for perf script to produce the expected output.
That's it!
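A quick way to check this before running perf script is to look at the owner of the map file. The Python sketch below only inspects ownership and does not change anything; the helper names are invented, and the path pattern is the /tmp/perf-PID.map one from the question.

```python
import os


def perf_map_path(pid):
    """Path of the symbol map that node --perf-basic-prof writes for a PID."""
    return f"/tmp/perf-{pid}.map"


def owned_by_root(path):
    """True if the file at `path` is owned by uid 0 (root)."""
    return os.stat(path).st_uid == 0


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_map.py <node pid>
    path = perf_map_path(sys.argv[1])
    print(path, "owned by root:", owned_by_root(path))
```

If the check prints False, changing the owner of the file to root (for example with chown, run as root) before running perf script is the fix this answer describes.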
While writing some new scripts today I hit upon this problem, which led me to this SO question.
Here's a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html

Quickest and easiest algorithm for comparing the frequency content of two sounds

I want to take two sounds that contain a dominant frequency and say 'this one is higher than this one'. I could do FFT, find the frequency with the greatest amplitude of each and compare them. I'm wondering if, as I have a specific task, there may be a simpler algorithm.
The sounds are quite dirty with many frequencies, but contain a clear dominant pitch. They aren't perfectly produced sine waves.
Given that the sounds are quite dirty, I would suggest starting to develop the algorithm with the output of an FFT as it'll be much simpler to diagnose any problems. Then when you're happy that it's working you can think about optimising/simplifying.
As a rule of thumb when developing this kind of numeric algorithm, I always try to operate first in the most relevant domain (in this case you're interested in frequencies, so analyse in frequency space) at the start, and once everything is behaving itself consider shortcuts/optimisations. That way you can test the latter solution against the best-performing former.
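As a starting point for the frequency-domain approach, here is a small Python sketch that finds the strongest frequency bin of each sound and compares them. It uses a plain (slow) DFT with a Hann window so that it needs only the standard library; in practice you would use an FFT routine such as numpy.fft.rfft. The function names, the sample rate and the synthetic "dirty" test tones are all assumptions for illustration.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest DFT bin, ignoring the DC bin."""
    n = len(samples)
    # A Hann window reduces leakage from tones that fall between bins.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def dirty_tone(freq, sample_rate=8000, n=512):
    """Synthetic test signal: a dominant pitch plus two weaker components."""
    return [math.sin(2 * math.pi * freq * t / sample_rate)
            + 0.3 * math.sin(2 * math.pi * 2.7 * freq * t / sample_rate)
            + 0.2 * math.sin(2 * math.pi * 0.37 * freq * t / sample_rate + 1.0)
            for t in range(n)]


low = dominant_frequency(dirty_tone(440), 8000)
high = dominant_frequency(dirty_tone(660), 8000)
print("second sound is higher" if high > low else "first sound is higher")
```

The resolution is limited to one bin (sample_rate / n), so two pitches closer together than that cannot be told apart with this sketch.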
In the general case, decent pitch detection/estimation generally requires a more sophisticated algorithm than looking at FFT peaks, not a simpler algorithm.
There are a variety of pitch detection methods ranging in sophistication from counting zero-crossing (which obviously won't work in your case) to extremely complex algorithms.
While frequency-domain methods seem most appropriate, it's not as simple as "taking the FFT". If your data is very noisy, you may have spurious peaks that are higher than what you would consider to be the dominant frequency. One solution is to window overlapping segments of your signal, compute STFTs, and average the results. But this raises more questions: how big should the windows be? In this case, it depends on how far apart you expect those dominant peaks to be, how long your recordings are, etc. (Note: FFT methods can resolve to better than one bin by taking phase information into account. In that case, you would have to do something more complex than averaging all your FFT windows together.)
Another approach would be a time-domain method, such as YIN:
http://recherche.ircam.fr/equipes/pcm/cheveign/pss/2002_JASA_YIN.pdf
Wikipedia discusses some more methods:
http://en.wikipedia.org/wiki/Pitch_detection_algorithm
You can also explore some more methods in chapter 9 of this book:
http://www.amazon.com/DAFX-Digital-Udo-ouml-lzer/dp/0471490784
You can get MATLAB source code for YIN from chapter 9 of that book here:
http://www2.hsu-hh.de/ant/dafx2002/DAFX_Book_Page_2nd_edition/matlab.html

The best choice for random number generator

There are so many random number generators out there. Some standard ones are questionably slow. Some claim to be of high quality and speed. Some claim to be of higher quality. Some claim to be even faster and of better quality. Some claim speed but not quality.
One fact I know is that mwc-random is used by the Criterion benchmarking library, which speaks for itself, and the claims are very promising.
Since there are at least two qualities to every generator, the robustness and the quality of the generated numbers, I'll split the question of choosing the best generator into three categories:
The fastest
The one generating the most random number
The one having the optimal combination of both of these qualities at adequate rate
So which is which and why?
I can only speak about mwc-random.
It is fast: ~15 ns per Word32 on a Phenom II. If you want to measure how fast it is on your computer, it comes with a benchmark set. Still, it is possible to trade period for speed. A Xorshift RNG should be faster but has a shorter period: 2^32 or 2^64 instead of 2^8222.
Randomness. mwc-random uses the MWC256 algorithm (another name: MWC8222), which is not cryptographically secure but fares well in randomness tests. In particular, mwc-random passes the dieharder randomness test.
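To show why Xorshift generators are so cheap, here is a Python sketch of Marsaglia's 32-bit xorshift with the 13/17/5 shift triple. It is for illustration only and says nothing about mwc-random's internals; the generator name is arbitrary. Its period is 2^32 - 1, which matches the "roughly 2^32" figure above, and like MWC256 it is not cryptographically secure.

```python
def xorshift32(seed):
    """Yield an endless stream of 32-bit values from Marsaglia's xorshift32."""
    if not 0 < seed < 2**32:
        raise ValueError("seed must be a non-zero 32-bit integer")
    state = seed
    while True:
        # Three shift-and-xor steps; the mask keeps the state at 32 bits.
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        yield state


gen = xorshift32(1)
first_values = [next(gen) for _ in range(3)]
print(first_values)
```

The whole state is a single machine word and each step is three shifts and three xors, which is where the speed comes from and also why the period is so much shorter.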

Open source or free financial analysis programs/libraries

I'm looking for something containing similar functions to Matlab’s financial and financial derivatives toolbox but don’t have the cash to spend on matlab. I would appreciate any info on free or open source libraries or programs that will let me easily calculate interest rates, risk etc.
How about JQuantLib or QuantLib?
How about the Octave financial functions?
http://www.gnu.org/software/octave/doc/interpreter/Financial-Functions.html#Financial-Functions
I'm not familiar with the Matlab toolbox, so you'll have to judge for yourself.
GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language.
Exactly what functions do you need? How advanced? There are some financial functions in .NET.
I'm sure it doesn't cover everything, but calculating interest and a few other things is no problem:
http://msdn.microsoft.com/en-us/library/daksysx3(VS.80).aspx
Calculate depreciation: DDB, SLN, SYD
Calculate future value: FV
Calculate interest rate: Rate
Calculate internal rate of return: IRR, MIRR
Calculate number of periods: NPer
Calculate payments: IPmt, Pmt, PPmt
Calculate present value: NPV, PV
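If you only need a handful of these, the underlying formulas are short enough to write yourself. The Python sketch below shows compound future value and the level payment on an amortizing loan; the function names and the example numbers are made up, and the sign conventions are the plain textbook ones (all amounts positive), not necessarily those of the .NET or Octave functions.

```python
def future_value(present_value, rate, periods):
    """Value of a lump sum after compounding at `rate` per period."""
    return present_value * (1 + rate) ** periods


def payment(principal, rate, periods):
    """Level payment per period that repays `principal` at `rate` per period."""
    if rate == 0:
        return principal / periods
    return principal * rate / (1 - (1 + rate) ** -periods)


# Hypothetical examples: 1000 invested at 5% per year for 10 years, and a
# 100000 loan at 0.5% per month (6% nominal per year) over 360 months.
print(round(future_value(1000, 0.05, 10), 2))
print(round(payment(100000, 0.005, 360), 2))
```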
