Misinterpretation of Sum Variance Law - statistics

I'm trying to understand how to combine the variances of batches of observations. My understanding is that you can simply sum them, according to the sum variance law. But my experiments seem to disagree with this theorem.
Here is the python code used:
import numpy as np
x = np.random.rand(100000)
expected = np.var(x)
print("expected:", expected)
for n in [2,4,5,10,20,40,50,100,1000]:
    s = np.split(x, n)
    sigma_sq = [np.var(v) for v in s]
    result = np.sum(sigma_sq)
    print("result", n, ":", result, "(", np.abs(result - expected), ")")
the printed result is:
expected: 0.0832224743666757
result 2 : 0.16644455708841321 ( 0.08322208272173752 )
result 4 : 0.3328814911392468 ( 0.24965901677257113 )
result 5 : 0.4161068624507617 ( 0.33288438808408605 )
result 10 : 0.832183555011673 ( 0.7489610806449972 )
result 20 : 1.664227484757454 ( 1.5810050103907785 )
result 40 : 3.3278497945218355 ( 3.2446273201551596 )
result 50 : 4.159353197179163 ( 4.076130722812487 )
result 100 : 8.314084653397305 ( 8.23086217903063 )
result 1000 : 82.397691161862 ( 82.31446868749532 )
As the number of splits grows, the difference between the expected value and the result grows.
However, if I divide the sums by n (i.e. average them), the error is acceptable (on the order of 1e-5).
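For reference, the averaged variant is a one-line change inside the loop above (a sketch reusing the loop's sigma_sq):

    result = np.mean(sigma_sq)   # same as np.sum(sigma_sq) / n; tracks expected to ~1e-5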
I must be misinterpreting the sum variance law, but I'm not sure where my misunderstanding is.

I think there are two possible reasons for this:
1. With a small sample, the computed variance can be off, i.e. not really the true variance of the underlying distribution.
2. The two samples may not be totally independent.
The best way to address both is to use two very large samples. You can run the following code and see that the variance of the sum of the two lists is close to the sum of the two variances. This is not the case when you replace 10000 with a smaller number, say, 10, 100, or 1000.
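A minimal sketch of the kind of code being described (the sample size of 10000 is from the text; the variable names are my own):

import numpy as np

n = 10000  # replace with 10, 100 or 1000 to watch the agreement degrade
x = np.random.rand(n)
y = np.random.rand(n)

# sum variance law: Var(X + Y) = Var(X) + Var(Y) for independent X and Y
print("var(x + y):      ", np.var(x + y))
print("var(x) + var(y): ", np.var(x) + np.var(y))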

Related

Divide a value in excel by a set of preset values to find out how many of each are needed

I am curious if there is a way to make my life easier. In Excel I am producing a total value, say 750, and need to find out how many orders of pipe I need from the values 50, 100, 200, 250, 500. Is there any way to have Excel take a value and then return how many of each of these numbers I would need, so for the 750 case one 500 and one 250?
Currently the solution is just worked out in my head
Assuming you want to try to fit pipes in decreasing order of size, and that you have access to the required functions, you can use REDUCE, as demonstrated here, to step through the sizes and successively divide by each one, although the formula is a little laboured:
=LET(pipes,{500;250;200;100;50},reqd,750,DROP(REDUCE(0,pipes,
LAMBDA(a,c,VSTACK(a,QUOTIENT(reqd-IF(ROWS(a)>1,SUM(DROP(a,1)*TAKE(pipes,ROWS(a)-1)),0),c)))),1))
As pointed out by @Jos Woolley, this may not give you the answer you want if the total is something like 749. It will fit as many values in as possible and give the result 500+200, total 700 (remainder 49). You could fix it perhaps by rounding up to the next multiple of 50.
For the example of 823, you would have:
=LET(pipes,{500;250;200;100;50},reqd,CEILING(823,MIN(pipes)),DROP(REDUCE(0,pipes,
LAMBDA(a,c,VSTACK(a,QUOTIENT(reqd-IF(ROWS(a)>1,SUM(DROP(a,1)*TAKE(pipes,ROWS(a)-1)),0),c)))),1))
which gives 500+250+100=850.
Well, I've got a bit obsessed with this now and I am determined to get a lambda working to find the optimal answer! I have been looking at the brute-force solution for finding the minimum number of coins required to make up a given total, in the reference mentioned previously, and have managed to translate it into a lambda using REDUCE:
Mincoins1 = LAMBDA(coins, m, v,
    IF(
        v <= 0,
        0,
        REDUCE(
            999,
            coins,
            LAMBDA(a, c,
                IF(v >= c, LET(mc, mincoins1.mincoins1(coins, m, v - c) + 1, IF(mc < a, mc, a)), a)
            )
        )
    )
)
This does give the correct answer, 2, for the case when you want to make up a value of 400 from the list of pipes given. The next step will be to modify the code to return the list of pipes which give that total (200,200).
https://www.enjoyalgorithms.com/blog/minimum-coin-change
Here is the lambda modified to return a string containing the chosen pipes:
Mincoins2= LAMBDA(coins, m, v,
IF(
v <= 0,
"",
REDUCE(
rept("x",999),
coins,
LAMBDA(a, c,
IF(v >= c, LET(mc, c&"|"&mincoins2.mincoins2(coins, m, v - c), IF(len(mc) < len(a), mc, a)), a)
)
)
)
);
It does work, BUT (and this is a big but) it hits a limit as soon as the value to be produced exceeds 1000, and you get a #VALUE! error. Disappointing. But interesting, I think, as a proof of concept.
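For comparison, here is a rough Python translation of the same brute-force recursion with memoization added, which both speeds it up and returns the chosen pipes (a sketch; the function names are mine, not from the lambda above):

from functools import lru_cache
from math import inf

def mincoins(coins, v):
    # minimum number of pipes summing exactly to v, plus one optimal choice
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def best(rem):
        if rem == 0:
            return 0, ()
        result = (inf, ())          # inf means "rem cannot be made exactly"
        for c in coins:
            if c <= rem:
                n, chosen = best(rem - c)
                if n + 1 < result[0]:
                    result = (n + 1, (c,) + chosen)
        return result

    return best(v)

print(mincoins([500, 250, 200, 100, 50], 400))   # (2, (200, 200))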
Not sure I understand the question, but let's try.
If you have 1450 to divide, have a formula that divides 1450 by your highest length (750) and then rounds it down.
So the formula would be something along the lines of: =ROUNDDOWN(1450 / 750; 0)
You will then get the answer that you need 1 of the length 750.
Then keep track of how much length you have remaining, with a formula like:
=1450 - 750 * [the answer from the previous formula = 1]. This leaves 700.
Then start over with the same thing, but divide 700 by 500 (the second largest size).
Your question is extremely difficult: one might at first think of this easy solution, starting with value_begin:
amount_of_500 = value_begin DIV 500; // integer division
temp = value_begin - 500 * amount_of_500;
amount_of_250 = temp DIV 250; // again integer division
temp = temp - 250 * amount_of_250;
amount_of_200 = temp DIV 200; // again integer division
temp = temp - 200 * amount_of_200;
...
However, this will not work because of the value 200, which is far too close to 250: just start with value_begin equal to 400 (the algorithm's solution: 250 + 100 + 50, while the best solution is 200 + 200).
Are you sure you need both 200 and 250 as possible numbers to divide by? If yes, you might have a serious problem getting this implemented.
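A quick Python sketch of that greedy procedure, showing the failure case described above (illustrative only; the names are mine):

def greedy(pipes, value):
    # take as many of each size as possible, largest first
    counts = {}
    for p in sorted(pipes, reverse=True):
        counts[p], value = divmod(value, p)
    return counts, value   # value is the unfilled remainder

print(greedy([500, 250, 200, 100, 50], 400))
# {500: 0, 250: 1, 200: 0, 100: 1, 50: 1} with remainder 0:
# three pipes, while the optimal answer is two (200 + 200)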

Calculating a custom probability distribution in python (numerically)

I have a custom (discrete) probability distribution defined roughly in the form p(x) = f(x) / sum(f(x') for x' in X), for a given discrete set X. Also, 0 <= x <= 1.
So I have been trying to implement it in python 3.8.2, and the problem is that the numerator and denominator both come out to be really small and python's floating point representation just takes them as 0.0.
After calculating these probabilities, I need to sample a random element from an array, whose each index may be selected with the corresponding probability in the distribution. So if my distribution is [p1,p2,p3,p4], and my array is [a1,a2,a3,a4], then probability of selecting a2 is p2 and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use the np.random.beta() in this case? Since the difference between the beta distribution and my actual distribution is only that the normalization constant differs and the domain is restricted to a few points.
Note: the probability mass function defined above is actually in the form given by Bayes' theorem, with f(x) = x^s * (1-x)^f, where s and f are fixed numbers for a given iteration (an unfortunate reuse of the name f). So the exact problem is that when s or f becomes really large, this thing goes to 0.
You could well compute things by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say
f(x) = x^s * (1-x)^t
so
logf(x) = s*log(x) + t*log(1-x)
and you want to compute, say,
p = f(x) / Sum{ y in X | f(y) }
so
p = exp( logf(x) - log Sum{ y in X | f(y) } )
  = exp( logf(x) - log Sum{ y in X | exp(logf(y)) } )
The only difficulty is in computing the second term, but this is a common problem, usually called logsumexp (scipy.special.logsumexp, for example, implements it). On the other hand, computing logsumexp is easy enough to do by hand.
We want
S = log( Sum{ i | exp(l[i]) } )
if L is the maximum of the l[i], then
S = log( exp(L) * Sum{ i | exp(l[i]-L) } )
  = L + log( Sum{ i | exp(l[i]-L) } )
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
This may, however, lose a little accuracy. A refinement would be to recognize the set A of indices where
l[i] >= L - eps   (eps a user-set parameter, e.g. 1)
And then compute
N = Sum{ i in A | exp(l[i]-L) }
B = log1p( Sum{ i not in A | exp(l[i]-L) } / N )
S = L + log(N) + B
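Putting this together for the question's f(x) = x^s * (1-x)^f (a sketch; the grid X, the exponents, and the array being sampled are placeholder assumptions, and scipy.special.logsumexp performs the max-shift described above):

import numpy as np
from scipy.special import logsumexp

s, f = 2000, 3000                      # placeholder exponents; naive x**s underflows
X = np.linspace(0.001, 0.999, 500)     # assumed discrete support in (0, 1)
arr = np.arange(len(X))                # the array to sample from, one entry per x

logf = s * np.log(X) + f * np.log1p(-X)   # log f(x) = s*log(x) + f*log(1-x)
p = np.exp(logf - logsumexp(logf))        # normalised probabilities, no underflow
p /= p.sum()                              # guard against rounding drift

sample = np.random.choice(arr, p=p)       # index i drawn with probability p[i]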

How to define 'phi' so I don't get an error for trying to multiply a non-int by a complex number in the exponential

Using:
import numpy as np

arr2 = np.array([np.arange(0, 20)] * 20, dtype=complex)
for a in range(arr2.shape[0]):
    for b in range(arr2.shape[1]):
        if a - b:   # nonzero, i.e. off-diagonal
            arr2[a, b] = 1 - (1j / np.sqrt(3))
        else:
            arr2[a, b] = np.e**(1j * phi * abs(a - b))   # if phi is a sequence, this raises the TypeError
to produce a matrix of size N * N (indexed by a and b above) gives the error: TypeError: can't multiply sequence by non-int of type 'complex'.
I need to define phi; however, I do not want Python to evaluate it numerically.
I want my output to be of the form below (this is just the first 3 terms of the first 3 rows of the matrix, where i is the imaginary unit):
( 1-(i/sqrt(3)) , e^(i{\phi}) , e^(2i{\phi}) , ... )
( e^(i{\phi}) , 1-(i/sqrt(3)) , e^(i{\phi}) , ... )
( e^(2i{\phi}) , e^(i{\phi}) , 1-(i/sqrt(3)) , ... )
( ... , ... , ... , ... )
Instead of evaluating this matrix numerically, is there a way to express the matrix in the general form I have displayed above? Would a simple
arr2inv = inv(np.matrix(arr2))
return the inverse of the general form I want or produce an error?
So to summarise:
How do I define phi to remove the complex-non-int error?
How do I get python to produce the general form shown above?
Will it invert without an error?
The simplest and fastest way to do that numerically is:
i=np.arange(20)
phi = (1 + np.sqrt(5))/2 # for example :)
a_b = np.abs(np.subtract.outer(i,i)) # make all the differences in a matrix
arr2 = np.exp(1j*phi*a_b) # compute the exp
np.fill_diagonal(arr2, 1-1j/np.sqrt(3)) # set the diag
For a symbolic approach, Python has the sympy module, which can manage that;
it will probably struggle to simplify inv(arr2) at rank 20, but numpy is good for quickly checking a mathematical assumption.
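To illustrate that symbolic route (a sketch of mine; a 4x4 case, since a symbolic 20x20 inverse is impractical):

import sympy as sp

phi = sp.symbols('phi', real=True)
n = 4   # keep it small; symbolic inversion of a 20x20 matrix is hopeless in practice
M = sp.Matrix(n, n, lambda a, b:
              1 - sp.I / sp.sqrt(3) if a == b else sp.exp(sp.I * phi * abs(a - b)))
M_inv = M.inv()        # entries are closed-form expressions in phi
print(M_inv[0, 0])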
Numpy and related libraries only handle numerical computations. If you need the matrix output in the form you describe, consider using Mathematica or equivalent symbolic-manipulation software.

What can be a newbie explanation for a dynamic programming?

I am trying to learn the basics of Dynamic Programming (DP), and went through some of the online resources I could find, such as...
What is dynamic programming?
Good examples, articles, books for understanding dynamic programming
Tutorial for Dynamic Programming
Dynamic Programming – From Novice to Advanced -- (I can't understand it properly (i.e. how to approach a problem using DP))
and till now I have come to understand that
a dynamic programming problem is almost the same as recursion, with just one difference (which gives it the power it is known for):
storing the value or solution we got and using it again to find the next solution
For Example:
According to an explanation from codechef
Problem : Minimum Steps to One
Problem Statement: On a positive integer, you can perform any one of the following 3 steps.
Subtract 1 from it. ( n = n - 1 )
If it's divisible by 2, divide by 2. ( if n % 2 == 0, then n = n / 2 )
If it's divisible by 3, divide by 3. ( if n % 3 == 0, then n = n / 3 )
Now the question is, given a positive integer n, find the minimum number of steps that takes n to 1
eg:
For n = 1 , output: 0
For n = 4 , output: 2 ( 4 /2 = 2 /2 = 1 )
For n = 7 , output: 3 ( 7 -1 = 6 /3 = 2 /2 = 1 )
int memo[n+1]; // we will initialize the elements to -1 ( -1 means: not solved yet )
Top-Down Approach for the above problem
int getMinSteps ( int n ) {
    if ( n == 1 ) return 0;               // base case
    if ( memo[n] != -1 ) return memo[n];  // already solved, reuse it
    int r = 1 + getMinSteps( n - 1 );
    if ( n % 2 == 0 ) r = min( r, 1 + getMinSteps( n / 2 ) );
    if ( n % 3 == 0 ) r = min( r, 1 + getMinSteps( n / 3 ) );
    memo[n] = r;  // save the result; if you forget this step, it's the same as plain recursion
    return r;
}
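The same recurrence can also be filled in bottom-up, avoiding recursion entirely (a Python sketch of mine, not from the codechef explanation):

def min_steps(n):
    # memo[i] = minimum number of steps to reduce i to 1
    memo = [0] * (n + 1)
    for i in range(2, n + 1):
        best = memo[i - 1] + 1                   # step: subtract 1
        if i % 2 == 0:
            best = min(best, memo[i // 2] + 1)   # step: divide by 2
        if i % 3 == 0:
            best = min(best, memo[i // 3] + 1)   # step: divide by 3
        memo[i] = best
    return memo[n]

print(min_steps(7))   # 3, matching the example above (7 -1-> 6 /3-> 2 /2-> 1)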
Am I correct in my understanding of DP, or can anyone explain it in a better and easier way, so that I can learn it and approach a problem with dynamic programming?
The Fibonacci sequence from Wikipedia gives a good example.
Dynamic programming is an optimization technique that transforms a potentially exponential recursive solution into a polynomial time solution assuming the problem satisfies the principle of optimality. Basically meaning you can build an optimal solution from optimal sub-problems.
Another important characteristic of problems that are tractable with dynamic programming is that they are overlapping. If those problems are broken down into sub-problems that are repetitive, the same solution can be reused for solving those sub problems.
For a problem with the optimal-substructure property and overlapping subproblems, dynamic programming is a potentially efficient way to solve it.
In the example you can see that recursive version of the Fibonacci numbers would grow in a tree like structure, suggesting an exponential explosion.
function fib(n)
    if n <= 1 return n
    return fib(n − 1) + fib(n − 2)
So for fib(5) you get:
fib(5)
fib(4) + fib(3)
(fib(3) + fib(2)) + (fib(2) + fib(1))
And so on in a tree like fashion.
Dynamic programming lets us build the solution incrementally using optimal sub-problems in polynomial time. This is usually done with some form of record keeping such as a table.
Note that there are repeating instances of sub problems, i.e. calculating fib(2) one time is enough.
Also from Wikipedia, a dynamic programming solution
function fib(n)
    if n = 0
        return 0
    else
        var previousFib := 0, currentFib := 1
        repeat n − 1 times   // loop is skipped if n = 1
            var newFib := previousFib + currentFib
            previousFib := currentFib
            currentFib := newFib
        return currentFib
Here the solution is built up from previousFib and currentFib which are set initially. The newFib is calculated from the previous steps in this loop. previousFib and currentFib represent our record keeping for previous sub-problems.
The result is a polynomial time solution (O(n) in this case) for a problem whose recursive formulation would have been exponential (O(2^n) in this case).
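In Python, the top-down (memoized) flavour of the same idea is almost free to write (a sketch of mine, not from the Wikipedia article):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # each fib(k) is computed once, then served from the cache
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))   # 12586269025, instantly; the uncached version would take minutes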
There is a wonderful answer to the question "How should I explain dynamic programming to a 4-year-old?"
Quoting it here:
Writes down "1+1+1+1+1+1+1+1 =" on a sheet of paper
"What's that equal to?"
counting "Eight!"
writes down another "1+" on the left
"What about that?"
quickly "Nine!"
"How'd you know it was nine so fast?"
"You just added one more"
"So you didn't need to recount because you remembered there were
eight! Dynamic Programming is just a fancy way to say 'remembering
stuff to save time later'"

Statistical Analysis Error? python 3 proof read please

The code below generates two random integers within range specified by argv, tests if the integers match and starts again. At the end it prints some stats about the process.
I've noticed, though, that increasing the value of argv sharply reduces the percentage of tested possibilities.
This seems counterintuitive to me, so my question is: is this an error in the code, or are the numbers real, and if so, what am I not thinking about?
#!/usr/bin/python3
import sys
import random
x = int(sys.argv[1])
a = random.randint(0,x)
b = random.randint(0,x)
steps = 1
combos = x**2   # note: randint(0, x) is inclusive, so strictly (x+1)**2 combinations; x**2 is close for large x
while a != b:
    a = random.randint(0,x)
    b = random.randint(0,x)
    steps += 1
percent = (steps / combos) * 100
print()
print()
print('[{} ! {}]'.format(a,b), end=' ')
print('equality!'.upper())
print('steps'.upper(), steps)
print('possible combinations = {}'.format(combos))
print('explored {}% of possibilities'.format(percent))
Thanks
EDIT
For example:
./runscrypt.py 100000
will returm me something like:
[65697 ! 65697] EQUALITY!
STEPS 115867
possible combinations = 10000000000
explored 0.00115867% of possibilities
"explored 0.00115867% of possibilities" <-- This number is too low?
This experiment really describes a geometric distribution.
I.e.
Let Y be the random variable counting the number of iterations before a match is seen. Then Y is geometrically distributed with parameter p = 1/x, the probability of generating two matching integers (strictly 1/(x+1), since randint(0, x) is inclusive, but 1/x is close enough for large x).
The expected value is E[Y] = 1/p, a standard property of the geometric distribution. So in your case the expected number of iterations is 1/(1/x) = x.
The number of combinations is x^2.
So the expected percentage of explored possibilities is really x/(x^2) = 1/x.
As x approaches infinity, this number approaches 0.
In the case of x=100000, the expected percentage of explored possibilities = 1/100000 = 0.001% which is very close to your numerical result.
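A quick simulation bears this out (a sketch; the values of x and the trial count are arbitrary):

import random

x = 1000
trials = 2000
total_steps = 0
for _ in range(trials):
    steps = 1
    while random.randint(0, x) != random.randint(0, x):
        steps += 1
    total_steps += steps

mean_steps = total_steps / trials
print("mean steps:", mean_steps)                  # ≈ x + 1, i.e. 1/p
print("explored:", 100 * mean_steps / x**2, "%")  # ≈ 100/x = 0.1%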
