probability of linear trend - statistics

I have a small sample:
[10 16 11 16 26 17 16 16 15 13 15 14 12 12 14 20 14 12 16 21 13 13 14 16
17 18 16 14 16 23 24 12 13 13 15 16 15 14 14 16 20 17 17 15 23 18 12 19
12 11 19 17 14 18 15 23 30 24 16 14 22 17 17 17 17 20 19 27 17 36]
There are two models:
Model A – there is no linear trend, so the noise is centered on the mean of the data.
Model B – there is a linear trend, so the noise is centered on a fitted linear trendline (the residuals are the distances from the line).
Obviously, I can choose the model with the smaller sigma^2 as the better model, which is apparently (B). However, I am not confident there really is a trend in the data, rather than the noise just randomly falling this way. So I ran a Dickey-Fuller test on both models, and both are below the 1% critical value ('1%': -3.529, A: -5.282, B: -6.149), which tells me it is possible that (A) is the right model.
So I come to the question: what is the probability that (A) is the better model?
I tried to solve it like this:
I assume the noise is normally distributed, so I fit a normal distribution to the residuals of (A) and (B) separately. That gives me two models for the noise.
Then I drew n samples (n = the original sample length) from each of the two models and compared their sigma^2. If model (A)'s sigma^2 was smaller, I counted that as evidence for (A); if not, against it. I repeated this test a reasonable number of times.
In Python code it is probably clearer:

import numpy as np
from scipy import stats

model_b_mu, model_b_sigma = stats.norm.fit(model_b['residual'])
model_a_mu, model_a_sigma = stats.norm.fit(model_a['residual'])

def compare_models(modela_mu, modela_sigma, modelb_mu, modelb_sigma, length):
    repeat = 20000
    modela_better = 0
    for i in range(repeat):
        modela = np.random.normal(modela_mu, modela_sigma, size=length)
        modelb = np.random.normal(modelb_mu, modelb_sigma, size=length)
        # compare the sums of squared residuals (proportional to sigma^2)
        sigma_a = np.sum(np.power(modela, 2))
        sigma_b = np.sum(np.power(modelb, 2))
        if sigma_a < sigma_b:
            modela_better += 1
    return modela_better / repeat

model_a_better = compare_models(model_a_mu, model_a_sigma,
                                model_b_mu, model_b_sigma, len(model_a))
print(model_a_better)
Which gave me 0.3152. I interpreted this result as: if the noise is normally distributed, there is a 31.52% probability that model (A) is better.
My question is: am I thinking about this the right way? If not, why, and how should I solve the problem?
PS: I am not a statistician, more of a programmer, so it is quite possible the whole solution above is wrong. Hence I am asking for confirmation.

This is a so-called model selection problem. There isn't a single right answer, although the most nearly correct way to go about it is via Bayesian inference. That is, to compute the posterior probability p(model | data) for each of the models under consideration (two or more). Note that the result of Bayesian inference is a probability distribution over models, not a single "this model is correct" selection; any subsequent result which depends on a model is to be averaged over the distribution over models. Note also that Bayesian inference requires a prior over the models, that is, it's required that you specify a probability for each model a priori, in the absence of data. This is a feature, not a bug.
Glancing at the problem as stated, it would probably be straightforward to work out the posterior probability for the two models you mention, but first you'll need to get somewhat familiar with the conceptual framework. A web search for Bayesian model inference should turn up a lot of resources. Also this question is more suitable for stats.stackexchange.com.
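As a rough concrete sketch of that idea (not the full Bayesian treatment; BIC weights are only a large-sample approximation with equal priors, and all variable names here are illustrative, using the sample from the question):

```python
import numpy as np

y = np.array([10, 16, 11, 16, 26, 17, 16, 16, 15, 13, 15, 14, 12, 12, 14, 20,
              14, 12, 16, 21, 13, 13, 14, 16, 17, 18, 16, 14, 16, 23, 24, 12,
              13, 13, 15, 16, 15, 14, 14, 16, 20, 17, 17, 15, 23, 18, 12, 19,
              12, 11, 19, 17, 14, 18, 15, 23, 30, 24, 16, 14, 22, 17, 17, 17,
              17, 20, 19, 27, 17, 36])
t = np.arange(len(y))
n = len(y)

def bic(residuals, k):
    # Gaussian log-likelihood at the MLE variance, penalized by k parameters
    rss = np.sum(residuals**2)
    return n * np.log(rss / n) + k * np.log(n)

res_a = y - y.mean()                   # model A: constant mean (k=2: mean, sigma)
slope, intercept = np.polyfit(t, y, 1)
res_b = y - (slope * t + intercept)    # model B: linear trend (k=3)

bic_a, bic_b = bic(res_a, 2), bic(res_b, 3)
# turn BIC differences into approximate posterior probabilities (equal priors)
w = np.exp(-0.5 * (np.array([bic_a, bic_b]) - min(bic_a, bic_b)))
p_a, p_b = w / w.sum()
print(f"P(A|data) ~ {p_a:.3f}, P(B|data) ~ {p_b:.3f}")
```

Unlike the simulation in the question, this compares the models on the observed data directly, with the extra trend parameter explicitly penalized.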

Related

How to use Excel Solver for piecewise linear fit?

I am trying to use Excel Solver to get fits for a piecewise linear function (here, a three line fit). The Solver explanation here is helpful for a single linear case, but I am not sure how to set the model up "smartly" so that it re-calculates the hinge-points (i.e., x-values of line intersections will change with the input data). I've never used Solver before.
x y
1 0.1552
2 0.1877
3 0.2016
4 0.2094
5 0.2142
6 0.2176
7 0.2201
8 0.2220
9 0.2235
10 0.2247
11 0.2256
12 0.2265
13 0.2272
14 0.2278
15 0.2283
16 0.2288
17 0.2292
18 0.2296
19 0.2299
20 0.2302
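For comparison with the Solver setup, the same three-line fit can be sketched in Python with scipy's curve_fit, treating the two hinge x-values as free parameters alongside the slopes. The continuous piecewise parameterization and the starting guesses are my own:

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(1, 21, dtype=float)
y = np.array([0.1552, 0.1877, 0.2016, 0.2094, 0.2142, 0.2176, 0.2201,
              0.2220, 0.2235, 0.2247, 0.2256, 0.2265, 0.2272, 0.2278,
              0.2283, 0.2288, 0.2292, 0.2296, 0.2299, 0.2302])

def three_lines(x, a, s1, s2, s3, b1, b2):
    # Continuous three-segment piecewise linear model with hinges at b1 < b2
    return (a + s1 * np.minimum(x, b1)
              + s2 * np.clip(x - b1, 0, b2 - b1)
              + s3 * np.maximum(x - b2, 0))

p0 = [0.132, 0.023, 0.004, 0.0006, 3.0, 8.0]   # rough guesses read off the data
params, _ = curve_fit(three_lines, x, y, p0=p0, maxfev=20000)
rss_piecewise = np.sum((y - three_lines(x, *params))**2)
rss_line = np.sum((y - np.polyval(np.polyfit(x, y, 1), x))**2)
print("hinges at x =", params[4], params[5])
print("RSS piecewise:", rss_piecewise, " RSS single line:", rss_line)
```

The same trick carries over to Solver: make the two hinge x-values changing cells along with the intercept and slopes, and minimize the sum of squared residuals of this continuous model.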

Get Poisson expectation of preceding values of a time series in Python

I have some time series data (in a Pandas dataframe), d(t):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
d(t-1) NaN 5 3 17 6 ... 23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in range(len(d)):
    E_d[idx] = np.multiply(d[idx::-1], pmf(np.arange(idx + 1), i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.
You can use scipy.stats.poisson to get PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information, see the scipy.stats.poisson documentation.
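Building on that, a sketch of the expectation the question asks for: weight lagged copies of the series by the Poisson PMF over the lag, truncated and renormalized at some maximum lag. The function name, the truncation, and the toy series are my own; the real data would be the dataframe column:

```python
import numpy as np
import pandas as pd
from scipy.stats import poisson

def poisson_shift(s, i, max_lag):
    """Approximate E[d(t-j)] with j ~ Poisson(i), truncating the sum at max_lag."""
    lags = np.arange(max_lag + 1)
    w = poisson.pmf(lags, i)
    w = w / w.sum()                    # renormalize after truncating the tail
    # weighted sum of lagged copies of the series
    return sum(wj * s.shift(int(j)) for j, wj in zip(lags, w))

d = pd.Series([5, 3, 17, 6, 23, 78], index=range(1, 7))
print(poisson_shift(d, 1, max_lag=3))
```

Entries with fewer than max_lag preceding values come out NaN, matching the NaN at the start of a plain shift. Since i enters only through poisson.pmf, this form can be dropped straight into an optimizer over i.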

Difference between consecutive maxima and minima in a .csv dataset

I have a dataset which represents tracking data of a mouse's paw moving up and down in the y-axis as it reaches up for and pulls down on a piece of string.
The output of the data is a list of y-coordinates corresponding to a fraction of a second. For example:
1 333.9929833
2 345.4504726
3 355.7046572
4 367.6136684
5 379.7906121
6 390.5470788
7 397.9017118
8 403.677123
9 412.1550843
10 416.516814
11 419.8205706
12 423.7994881
13 429.4874275
14 419.2652898
15 360.1626136
16 298.8212249
17 264.3647809
18 265.0078862
19 268.1828407
20 283.101321
21 294.8219163
22 308.4875135
In this series there is a maximum of 429... and a minimum of 264... However, as you can see from an example image (excuse the gaps), there are multiple consecutive wave-like maxima and minima.
The goal is to find the difference between each maximum and the following minimum, and each minimum and the following maximum (i.e. max1-min1, min2-max1, max2-min2, ...). Ideally this would also give the timepoints of each max and min (e.g. 13 and 17 for the provided dataset); there is a column of integer labels (1, 2, 3, ...) corresponding to each coordinate.
Thanks for your help!
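One common approach is scipy.signal.find_peaks, treating minima as peaks of the negated signal. A sketch only, on the excerpt from the question rounded to 2 decimals; the prominence threshold is an assumption that would need tuning on the real, longer trace:

```python
import numpy as np
from scipy.signal import find_peaks

# y-coordinates from the question, rounded to 2 decimals
y = np.array([333.99, 345.45, 355.70, 367.61, 379.79, 390.55, 397.90,
              403.68, 412.16, 416.52, 419.82, 423.80, 429.49, 419.27,
              360.16, 298.82, 264.36, 265.01, 268.18, 283.10, 294.82, 308.49])
t = np.arange(1, len(y) + 1)        # the integer timepoint column

maxima, _ = find_peaks(y, prominence=10)    # indices of local maxima
minima, _ = find_peaks(-y, prominence=10)   # local minima = peaks of -y
extrema = sorted(np.concatenate([maxima, minima]))

# difference between each extremum and the next one, with timepoints
for a, b in zip(extrema, extrema[1:]):
    print(f"t={t[a]} -> t={t[b]}: delta y = {y[b] - y[a]:.2f}")
print("maxima at t =", t[maxima], "minima at t =", t[minima])
```

On this excerpt it reports the max at t=13 and the min at t=17, matching the timepoints named in the question.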

How can I read this stem-leaf plot correctly?

As the title says, I used an online data set to make a stem-and-leaf plot, but I don't know how to read it. For example, in the row with Stem 7 . and Leaf 5555, why is Frequency = 18? And what does the line "Each leaf: 4 case(s)" mean?
Every answer is very helpful to me.
Here is an example.
DATA LIST FREE /x1.
BEGIN DATA.
10 22 22 13 14 10 16 17 17 17
END DATA.
EXAMINE VARIABLES=x1 /PLOT STEMLEAF.
x1 Stem-and-Leaf Plot
Frequency Stem & Leaf
4.00 1 . 0034
4.00 1 . 6777
2.00 2 . 22
Stem width: 10.00
Each leaf: 1 case(s)
In these data, the "Stem" is the tens place of each value and the "Leaf" is the ones place. There are four cases in the first line, representing the values 10, 10, 13, and 14 in the data; that's why "Frequency" is 4. There are only 2 in the last line, for the two values of 22 in the original data. "Each leaf: 1 case(s)" means each printed leaf digit stands for exactly one observation; with larger data sets SPSS scales the plot, so "Each leaf: 4 case(s)" means each printed digit stands for roughly four observations, which is how a row with only a few leaves can show a frequency like 18. As the data get larger, stem-and-leaf plots can get a little harder to read, but their other real value is their shape, which gives you an idea of the shape of the distribution. For another view of that shape, ask SPSS to produce a histogram.
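For readers who want to reproduce the idea outside SPSS, a minimal sketch of binning the same data into stems and leaves by tens, as described above (a simplified illustration, not SPSS's exact algorithm):

```python
from collections import defaultdict

data = [10, 22, 22, 13, 14, 10, 16, 17, 17, 17]

rows = defaultdict(list)
for v in sorted(data):
    rows[v // 10].append(v % 10)   # stem = tens digit, leaf = ones digit

for stem, leaves in sorted(rows.items()):
    # frequency, then the stem-and-leaf row
    print(f"{len(leaves):>4}.00  {stem} . {''.join(map(str, leaves))}")
```

Note that SPSS additionally splits each stem into a low (leaves 0-4) and high (leaves 5-9) half, which is why stem 1 appears on two rows in the output shown in the question.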

A problem with connected points and determining geometry figures based on points' location analysis

In school we were given a really hard problem, and none of the students has solved it yet. Take a look at the picture below:
http://d.imagehost.org/0422/mreza.gif
That's a kind of network of connected points which doesn't end, and each point has its own number. The numbers go row by row: 1, then 2-3, then 4-5-6, then 7-8-9-10, and so on. (You can't see 5, 8, or 9 in the picture, but they are there and their positions are obvious: the point between 4 and 6 is 5, and so on.)
1 is connected to 2 and 3; 2 is connected to 1, 3, 4, and 5; etc.
The numbers 1-2-3 represent a triangle in the picture, but 1-4-6 do not, because 4 is not directly connected to 6.
Look at 2-3-4-5: that's a parallelogram (you can see why), but 4-6-7-9 is NOT a parallelogram, because in this problem there is a rule that all sides must be equal for all figures - triangles and parallelograms alike.
There are also hexagons; for example, 4-5-7-9-13-12 is a hexagon - all sides must be equal here too.
1-2-3-4-5 doesn't represent anything, so we ignore it.
I think I explained the problem well. The actual task is: given an input of numbers like the above, determine whether it is a triangle/parallelogram/hexagon (according to the described rules).
For ex:
1 2 3 - triangle
11 13 24 26 -parallelogram
1 2 3 4 5 - nothing
11 23 13 25 - nothing
3 2 5 - triangle
I was reading about computational geometry in order to solve this, but I gave up quickly; nothing seemed to help. A friend told me about this site, so I decided to give it a try.
If you have any ideas about how to solve this, please reply - you can use pseudocode or C++, whatever. Thank you very much.
Let's order the points like this:
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
16 17 18 19 20 21
22 23 24 25 26 27 28
You can store this in a matrix. Now let row[i] = the row number i is on and col[i] = the column number i is on. These can be computed more or less efficiently for each i.
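As a sketch of that computation: row r is the smallest r with r(r+1)/2 >= i, since row r ends at the triangular number r(r+1)/2 (the function name below is illustrative):

```python
import math

def row_col(i):
    """Map point number i (1-based) to its (row, col) in the triangle layout."""
    # smallest r with r*(r+1)/2 >= i, via the quadratic formula
    r = math.ceil((math.sqrt(8 * i + 1) - 1) / 2)
    c = i - r * (r - 1) // 2          # offset past the end of the previous row
    return r, c

print(row_col(1))    # (1, 1)
print(row_col(13))   # (5, 3)
```

The floating-point square root is fine at this scale (a 256x256 matrix); for very large i you'd want math.isqrt instead.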
First, sort your given numbers ascendingly. You will need exactly 3 points for a triangle, 4 for a parallelogram and 6 for a hexagon - anything else and you can dismiss it as no-figure.
Notice that, according to your rules, the triangles always appear right-angled in this matrix representation (even though they are equilateral in the original picture). Label the three points A, B, C. You can check if they form a triangle by iterating from row[A] to row[B], then from col[B] to col[C], and then diagonally from row[C] to row[A], checking that the distances are the same and that you arrive at the right positions. You can terminate early: for example, if B is 8 and A is 1, you can tell you won't find a triangle once you hit 11 in column 1.
For parallelograms a similar reasoning can be made. Label the 4 points A, B, C, D and remember to sort them ascendingly (remember, your points here are actually numbers). See if you can get from col[A] to col[B] on the same line, then from col[C] to col[D] on the same line and then diagonally or vertically-down from row[A] to row[C] and then (in the same direction you went the previous diagonal!) from row[B] to row[D].
Hexagons also have a specific format you must test for. Here is how the hexagon 4-5-7-9-13-12 from the question looks in this representation (its points marked with brackets):

1
2 3
[4] [5] 6
[7] 8 [9] 10
11 [12] [13] 14 15
16 17 18 19 20 21
22 23 24 25 26 27 28
You can notice that every two pairs of points share the same column, and that the horizontal distance between the two middle points is twice the vertical distance between any two points and also twice the horizontal distance between any other two points.
You will also want to consider rotations, so you'll need to do more tests for each case.
You don't even really need the row and col arrays unless you plan on computing them efficiently. Just walk over your matrix until you identify the first point in sorted order and try to get to the others while following each of the rules.
Not exactly a nice way, but you only need a 256x256 matrix for this, so while it does result in quite a lot of code, it's pretty efficient. I hope I made myself clear; if not, please say what isn't clear. Anyway, maybe someone else will post a better solution, so wait a while longer if you can.