One-sample, one-sided hypothesis test with t-statistic - statistics

I am new to R, and for my class I am trying to write the code for a one-sample, one-sided t-test with the alternative hypothesis B < 0. As you can probably tell, weight is the independent variable I want to test.
I wrote:
t.test(Data$weight, alternative = "less", mu = 0, paired = FALSE, var.equal = FALSE)
Because of some mistake I made, I am getting a p-value of 1 when it should certainly be 0.1.
I know I can find the p-value from the two-sided test, but I want to be able to do it this way.
Would you please point out my mistake or suggest a fix or a better option?
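Not knowing the data, one common way to end up with p = 1 here is that the sample mean of weight is positive: with alternative = "less", the p-value is P(T <= t), which is close to 1 whenever t is large and positive. A minimal sketch of that behavior in Python with SciPy (ttest_1samp mirrors R's t.test for this case; the data below are made up purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
positive_mean_sample = rng.normal(loc=2.0, scale=1.0, size=30)   # sample mean > 0
negative_mean_sample = rng.normal(loc=-0.5, scale=1.0, size=30)  # sample mean < 0

for sample in (positive_mean_sample, negative_mean_sample):
    # One-sided test of H0: mu = 0 against H1: mu < 0
    t_stat, p_less = stats.ttest_1samp(sample, popmean=0.0, alternative="less")
    print(f"mean = {sample.mean():+.2f}, t = {t_stat:+.2f}, one-sided p (less) = {p_less:.3f}")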

Related

Fitting exponential function in GNUPLOT

I have a problem fitting the exponential function
f(x) = A*exp(-b*x)*sin(2*pi*x/T + phi) + S
to my data. The fit kept coming out as a straight line; then I tried giving some initial values for A, b, T, phi, and S, and it got closer to the data, but the fit is still poor.
Multidimensional fitting is very non-trivial, and algorithms often fail on this kind of problem. Try to help the algorithm by giving a better initial guess. You can also try to fit the variables one by one, e.g., the average S first, then the period length T, then those two together, etc.
Please also show how you tried to fit the function and which version of Gnuplot you used. If the third column consists of 0s and you provided it as error values for fit in Gnuplot v4, the fit fails completely.
On this given set of data, the fit fails with a bad initial guess, but a better guess can succeed:
f(x)=A*exp(-b*x)*sin(2.*pi*x/T+phi)+S
A = 40.
b = 1/500.
T = 400.
phi = 1.
S = 170.
f_bad_guess(x) = 40. * exp(-x/500.) * sin(2.*pi*x/150+3.) + 170.
f_good_guess(x) = 40. * exp(-x/500.) * sin(2.*pi*x/400+1.) + 170.
fit f(x) "data.txt" via A,b,T,phi,S
p "data.txt" t "data", f(x) t "fitted function", f_good_guess(x) t "good initial guess set manually", f_bad_guess(x) t "bad initial guess set manually"
Non-linear regression is an iterative calculation that starts from "guessed" initial values of the parameters. Especially when the model involves sinusoidal functions, the key point is to start with guesses close enough to the correct values, which are not known.
Your difficulty is probably in guessing good enough values (or the software's difficulty in trying good enough initial values on its own).
A non-conventional method which is not iterative and which doesn't need initial values is explained in this paper: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The application of this method to the present case is shown below:
If more accuracy is wanted, one has to run a non-linear regression (with available software). Using the numerical values of the parameters found above as initial values increases the chances of good convergence.
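For what it's worth, here is a rough sketch of the same fit in Python with scipy.optimize.curve_fit; the file name data.txt and the starting values are carried over from the Gnuplot example above as assumptions:

import numpy as np
from scipy.optimize import curve_fit

# Model: f(x) = A*exp(-b*x)*sin(2*pi*x/T + phi) + S
def f(x, A, b, T, phi, S):
    return A * np.exp(-b * x) * np.sin(2.0 * np.pi * x / T + phi) + S

# Assumed two-column file; extra columns are ignored.
x, y = np.loadtxt("data.txt", usecols=(0, 1), unpack=True)

# The initial guess matters: this is the "good guess" parameter set from above.
p0 = [40.0, 1.0 / 500.0, 400.0, 1.0, 170.0]
params, cov = curve_fit(f, x, y, p0=p0)
print("fitted A, b, T, phi, S:", params)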

Automatic model selection

I am writing a machine learning "master algorithm" from scratch, where the user just inputs the training and testing data. I was wondering: is there a way to automatically decide which kind of algorithm should be used, regression or classification?
For example (assuming the last column is always the output and is always a number), we could search through the last column and decide which kind of model is needed by checking whether its values are discrete class labels or continuous values.
How would one go about this?
And if not this method, is there a better one?
It needs to be in Python 3.
Thank you.
Type of target using sk-learn
We can use the type_of_target() function to get the type of the target (continuous, or a class label) and hence figure out what type of problem it is.
This answer provided by #Vivekkumar works as I needed it to.
Thank you!
It may not always work; it can still be fooled, so it needs to be improved. But in the best case we could run an ML algorithm on this and use it as a model to make the decision better.
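A minimal sketch of that check (the toy arrays are made up; in scikit-learn the function lives in sklearn.utils.multiclass):

import numpy as np
from sklearn.utils.multiclass import type_of_target

# Two toy datasets: the last column holds class labels vs. continuous values.
data_classification = np.array([[1.0, 2.0, 0], [2.0, 1.0, 1], [3.0, 0.5, 1]])
data_regression = np.array([[1.0, 2.0, 0.7], [2.0, 1.0, 1.3], [3.0, 0.5, 2.9]])

for data in (data_classification, data_regression):
    y = data[:, -1]                     # assume the last column is the target
    target_type = type_of_target(y)     # e.g. 'binary', 'multiclass', 'continuous'
    task = "regression" if target_type.startswith("continuous") else "classification"
    print(target_type, "->", task)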
The following bit of code should help determine if the value in the last column can be converted to a floating point number:
s_1 = 'text'
s_2 = '1.23'
for s in [s_1, s_2]:
    try:
        f = float(s)
        print(s, 'conversion to float was OK, value:', f)
    except ValueError:
        print(s, 'could not be converted to a number')

Hypothesis Testing in Kolmogorov-Smirnov Test - Critical value or p-value?

I am new to statistics.
I am trying out a one-sample Kolmogorov-Smirnov test. I was able to compute D max, but I am confused about how to proceed with the hypothesis test.
In order to reach a decision, should I use:
the critical value from a table
(reject if the test statistic D is greater than the critical value obtained from the table),
or
the p-value of the KS statistic?
Which one is better? I have read that the p-value is better.
In this reference they say:
"kstest decides to reject the null hypothesis by comparing the p-value p with the significance level Alpha, not by comparing the test statistic ksstat with the critical value cv. Since cv is approximate, comparing ksstat with cv occasionally leads to a different conclusion than comparing p with Alpha."
But I couldn't find any equation relating the two.
Reference
In the above reference, I suspect they are taking D max as the p-value.
Please advise.
Quick answer: using the critical value and the p-value will always give the same conclusion.
Details:
The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one you computed.
Alpha is the pre-determined acceptable false positive rate. You decide this. It will affect your decision to reject or not, but will not change the numerical values of the p-value or test statistic.
Whether you decide to reject or not based on the critical value or the p-value, you should and will always get the same result. If the critical value is approximate, then so is the p-value.
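For concreteness, a small sketch of the p-value route with SciPy (the sample is synthetic; kstest returns both the D statistic and the p-value, so the decision is just a comparison with alpha):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=50)   # synthetic data

# One-sample KS test of the sample against the standard normal CDF.
d_stat, p_value = stats.kstest(sample, "norm")

alpha = 0.05
print("D =", round(d_stat, 3), " p-value =", round(p_value, 3))
print("reject H0" if p_value < alpha else "fail to reject H0")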

Quadratic Programming and quasi newton method BFGS

Yesterday, I posted a question about the general concept of an SVM primal form implementation:
Support Vector Machine Primal Form Implementation
and "lejlot" helped me understand that what I am solving is a QP problem.
But I still don't understand how my objective function can be expressed as a QP problem
(http://en.wikipedia.org/wiki/Support_vector_machine#Primal_form).
Also, I don't understand how QP and the quasi-Newton method are related.
All I know is that the quasi-Newton method will SOLVE my QP problem, which is supposedly formulated from my objective function (and I don't see the connection).
Can anyone walk me through this, please?
For SVM's, the goal is to find a classifier. This problem can be expressed in terms of a function that you are trying to minimize.
Let's first consider the Newton iteration. Newton iteration is a numerical method for finding a solution to a problem of the form F(x) = 0.
Instead of solving it analytically, we can solve it numerically with the following iteration:
x^(k+1) = x^k - DF(x^k)^-1 * F(x^k)
Here x^(k+1) is the (k+1)-th iterate, DF(x^k)^-1 is the inverse of the Jacobian of F evaluated at x^k, and x^k is the k-th iterate.
This update runs as long as we make progress in terms of step size (delta x), or until the function value is close enough to 0. The termination criterion can be chosen accordingly.
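A tiny sketch of that iteration for a one-dimensional F (the function and starting point below are arbitrary examples):

# Newton iteration x_{k+1} = x_k - F(x_k)/F'(x_k) for F(x) = 0, stopping
# when the step or the residual is small enough.
def newton_root(F, dF, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = F(x) / dF(x)
        x = x - step
        if abs(step) < tol or abs(F(x)) < tol:
            break
    return x

# Example: solve x^2 - 2 = 0, i.e. compute sqrt(2).
print(newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0))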
Now consider solving the problem F'(x) = 0, i.e., finding a stationary point of F. If we formulate the Newton iteration for that, we get
x^(k+1) = x^k - HF(x^k)^-1 * DF(x^k)
where HF(x^k)^-1 is the inverse of the Hessian matrix and DF(x^k) is the gradient of the function F, both evaluated at x^k. Note that we are working in n dimensions and cannot just take a quotient; we have to take the inverse of the matrix.
Now we face some problems: in each step, we have to calculate the Hessian matrix for the updated x, which is very expensive. We also have to solve a system of linear equations, namely y = HF(x)^-1 * DF(x), i.e., HF(x)*y = DF(x).
So instead of computing the Hessian in every iteration, we start with an initial guess of the Hessian approximation (for example, the identity matrix) and perform low-rank updates after each iteration (BFGS uses a rank-two update). For the exact formulas, have a look here.
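For example, SciPy exposes BFGS through scipy.optimize.minimize; a small sketch on a stand-in quadratic objective (not the SVM problem itself):

import numpy as np
from scipy.optimize import minimize

# Convex quadratic 0.5*x'Qx - b'x; its minimizer solves Q x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def objective(x):
    return 0.5 * x @ Q @ x - b @ x

def gradient(x):
    return Q @ x - b

# BFGS maintains an approximation of the (inverse) Hessian and updates it
# after every step instead of recomputing second derivatives.
result = minimize(objective, x0=np.zeros(2), jac=gradient, method="BFGS")
print("BFGS minimizer :", result.x)
print("direct solution:", np.linalg.solve(Q, b))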
So how does this link to SVM's?
When you look at the function you are trying to minimize, you can formulate a primal problem, which you can then reformulate as a dual Lagrangian problem; this problem is convex and can be solved numerically. It is all well documented in the article, so I will not try to reproduce the formulas here in lower quality.
But the idea is the following: if you have a dual problem, you can solve it numerically. There are multiple solvers available. In the link you posted, they recommend coordinate descent, which solves the optimization problem for one coordinate at a time. Or you can use subgradient descent. Another option is L-BFGS; it is explained really well in this paper.
Another popular algorithm for solving problems like this is ADMM (alternating direction method of multipliers). In order to use ADMM, you have to reformulate the given problem into an equivalent problem that gives the same solution but has the format ADMM expects. For that, I suggest reading Boyd's notes on ADMM.
In general: first understand the function you are trying to minimize, and then choose the numerical method that is best suited to it. In this case, subgradient descent and coordinate descent are the most suitable, as stated in the Wikipedia link.
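As a very rough sketch of the subgradient-descent option on the primal objective 0.5*||w||^2 + C*sum(max(0, 1 - y_i*(w.x_i + b))), with made-up toy data and hyperparameters:

import numpy as np

# Toy linearly separable data: two Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

C, lr, n_steps = 1.0, 0.01, 500
w, b = np.zeros(2), 0.0
for _ in range(n_steps):
    margins = y * (X @ w + b)
    violators = margins < 1.0          # points inside the margin or misclassified
    # Subgradient of 0.5*||w||^2 + C*sum(hinge): each violator contributes -y_i*x_i.
    grad_w = w - C * (y[violators][:, None] * X[violators]).sum(axis=0)
    grad_b = -C * y[violators].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))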

How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application). It doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately, you can't do machine learning without knowing some basic math; it's like asking somebody for help with programming but not wanting to know about "variables", "subroutines", and all that if-then stuff.
The better way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It's pretty much like the usual thinking, except you can put in the prior distribution.
Fancy words, but you may ask: well, where did the h/(h+t) formula come from? Of course it's "obvious", but it turns out that it is the answer you get when you have no prior. The method below is the next level of sophistication up, where you add a prior. Going to full Bayesian integration would be the next level after that, but it's harder and perhaps unnecessary here.
As I understand it, the problem is twofold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head on a theta fraction of the flips. The theta for this coin comes from the master distribution, which I will assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = h*log(theta) + t*log(1-theta) - (theta-P)^2/(2*S^2) - (1/2)*log(2*Pi*S^2)
Using calculus to look for the extremum, you find the value of theta such that dLL/dtheta = 0.
Since the last term, the one with the log, has no theta in it, you can ignore it.
dLL/dtheta = h/theta - t/(1-theta) + (P-theta)/S^2 = 0
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, starting from the 'obvious' (maximum likelihood) estimate theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the derivation above but don't put in the Gaussian prior, you get h/theta - t/(1-theta) = 0, which gives theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it puts some weight on negative probabilities, which is clearly wrong. More appropriate is a Gaussian prior on log theta (a 'lognormal distribution'). Plug it in the same way and work through the calculus.
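A small numerical sketch of that MAP step (solving dLL/dtheta = 0 on (0, 1) with a bracketing root finder; P, S, h, and t below are made-up example values):

from scipy.optimize import brentq

# MAP estimate of theta with a Gaussian prior on theta:
# dLL/dtheta = h/theta - t/(1 - theta) + (P - theta)/S^2 = 0 on (0, 1).
def map_theta(h, t, P, S):
    def dll(theta):
        return h / theta - t / (1.0 - theta) + (P - theta) / S**2
    eps = 1e-9
    return brentq(dll, eps, 1.0 - eps)

h, t = 3, 1                                  # 3 heads, 1 tail
print("ML estimate :", h / (h + t))          # 0.75, ignores the prior
print("MAP estimate:", round(map_theta(h, t, P=0.5, S=0.1), 4))   # pulled toward P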
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, this just becomes h/n. When h and n are small, it is approximately c*p/c = p. The choice of c is up to you; you can base it on s, but in the end you have to decide how small is too small.
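In Python, that estimate is a one-liner (the numbers in the example calls are arbitrary):

# Pseudocount-smoothed estimate: (h + c*p) / (n + c).
def smoothed_estimate(h, n, p, c):
    return (h + c * p) / (n + c)

print(smoothed_estimate(h=3, n=4, p=0.5, c=10))      # small n: close to the prior p
print(smoothed_estimate(h=300, n=400, p=0.5, c=10))  # large n: close to h/n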
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example, one coin always lands heads and the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are only slightly weighted, then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tails? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of two possible weightings? Do they follow a bell curve? Etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number of coins in your box will all determine what n has to be for you to solve this with high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials, and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5, and all the standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses you need will, again, be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get strong confidence (only about 2/3) in detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. The Wolfram Demonstrations Project has an illustration of estimating the probability of a coin landing heads, given a sample of tosses.
Well, I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to be worth putting a little thought into. Others above have already suggested this, but perhaps, if you're like me, you would prefer more verbosity.
In this lingo, you have a set of mutually exclusive hypotheses, H, and some data D, and you want to find the (posterior) probability that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis with the largest posterior probability (the MAP, as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from pure maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR information about which hypotheses are most likely, and you want to incorporate these priors.
So from basic probability you have Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete hypotheses Hi to test, e.g. [0.0, 0.05, 0.1, ..., 0.95, 1.0], and then determining your prior Pr(Hi) for each one. Above it is assumed you have a normal distribution of priors; if that is acceptable, you could use the mean and stdev to get each Pr(Hi), or use another distribution if you prefer. With coin tosses, Pr(D|Hi) is of course determined by the binomial distribution, using the observed number of successes in n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting, but we assume that we have covered all the bases with our hypotheses, so Pr(D) is the summation of Pr(D|Hi)*Pr(Hi) over all Hi.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.
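A short sketch of that grid computation in Python (Gaussian prior with mean P and standard deviation S, binomial likelihood; the numbers are placeholders):

import numpy as np
from scipy import stats

P, S = 0.5, 0.1      # prior mean and standard deviation of the coins' bias
h, n = 3, 4          # observed heads and total tosses

thetas = np.linspace(0.01, 0.99, 99)                # the discrete hypotheses Hi
prior = stats.norm.pdf(thetas, loc=P, scale=S)      # Pr(Hi), up to normalization
likelihood = stats.binom.pmf(h, n, thetas)          # Pr(D | Hi)
posterior = prior * likelihood
posterior /= posterior.sum()                        # divide by Pr(D) = sum_i Pr(D|Hi)*Pr(Hi)

print("posterior mean   :", float(np.sum(thetas * posterior)))
print("MAP (grid argmax):", float(thetas[np.argmax(posterior)]))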
