Probability calculation with PROC MEANS t-test - statistics

I need to calculate the p-values linked to these t-values, which reflect the probability that the means are equal to 0, using PROC MEANS for each visit. Also, I have only one classification level (i.e. treatment group = 1). Alpha for this test is set to 0.05.
The following code is what I am using, but I am not getting the desired p-value.
PROC MEANS DATA=<input dataset> ALPHA=0.05;
BY _VISNAME;
VAR Total_Score;
OUTPUT OUT=D1<output dataset>
N=
MEAN=
MEDIAN=
STD=
STDERR=
UCLM=
LCLM=
PROBT= / AUTONAME;
RUN;
Please suggest where I am going wrong.

I cannot see your data set, so I can only guess at the question. For starters, I would put your descriptive statistic keywords on the PROC MEANS statement, next to DATA=. This should at least generate some form of result:
proc means data=work.sample N MEAN MEDIAN STD STDERR UCLM LCLM PROBT T;
BY _VISNAME;
VAR Total_Score;
output out=work.means;
run;

Related

Why does my fit for a logarithm function look so wrong?

I'm plotting this dataset and making a logarithmic fit, but for some reason the fit comes out badly wrong. At some point I got a good enough fit, but then I re-plotted and the bad fit came back. At the very beginning of the data there was a point 0.0 0.0076, but I changed it to 0.001 0.0076 to avoid the asymptote.
I'm using this for the fit (not exactly this file for the image above, but I'm now testing with this one and it gives the bad fit as well):
f(x) = a*log(k*x + b)
fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it reports 7 iterations (as in the screenshot above), other times only 1, and when it produced the "correct" fit it did something like 35 iterations and got a = 32, if I remember correctly.
Edit: here is the good one again; the plot I got is this one. And again, I re-plotted and got that weird fit. Curiously, when the data still contains the point 0.0 0.0076 and the good fit is about to be produced, gnuplot says "Undefined value during function evaluation", but that message is not shown when I get the bad one.
Do you know why I keep getting this inconsistency? Thanks for your help.
As I already mentioned in the comments, the method of fitting antiderivatives is much better than fitting derivatives, because numerically computed derivatives are strongly scattered even when the data is only slightly scattered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . The application to the case of y=a.ln(c.x+b) is shown below.
Numerical calculus :
In order to get an even better result (according to some specified fitting criterion) one can use the above parameter values as initial values for an iterative method of nonlinear regression implemented in some convenient software.
NOTE : The integral equation used in the present case is :
NOTE : In the figure above one can compare the result of the method of fitting an integral equation with the result of the method of fitting with derivatives.
Acknowledgements : Alex Sveshnikov did very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximate values of the parameters to be used in nonlinear regression software, both methods are quite equivalent. Nevertheless the method with the integral equation appears preferable in the case of scattered data.
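Since the exact integral equation from the linked document is not reproduced above, here is a minimal Python sketch of one possible integral-equation linearization for y = a*ln(c*x + b), derived independently by integration by parts; the function name and the trapezoid integration are illustrative assumptions, not taken from the document:

import numpy as np

def integral_equation_estimate(x, y):
    # x, y: one-dimensional numpy arrays with x increasing.
    # Cumulative integral S_k of y from x[0] to x[k], via the trapezoid rule.
    S = np.concatenate(([0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))))
    # Integrating y = a*ln(c*x + b) by parts gives, with B = b/c:
    #   S_k - (x_k*y_k - x_0*y_0) = B*(y_k - y_0) - a*(x_k - x_0)
    # which is linear in the two unknowns B and a.
    lhs = S - (x * y - x[0] * y[0])
    M = np.column_stack((y - y[0], -(x - x[0])))
    B, a = np.linalg.lstsq(M, lhs, rcond=None)[0]
    # With B known, y = a*ln(c) + a*ln(x + B), so ln(c) follows by averaging.
    c = np.exp(np.mean(y / a - np.log(x + B)))
    return a, c, B * c   # a, c, b

The a, c, b returned this way can then be handed to gnuplot's fit as starting values, in the same spirit as the derivative-based estimates discussed below.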
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn using Alex Sveshnikov's result followed by a further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course this is not always so satisfying; here it is due to the low scatter of the data.
In ADDITION, an answer to a question raised in the comments by CosmeticMichu:
The problem here is that the fit algorithm starts with "wrong" approximations for the parameters a, k, and b, so during the minimization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values which are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 35.0415 +/- 2.302 (6.57%)
k = 0.372381 +/- 0.0461 (12.38%)
b = 1.07012 +/- 0.02016 (1.884%)
correlation matrix of the fit parameters:
a k b
a 1.000
k -0.994 1.000
b 0.467 -0.531 1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters. Well, you start with the function you are fitting:
y = a*log(k*x + b)
If you differentiate this equation you get
dy/dx = a*k/(k*x + b)
or
1/(a*k) = (dx/dy)/(k*x + b)
The left-hand side of this equation is some constant 'C', so the expression on the right-hand side should be equal to this constant as well:
dx/dy = C*(k*x + b) = C*k*x + C*b
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and make a linear fit of these data with C*k*x + C*b:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now we need to find the values of a and C. Let's say that we want the resulting graph to pass close to the starting and ending points of our dataset. That means that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find the k and b:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
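To make the back-substitution above concrete, here is a small Python sketch of the same arithmetic. The Ck and Cb values are the ones quoted above, and x1, y1 are the first data point from the question; xn, yn are placeholders, since the last data point is not given in the post:

import math

# From the linear fit of the reciprocal derivatives (values quoted above)
Ck, Cb = 0.0236179, 0.106268

# First data point (from the question) and last data point (placeholder values)
x1, y1 = 0.001, 0.0076
xn, yn = 100.0, 47.0          # hypothetical -- substitute your own last point

# a from subtracting the two endpoint equations
a = (yn - y1) / math.log((Ck*xn + Cb) / (Ck*x1 + Cb))
# C from the first endpoint equation
C = (Ck*x1 + Cb) / math.exp(y1 / a)
# Finally recover k and b
k, b = Ck / C, Cb / C
print(a, C, k, b)

With the actual last data point of R_B.txt this is the calculation that gave a = 47.51, k = 0.226 and b = 1.0016 above.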
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
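# Fit the straight line Ck*x+Cb to the reciprocal derivative dx/dy of the data.
# The assignments inside 'using' keep the previous point (x0,y0) for the finite
# differences and, as a side effect, leave (xn,yn) holding the last data point.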
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1

Very large float in python

I'm trying to construct a neural network for the MNIST database. When computing the softmax function I receive an error along the lines of "you can't store a float that size".
The code is as follows:
import numpy as np

def softmax(vector): # REQUIRES a unidimensional numpy array
    adjustedVals = [0] * len(vector)
    totalExp = np.exp(vector)
    print("totalExp equals")
    print(totalExp)
    totalSum = totalExp.sum()
    for i in range(len(vector)):
        adjustedVals[i] = (np.exp(vector[i])) / totalSum
    return adjustedVals # this throws back an error sometimes?!?!
After inspection, most recommendations suggest using the decimal module. However, when I've played around with the relevant values at the command line with this module, that is:
from decimal import Decimal
import math
test = Decimal(math.exp(720))
I receive a similar error for any value greater than 709 passed to math.exp.
OverflowError: (34, 'Numerical result out of range')
My conclusion is that even decimal cannot handle this number. Does anyone know of another method I could use to represent these very large floats?
There is a technique which makes the softmax function more feasible computationally for a certain kind of value distribution in your vector. Namely, you can subtract the maximum value in the vector (let's call it x_max) from each of its elements. If you recall the softmax formula, this operation doesn't affect the outcome, as it amounts to multiplying the result by e^(x_max) / e^(x_max) = 1. This way the highest intermediate value you get is e^(x_max - x_max) = 1, so you avoid the overflow.
For additional explanation I recommend the following article: https://nolanbconaway.github.io/blog/2017/softmax-numpy
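A minimal sketch of that max-subtraction trick, applied to a vectorized version of the function above (the function name is illustrative):

import numpy as np

def softmax_stable(vector):
    """Numerically stable softmax for a one-dimensional numpy array."""
    shifted = vector - np.max(vector)   # the largest exponent becomes 0
    exps = np.exp(shifted)              # every value is now in (0, 1]
    return exps / exps.sum()

# Example: large logits that would overflow np.exp directly
print(softmax_stable(np.array([1000.0, 1001.0, 1002.0])))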
With an argument above 709 the function math.exp exceeds the double-precision floating point range and throws this overflow error.
If, instead of math.exp, you use numpy.exp for such large exponents you will see that it evaluates to the special value inf (infinity).
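For illustration, a quick check of both behaviours (the exact error message and warning wording vary by platform and numpy version):

import math
import numpy as np

try:
    math.exp(720)
except OverflowError as e:
    print("math.exp:", e)            # overflow error; message varies by platform

print("numpy.exp:", np.exp(720.0))   # prints inf, with an overflow RuntimeWarning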
All this aside, I wonder why you would want to produce such a big number (I am not sure you are aware how big it is; just to give you an idea, the number of atoms in the universe is estimated to be in the range of 10 to the power of 80, and the number you are trying to produce is MUCH larger than that).

SAS - Kolmogorov-Smirnov Two Sided Critical Values

I am trying to compute the critical values for the two-sided Kolmogorov-Smirnov test (PROC NPAR1WAY doesn't output these!). This is calculated as c(a) * sqrt( (n+m)/(n*m) ), where n and m are the numbers of observations in each dataset and c(a) = 1.36 for significance level a = 0.05.
Either,
A) is there a routine in SAS that will calculate these for me? (I've been searching a while) or,
B) what is the best way to compute the statistic myself? My initial approach is to select the number of rows from each data set into macro variables and then compute the statistic, but this feels ugly.
Thanks in advance
A) Probably not, if you've searched all the relevant documentation.
B) That method sounds fine, but you can use a data step if you prefer, e.g.
data example1 example2;
set sashelp.class;
if _n_ < 6 then output example1;
else output example2;
run;
data _null_;
if 0 then set example1 nobs = n;
if 0 then set example2 nobs = m;
call symput('Kolmogorov_Smirnov_05',1.36 * sqrt((n+m)/(n*m)));
run;
%put &=Kolmogorov_Smirnov_05;
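If you want to sanity-check the number outside SAS, the formula is a one-liner; a minimal Python sketch (n = 5 and m = 14 match the sashelp.class split above):

import math

def ks_critical_value(n, m, c_alpha=1.36):
    """Two-sided Kolmogorov-Smirnov critical value: c(a) * sqrt((n+m)/(n*m))."""
    return c_alpha * math.sqrt((n + m) / (n * m))

# The sashelp.class split above gives n = 5 and m = 14
print(ks_critical_value(5, 14))   # roughly 0.709

(If a p-value is acceptable instead of a critical value, scipy.stats.ks_2samp computes the two-sample statistic and its p-value directly.)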

Time of execution of SAS code

I'm studying for my SAS test and I found the following question:
Given a dataset with 5000 observations, and given the following two datasets (subsets of the first):
Data test1 (keep= cust_id trans_type gender);
Set transaction;
Where trans_type = 'Travel';
Run;
Data test2;
Set transaction (keep= cust_id trans_type gender);
Where trans_type = 'Travel';
Run;
I. Which one of the above two data steps (test1, test2) will take less time to run?
I think both take the same time to run, because basically both carry out the same instructions in a different order. Am I right, or does the order of the instructions affect the runtime?
The answer the book is looking for is that test2 will be faster. That is because there is a difference in the two:
test1: Read in all variables, only write out 3 of them
test2: Read in 3 variables, write out all variables read
SAS has some advantages based on the physical dataset structure that allow it to read subsets of a dataset's variables more efficiently, particularly if those variables are stored consecutively.
However, in real-world scenarios this may or may not be true, and in particular with a 5000-row dataset you probably won't see any difference between the two. For example:
data class1M;
set sashelp.class;
do _i = 1 to 1e6;
output;
end;
run;
data test1(keep=name sex);
set class1M;
run;
data test2;
set class1M(keep=name sex);
run;
Both of these data steps take an identical length of time. That's likely because the dataset is being read into memory and then the needed bits are grabbed from there - a 250MB dataset just isn't big enough to trigger any efficiencies that way.
However, if you add a bunch of other variables:
data class1M;
set sashelp.class;
length a b c d e f g h i j k l 8;
do _i = 1 to 1e6;
output;
end;
run;
data test1(keep=name sex);
set class1M;
run;
data test2;
set class1M(keep=name sex);
run;
Now test1 takes a lot longer to run than test2. That's because the dataset for test1 no longer fits into memory, so it is read piece by piece, while the data read for test2 does fit in memory. Make it a lot bigger row-wise, say 10M rows, and both test1 and test2 will take a long time - but test2 will still be a bit shorter.
test2 will take less time to run, as it brings in fewer variables.

Two independent sample proportion test

This question is pretty simple (I hope). I am working my way through some introductory SAS material and cannot find the proper way of running a two-sample test of proportions.
proc freq data=...;
tables ... / binomial (p=...);
run;
requires a known proportion (i.e. testing against a known value). I'd like to compare two samples of a categorical variable, with null hypothesis p1 = p2 and one-sided alternative p1 < p2.
Data resembles:
V1 Yes
V1 No
V2 Yes
V2 No
For many lines. I need to compare the proportion of Yes's and No's between the two populations (V1 and V2). Can someone point me towards the correct procedure? Google search has left me spinning.
Thanks.
Comparing 1/0 against 1/0 sounds like a chi-squared test.
data for_test;
do _t = 1 to 20;
x1 = ifn(ranuni(7)<0.5,1,0);
x2 = ifn(ranuni(7)<0.5,1,0);
output;
end;
run;
proc freq data=for_test;
tables x1*x2/chisq;
run;
The chi-square test checks equality of the proportions. The one-sided case p1 < p2 is still pending here; it is possible to carry out the test for p1 < p2 from the chi-square output, for example by taking the square root of the chi-square statistic, attaching the sign of the observed difference, and treating it as a z statistic.
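To make the underlying arithmetic explicit, here is a minimal Python sketch of the pooled two-proportion z test; the counts are made up for illustration, and the one-sided p-value for H1: p1 < p2 is read from the lower tail:

from math import sqrt
from scipy.stats import norm

# Hypothetical counts: yes1 "Yes" out of n1 rows in group V1, yes2 out of n2 in V2
yes1, n1 = 40, 100
yes2, n2 = 55, 100

p1, p2 = yes1 / n1, yes2 / n2
p_pool = (yes1 + yes2) / (n1 + n2)                 # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

p_two_sided = 2 * norm.sf(abs(z))    # matches the uncorrected chi-square p-value
p_one_sided = norm.cdf(z)            # H1: p1 < p2 (lower tail)
print(z, p_two_sided, p_one_sided)

In SAS itself, I believe the RISKDIFF option on the TABLES statement in PROC FREQ also reports the difference between the two proportions with confidence limits, which covers the two-sided comparison.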
