Two independent sample proportion test

Two independent sample proportion test - statistics

This question is pretty simple (I hope). I am working my way through some introductory to SAS material and cannot find the proper way of running a two sample proportion test of location.
proc freq data;
tables / binomial (p=...)
run;
requires a known proportion (i.e. testing against a known value). I'd like to compare two samples of categorical variables with null hypothesis p1 = p2 and p1 < p2.
Data resembles:
V1 Yes
V1 No
V2 Yes
V2 No
For many lines. I need to compare the proportion of Yes's and No's between the two populations (V1 and V2). Can someone point me towards the correct procedure? Google search has left me spinning.
Thanks.

1/0 and 1/0 seems like a Chi Squared test.
data for_test;
do _t = 1 to 20;
x1 = ifn(ranuni(7)<0.5,1,0);
x2 = ifn(ranuni(7)<0.5,1,0);
output;
end;
run;
proc freq data=for_test;
tables x1*x2/chisq;
run;

The chi square test for quality of proportion. Here the use case of p1 < p2 is still pending. It is possible to do the test for p1 < p2 using chi

Related

Discretizing PDE in space for use with modelica

I am currently doing a course called "Modeling of dynamic systems" and have been given the task of modeling a warm water tank in modelica with a distributed temperature description.
Most of the tasks have gone well, and my group is left with the task of introducing the heat flux due to buoyancy effects into the model. Here is where we get stuck.
the equation given is this:
Given PDE
But how do we discretize this into something we can use in modelica?
The discretized version we ended up with was this:
(Qd_pp_b[k+1] - Qd_pp_b[k]) / h_dz = -K_b *(T[k+1] - 2 * T[k] + T[k-1]) / h_dz^2
where Qd_pp_b is the left-hand side variable, ie the heat flux, k is the current slice of the tank and T is the temperature in the slices.
Are we on the right path? or completely wrong?

This doesn't seem to be a differential equation (as is) so this does not make sense without surrounding problem. For the second derivative you should always create auxiliary variables and for each partial derivative a separate equation. I added dummy values for parameters and dummy equations for T[k]. This can be simulated, is this about what you expected?
model test
constant Integer n = 10;
Real[n] Qd_pp_b;
Real[n] dT;
Real[n] T;
parameter Real K_b = 1;
equation
for k in 1:n loop
der(Qd_pp_b[k]) = -K_b *der(dT[k]);
der(T[k]) = dT[k];
T[k] = sin(time+k);
end for;
end test;

SAS - Kolmogorov-Smirnov Two Sided Critical Values

I am trying to compute the critical values for the two-sided Kolmogorov-Smirnov test (PROC NPAR1WAY doesn't output these!). This is calculated as c(a) * sqrt( (n+m)/(nm) ) where n and m are the number of observations in each dataset and c(a) = 1.36 for for confidence level a = 0.05.
Either,
A) is there a routine in SAS that will calculate these for me? (I've been searching a while) or,
B) what is the best way to compute the statistic myself? My initial approach is to select the number of rows from each data set into macro variables then compute the statistic, but this feel ugly.
Thanks in advance

A) Probably not, if you've searched all the relevant documentation.
B) That method sounds fine, but you can use a data step if you prefer, e.g.
data example1 example2;
set sashelp.class;
if _n_ < 6 then output example1;
else output example2;
run;
data _null_;
if 0 then set example1 nobs = n;
if 0 then set example2 nobs = m;
call symput('Kolmogorov_Smirnov_05',1.36 * sqrt((n+m)/(n*m)));
run;
%put &=Kolmogorov_Smirnov_05;

Probability calculation by Proc means T-test

I need to calculate P-values linked to these t-values will reflect the probability that they are equal to 0 by using PROC Means for each visit. Also I have only one classifying (i,e Treatment group=1). Alpha for this test is set to 0.05.
Following code is what I am using, but I am not getting desired p-value?
PROC MEANS DATA=< input dataset>ALPHA=0.05;
BY _VISNAME;
VAR Total_Score;
OUTPUT OUT=D1<output dataset>
N=
MEAN=
MEDIAN=
STD=
STDERR=
UCLM=
LCLM=
PROBT= / AUTONAME;
RUN;
Please suggest where I am going wrong.

I cannot see your data set, so I can only guess at question. For starters, I would put your descriptive statistical keywords into the top line next to data=. This should at least generate some form of result:
proc means data=work.sample N MEAN MEDIAN STD STDERR UCLM LCLM PROBT T;
BY _VISNAME;
VAR Total_Score;
output out=work.means;
run;

How to calculate correlation if one value is 0

For calculation the pearsons coefficient between two arrays I use the following :
double[] arr1 = new double[4];
arr1[0] = 1;
arr1[1] = 1;
arr1[2] = 1;
arr1[3] = 1;
double[] arr2 = new double[4];
arr2[0] = 1;
arr2[1] = 1;
arr2[2] = 1;
arr2[3] = 1;
PearsonsCorrelation pc = new PearsonsCorrelation();
println("Correlation is "+pc.correlation(arr1, arr2));
For output I receive : Correlation is NaN
The PearsonsCorrelation class is contained in the apache commons API : http://commons.apache.org/proper/commons-math/userguide/stat.html
The values in each of the arrays is based on whether or not a user contains a word in their dataset. The above arrays should be perfectly correlated ?
This question is related to How to set a value's for calculating Eucludeian distance and correlation

Someone had a similar issue here [link]. Apparently, the issue is related to having a 0 standard deviation in your arrays.

You attempt to compute the correlation between two vectors of length four. As all values in each vector are the same (0 in one vector, 1 in the other), this is equivalent to attempting to compute the correlation coefficient between two numbers (0 and 1 on this case).
It is perhaps obvious to see that there is no such a thing; you need at least two distinct pairs. Just as you cannot draw a meaningful regression line if you only have one pair of values.
If only one of the vectors had some variation, the result would still be NA, but it in that case it would be reasonable to set it to zero.

Ways to calculate similarity

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?

Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.
For example
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
library(cluster)
daisy(x, metric = "euclidean")
you'll get :
Dissimilarities :
1 2 3 4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this

Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
Dry, Medium, Oily
enum HairLength
Bald, Short, Medium, Long
UserDifference(user1, user2)
total := 0
total += abs(user1.Age - user2.Age) * 0.1
total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
# etc...
return total
If you really need similarity instead of difference, use 1 / UserDifference(a, b)

You probably should take a look for
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let you your program recognize similarities and clusters in your users collection and try to adapt to them...
You can then know different hidden common groups of related users... (i.e users with green hair usually do not like watching TV..)
As an advice, try to use ready implemented tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects

Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales in one overal difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
double agescale=0.1;
double skinscale=0.5;
double hairscale=1;
double lifestylescale=1;
double agediff = (p1.age-p2.age)*agescale;
double skindiff = (p1.skin-p2.skin)*skinscale;
double hairdiff = (p1.hair-p2.hair)*hairscale;
double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;
double diff = sqrt(agediff^2 + skindiff^2 + hairdiff^2 + lifestylediff^2);
return diff;
}
Note that diff in this example is not on a nice scale like (0..1). It's value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific, it is just designed to quickly give you a working difference metric.

Look at algorithms for computing srting difference. Its very similar to what you need. Store your attributes as a bit string and compute the distance between the strings

You should read these two topics.
Most popular clustering algorithm k - means
And similarity matrix are essential in clustering

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Two independent sample proportion test - statistics

1/0 and 1/0 seems like a Chi Squared test. data for_test; do _t = 1 to 20; x1 = ifn(ranuni(7)<0.5,1,0); x2 = ifn(ranuni(7)<0.5,1,0); output; end; run; proc freq data=for_test; tables x1*x2/chisq; run;

The chi square test for quality of proportion. Here the use case of p1 < p2 is still pending. It is possible to do the test for p1 < p2 using chi

Related

Discretizing PDE in space for use with modelica

SAS - Kolmogorov-Smirnov Two Sided Critical Values

Probability calculation by Proc means T-test

How to calculate correlation if one value is 0

Ways to calculate similarity

Categories

Resources