Output simple T-Test stats to SAS dataset

I am aware this is a very naive question; however, I am trying to find a way to export t-test statistics to a simple output dataset. For example, I am currently running the following code:
proc ttest plots(only) = (summary) data = work.mydf;
    options orientation = landscape;
    class byvar;
    var var1 var2;
    ods output statistics = outputdf;
    by UNIT_ID;
run;
The ods output statistics = outputdf statement yields a dataset with upper and lower confidence intervals, the means of the two groups, upper and lower limits of the standard deviation, etc.
I need a dataset with p-values from the test of equality of variances. Any help is appreciated.

The way to answer this kind of question is typically to add
ods trace on;
before you run the procedure once. The log will then report all of the tables the procedure produces, and you can add ODS OUTPUT statements for the ones you need.
In this case you will see, among other things, the following in the log:
Output Added:
-------------
Name: Equality
Label: Equality of Variances
Template: Stat.TTest.Equality
Path: Ttest.MPG_Highway.Equality
This means you need to add equality (the Name: above) to your ods output statement and give it a dataset name to output to.
ods output statistics = outputdf equality=outputeq;

Related

Analysis of data in .txt files

I have a series of .txt files that I want to analyse all at once. Each file typically contains about 1000 values. I want to check the first 200 values of each file for outliers, where an outlier is any value below 12. I use the code below; however, I get the error
'numpy.bool_' object does not support item assignment. How can I overcome this? Should I not use loadtxt to perform these kinds of checks?
for files in document:
    Rf_file = open(files, "r")
    Rf_value = np.loadtxt(Rf_file)
    # Indicate outliers
    for i in range(0, 200):
        outliers = Rf_value[i] < 12
        Rf_value = Rf_value[outliers]
Without example data it is hard to give a perfect answer, but it is most probably something like this:
import numpy as np

for document in documents:
    values = np.loadtxt(document)            # load all values from the file
    values_200 = values[:200]                # keep only the first 200 values
    outliers = values_200[values_200 < 12]   # boolean mask selects values below 12
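If you also want to keep track of which file each set of low values came from, one option is to collect the results in a dictionary keyed by file path. A minimal sketch, assuming documents is a list of paths to the .txt files:

import numpy as np

documents = ["run1.txt", "run2.txt"]   # assumed list of .txt file paths
low_values_per_file = {}
for path in documents:
    values = np.loadtxt(path)          # load all values from the file
    first_200 = values[:200]           # only inspect the first 200 values
    mask = first_200 < 12              # boolean mask, True where the value is below 12
    low_values_per_file[path] = first_200[mask]

for path, low_values in low_values_per_file.items():
    print(path, "has", len(low_values), "values below 12")

The key change from the question's code is building one boolean mask over the whole slice instead of comparing element by element inside a loop.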

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocessing this data is to create an imputer for every type of value that needs to be fixed:
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
There is a Python package, ctrl4ai, which can do this for you in a simple way:
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the size of the dataset is huge
Returns: Dataframe [with separate column for each categorical values]
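For context, a minimal sketch of how this might be wired up, assuming the data is in a CSV file (the file name big_mart_train.csv is illustrative):

import pandas as pd
from ctrl4ai import preprocessing

df = pd.read_csv("big_mart_train.csv")       # illustrative file name
df_imputed = preprocessing.impute_nulls(df)  # default method is central_tendency
print(df_imputed.isnull().sum())             # check that no nulls remain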
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
If you have a numerical column, you can use some approaches to fill the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Let's see how it works using the mean for one column, e.g.:
One method would be to use fillna from pandas:
X['Name'].fillna(X['Name'].mean(), inplace=True)
For categorical data please have a look here: Impute categorical missing values in scikit-learn
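Putting the numeric and categorical cases together with plain pandas, here is a minimal sketch; the column names (Item_Weight, Outlet_Size, Item_Fat_Content) and the file name are illustrative, so substitute your own:

import pandas as pd

df = pd.read_csv("big_mart_train.csv")  # illustrative file name

# Numeric column: fill missing values with the column mean
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())

# Categorical column: fill missing values with the most frequent value (mode)
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])

# Inconsistent labels: map the variants onto a single spelling
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

Working on named pandas columns also sidesteps the 2D-array requirement of SimpleImputer, since fillna and replace operate directly on a single Series.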

Extracting data from Boxplot on Jupyter

I have a boxplot for a certain attribute, but is there a way to extract the mean, median, mode, midrange, variance, etc. from it, i.e. is there a command that does this easily?
sns.boxplot(x='Pos', y='BLK', data=dataset)
If your dataframe name is dataset, you can use
dataset.describe()
This gives the mean, standard deviation, quartiles, and other summary statistics.
If you want to divide this by groups, use:
dataset.groupby('variable_to_be_grouped').describe()
Here is an example:
x = pd.DataFrame({'x1':[1,2,3,4,5],'x2':[2,4,6,8,10], 'x3':['a','a','a','b','b']})
x.groupby('x3').describe()
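If you want the specific statistics behind the grouped boxplot rather than the full describe() table, agg on the grouped column is one option; a minimal sketch using the Pos and BLK columns from the question:

stats = dataset.groupby('Pos')['BLK'].agg(['mean', 'median', 'var', 'min', 'max'])
print(stats)  # one row per Pos group, one column per statistic

The mode and midrange are not part of describe(), but they can be computed from the same groupby if needed.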

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces a reasonable-looking plot.
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result.
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's a bit tricky: the behavior with an integer window and with an offset window is different.
New in version 0.19.0 are the ability to pass an offset (or convertible) to a .rolling() method and have it produce variable sized windows based on the passed time window. For each time point, this includes all preceding values occurring within the indicated time delta.
This can be particularly useful for a non-regular time frequency index.
You should check out the documentation on time-aware rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code produces plots of a similar shape for r2.corr().plot() and r3.corr().plot(), but note that the calculated values still differ: compare r2.corr(test_df['Series2']) with r3.corr(test_df['Series2']).
I think for a regular time-frequency index you should just stick with r1.
This is mainly because the results of rolling with 365 and with '365D' are different.
For example:
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since rolling(365) produces a lot of NaN values at the start (its default min_periods equals the window size), the correlations computed the two ways end up quite different.
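If you do want the offset-based window to line up with the integer one, a minimal sketch is to set min_periods explicitly on the '365D' window (this assumes a regular daily index with no gaps):

r4 = test_df['Series1'].rolling(window='365D', min_periods=365)
r4.corr(test_df['Series2']).plot()  # with a gap-free daily index this closely matches r1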

SAS homogeneity of variance

I have several categorical variables, which are in the CLASS statement, and I want to test homogeneity of variance using PROC GLM, but it does not display the output of the test.
Proc GLM DATA=MYLIB.musictask;
    CLASS TASK Music_Type Child_number_ID;
    MODEL Emotional_state = Task Music_Type Child_number_ID Task*Music_Type Task*Child_number_ID Music_Type*Child_number_ID;
    Means TASK Music_Type Child_number_ID / hovtest=levene;
run;
quit;
You need to add an ODS OUTPUT statement. The ODS table names for PROC GLM include Bartlett, which is produced when you request HOVTEST=BARTLETT; with HOVTEST=LEVENE the homogeneity-of-variance table has a different name, which you can confirm by running the step once with ods trace on. For the Bartlett case, add lines like these to your code:
ods output bartlett=hovtestoutput;
run;
proc print data=hovtestoutput;
run;
Or something like that.
