Removing outliers from a SAS dataset - statistics

I have loaded an Excel table into SAS using this code -
FILENAME REFFILE "/folders/myfolders/subji.xlsx" TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
    DBMS=XLSX
    OUT=ds;
    GETNAMES=YES;
RUN;
I then sorted it for a repeated-measures analysis using this PROC SORT -
PROC SORT DATA=ds;
BY subject Color_Compatibility sameloc;
RUN;
Then I ran PROC UNIVARIATE to compute per-cell means (for the later repeated-measures ANOVA) using this code -
PROC UNIVARIATE DATA=ds NOPRINT;
    VAR resprt;
    OUTPUT OUT=unids1 MEAN=resprt;
    BY subject Color_Compatibility sameloc;
    WHERE Color_Compatibility > 0
        AND practice = 0
        AND outlier = 0
        AND respAC = 1;
RUN;
The outlier column is currently calculated in Excel, but I've noticed that the values Excel's STDEV function gives are not accurate. For that reason I want to create the outlier variable in SAS instead, and then exclude every outlier row from my analysis (using +/-2.5 standard deviations as the benchmark).
How could this be done?
Thanks.

Here's a way to identify outliers using PROC SQL in one step. You can calculate aggregate statistics in SQL, though it leaves a warning in your log about remerging. The key is to ensure that your GROUP BY variable is at the level you want the calculations done. In this example I'm looking for outliers in the MPG_CITY variable from the SASHELP.CARS data set, grouped by the number of cylinders in a vehicle.
*Identify Outliers;
proc sql;
    create table outliers as
    select *,
           std(mpg_city) as std,
           mean(mpg_city) as avg,
           case when ((mpg_city - calculated avg)/(calculated std) < -2.5)
                  or ((mpg_city - calculated avg)/(calculated std) > 2.5)
                then 'Outlier'
                else 'Normal'
           end as outlier_status
    from sashelp.cars
    group by cylinders;
quit;
*Check number of outliers;
proc freq data=outliers;
table outlier_status;
run;
*Print observations of interest;
proc print data=outliers;
where outlier_status='Outlier';
var origin make model cylinders mpg_city std avg;
run;
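If you ever need the same check outside SAS, here is a rough pandas sketch of the identical idea (group-wise mean and standard deviation, flag anything beyond 2.5 SDs). The column names and the tiny data frame are made up to mirror the SASHELP.CARS example; pandas' default sample standard deviation (ddof=1) matches SAS's STD.

```python
import pandas as pd

def flag_outliers(df, value_col, group_col, threshold=2.5):
    """Flag rows more than `threshold` group standard deviations from
    the group mean (same logic as the PROC SQL step above)."""
    grouped = df.groupby(group_col)[value_col]
    z = (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
    return df.assign(
        outlier_status=z.abs().gt(threshold).map({True: "Outlier", False: "Normal"})
    )

# Tiny synthetic check: one obvious outlier (60) in group "a"
cars = pd.DataFrame({
    "cylinders": ["a"] * 9 + ["b"] * 3,
    "mpg_city":  [20, 21, 19, 20, 20, 21, 19, 20, 60, 30, 31, 29],
})
flagged = flag_outliers(cars, "mpg_city", "cylinders")
```

Filtering the outliers out afterwards is then just `flagged[flagged["outlier_status"] == "Normal"]`, the equivalent of the WHERE clause in the original PROC UNIVARIATE step.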

Related

Multi-step time series forecast using Holt-Winters algorithm in python

This is a time-series forecast problem with a dataset that has almost no seasonality and a trend that follows the input data. The data is stationary (the ADF p-value is less than 5%).
I am trying to convert the single-step forecast into a multi-step forecast by feeding the predictions back as inputs to the Holt-Winters algorithm, to obtain predictions for multiple days.
Please find below a small snippet of the logic.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

data = pd.read_csv('test_data.csv')
# After time-series decomposition and a stationarity check using the ADF test
model = ExponentialSmoothing(data).fit()
number_of_days = 5
for i in range(number_of_days):
    yhat = model.predict(len(data), len(data))
    data = pd.DataFrame(data)
    data = data.append(pd.DataFrame(yhat), ignore_index=True)
    data_length = data.size
The forecast (output) is the same value for every day.
Can anyone please help me understand how to tune the algorithm (and/or the logic above) to get a better forecast?

Checking relationship between two categorical object data types column in python

In my Pandas DataFrame there are two categorical variables: the target, which has 2 unique values, and a feature, which has 300 unique values. I want to check the relationship between the two variables using a chi-square test, but both columns have the object data type. How can I perform the chi-square test, i.e. determine whether the two columns are correlated or not?
300 unique values in a variable is a lot, but you can still use the lines below to run the test:
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(df['Feature_Var'], df['Target_Var'])
print(table)
stat, pvalue, dof, expected = chi2_contingency(table)
print('Chi-sq Test Statistic = %.3f \nP-Value = %.3f \nDegrees of Freedom = %d' % (stat, pvalue, dof))
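For a self-contained check of the same recipe, here is a small runnable version with made-up data in which the feature is clearly associated with the target (the column names mirror the placeholders above):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for df: feature level "a" is mostly "yes", "b" mostly "no"
df = pd.DataFrame({
    "Feature_Var": ["a"] * 40 + ["b"] * 40,
    "Target_Var":  ["yes"] * 35 + ["no"] * 5 + ["yes"] * 5 + ["no"] * 35,
})

table = pd.crosstab(df["Feature_Var"], df["Target_Var"])
stat, pvalue, dof, expected = chi2_contingency(table)
# For a 2x2 table, dof is 1 and the tiny p-value flags a strong association
```

Note that scipy applies Yates' continuity correction by default for 2x2 tables; with 300 feature levels the table is 300x2 and no correction is applied, but sparse cells (expected counts below 5) can make the test unreliable.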

How to query attribute table in QGIS to return the sum of a subset

I am using a Jupyter notebook to query a dataset too big to open in Excel, and my PC is too slow to perform the calculations directly in QGIS.
My logic is as follows, after importing pandas:
x = df[(df.OBJECTID == 4440) & (df.Landuse == 'Grass - Urban')]
want_area = x['Area'].sum() #returning the whole dataframe sum for that field!!
summed = x['Area'].sum()
ratio = round(want_area / summed, 2)
How can I tweak the code to obtain the sum of 'Area' for the above subset only, and not for the whole dataframe (800,000+ features)?
Hope my simple question makes sense, and thank you very much!
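For what it's worth, boolean-mask filtering followed by .sum() does restrict the sum to the subset; a minimal sketch with made-up data (the column names follow the snippet above, the values are invented):

```python
import pandas as pd

# Synthetic stand-in for the attribute table
df = pd.DataFrame({
    "OBJECTID": [4440, 4440, 4440, 9999],
    "Landuse":  ["Grass - Urban", "Grass - Urban", "Forest", "Grass - Urban"],
    "Area":     [10.0, 15.0, 100.0, 7.0],
})

# Sum of Area over the filtered subset only
subset = df[(df["OBJECTID"] == 4440) & (df["Landuse"] == "Grass - Urban")]
subset_area = subset["Area"].sum()

# Sum of Area over all rows for that OBJECTID, for the ratio
total_area = df.loc[df["OBJECTID"] == 4440, "Area"].sum()
ratio = round(subset_area / total_area, 2)
```

If the subset sum appears to equal the whole-frame sum, it is worth checking that the filter conditions actually match any rows (e.g. dtype of OBJECTID, exact spelling/whitespace of the Landuse string).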

Hive statistics

I am trying to compute statistics on an ORC file, but I cannot see any changes in PART_COL_STATS even when using
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
To get the max value of a column, Hive still runs a full MapReduce job over the column.
What I want is to use the max value stored in the metastore, but I am unable to pick up these statistics.
My table desc is:
load_inst_id int
src_filename string
server_date date
My analyze query is:
analyze table mytable partition(server_date='2013-11-30') compute statistics for columns load_inst_id;
I always get 0 as load_inst_id; I have to turn off hive.compute.query.using.stats to get the correct result (through a MapReduce max(load_inst_id)).

SAS Code that works like Excel's "VLOOKUP" function

I'm looking for a SAS Code that works just like "VLOOKUP" function in Excel.
I have two tables:
table_1 has an ID column, plus some other columns, and 10 rows. table_2 has two columns, ID and Definition, and 50 rows. I want to create a new variable Definition in table_1 by looking up the ID values in table_2.
I haven't really tried anything other than MERGE, but a plain merge keeps all the extra variables from table_2, and that's not what I want.
Thanks, SE
The simplest way is to use the KEEP= option on your MERGE statement.
data result;
    merge table_1 (in=a) table_2 (in=b keep=id definition);
    by id;
    if a;
run;
An alternative that avoids having to sort your datasets is PROC SQL.
proc sql;
    create table result as
    select a.*,
           b.definition
    from table_1 a
    left join table_2 b on a.id = b.id;
quit;
Finally, there is the hash table option, which works well if table_2 fits in memory:
data result;
    if _n_ = 1 then do;
        declare hash b(dataset:'table_2');
        b.definekey('id');
        b.definedata('definition');
        b.definedone();
        call missing(definition);
    end;
    set table_1;
    if b.find() ne 0 then call missing(definition); /* no match in table_2: leave definition missing instead of erroring or carrying the previous value */
run;
Here is one very useful (and often very fast) method specifically for 1:1 matching, which is what VLOOKUP does: create a format or informat from the match variable and the lookup result, then PUT or INPUT the match variable in the master table.
data class_income;
    set sashelp.class(keep=name);
    income = ceil(12*ranuni(7));
run;

data for_format;
    set class_income end=eof;
    retain fmtname 'INCOMEI';
    start = name;
    label = income;
    type = 'i'; *i=informat numeric, j=informat character, n=format numeric, c=format character;
    output;
    if eof then do;
        hlo = 'o'; *hlo contains some flags, o means OTHER for nonmatching records;
        start = ' ';
        label = .;
        output;
    end;
run;

proc format cntlin=for_format;
quit;

data class;
    set sashelp.class;
    income = input(name, INCOMEI.);
run;
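Outside SAS, the same "left join but keep only the lookup column" pattern is a one-liner in pandas; a small sketch with invented stand-ins for table_1 and table_2:

```python
import pandas as pd

# Hypothetical stand-ins: table_1 is the master, table_2 the lookup table
table_1 = pd.DataFrame({"id": [1, 2, 3], "score": [10, 20, 30]})
table_2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "definition": ["a", "b", "c", "d"],
    "extra": [0, 0, 0, 0],   # extra columns we do NOT want carried over
})

# Selecting only id + definition before merging mirrors keep=id definition;
# how="left" keeps every table_1 row, like "if a;" in the data step merge
result = table_1.merge(table_2[["id", "definition"]], on="id", how="left")
```

Unmatched IDs simply get NaN in definition, the pandas analogue of a missing value after a failed lookup.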
