Summary of categorical variables from multiply imputed data in SAS

I am using SAS for multiple imputation. After imputing the data I want to produce combined frequency tables of the imputed data with proc freq, but I am unable to do so. Below is the code I have tried; any help will be appreciated. I suspect I am making a mistake in the second and third steps.
proc mi data=data1 nimpute=5 seed=54321 out=imput
        minimum=27 1 1 17.6354    /* bounds in VAR order: age work edu bmi */
        maximum=77 6 3 46.6550;
    class work edu;               /* the DISCRIM variables must be in CLASS; age and bmi are continuous */
    fcs discrim(work edu / details) reg(age bmi);
    var age work edu bmi;
run;
/* Chi-square on the original (pre-imputation) data */
proc freq data=aa.osa_revised1;
    tables work*edu / chisq;
run;
/* Chi-square per imputation, then an attempt to combine across imputations */
proc freq data=imput;
    tables _imputation_*work*edu / chisq;
    ods output chisq=out;
run;

proc mianalyze parms=out;
    modeleffects frequency percentage;
run;

I do get frequency tables for each imputation separately, but I am unable to get the combined frequencies and chi-square statistics from the last two steps.

Related

Opening up observations when given a frequency table

So I have a table, and instead of having 12 rows of frequencies, I would like to expand it to include all 3303 observations (the total of all the frequencies).
I tried using pivot_longer, but all I get is the same table with an added column. I could build a data frame for each row with that row's frequency minus 1 and rbind it to the dataset, but that is 12 lines of code! Is there a simpler way? Say the dataset is prostateca.
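In R, where pivot_longer lives, tidyr::uncount(prostateca, freq) does this expansion in one line, assuming the count column is called freq. For reference, a minimal sketch of the same idea in pandas, with a made-up stand-in table since the question's data is not shown:
import pandas as pd

# hypothetical stand-in for prostateca; the column names are my own
prostateca = pd.DataFrame({'stage': ['I', 'II', 'III'], 'freq': [3, 2, 1]})

# repeat each row freq times, then drop the count column
expanded = (prostateca.loc[prostateca.index.repeat(prostateca['freq'])]
                      .drop(columns='freq')
                      .reset_index(drop=True))
print(expanded)   # 6 rows, one per observation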

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 such columns with values of 0 and 1, and about 100k records.
For example, column A is related to the 3rd and 4th rows of text because it is labeled 1; likewise, A is not related to the 1st and 2nd rows because it is labeled 0.
What I need to do is sample this dataframe so that I have the same, or about the same, number of occurrences of each feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I end up with one text with A, one text with B, one text with C, etc.
Ideally I could set, for example, n=100, meaning I want to sample so that I have 100 records for each feature.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s per column.
For example, with a final dataset of 1k records, I would like all columns from A to the final column to have about the same numbers of 1s and 0s. To accomplish this I would discard only text rows and their ids at random.
The approach I was trying was to look at the feature with the lowest counts of 1s and 0s and use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as the threshold for filtering the text column. I don't know how to do this filtering step.
Thanks
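A minimal sketch of the thresholding idea from Edit 1 (the column handling is my own: only id and text are assumed to be non-label columns; for a more principled approach, iterative stratification such as that in the scikit-multilearn package may be worth a look):
label_cols = df.columns.difference(['id', 'text'])   # assumes every other column is a 0/1 label
threshold = int(df[label_cols].sum().min())          # positives for the rarest label (here 1, from C)

# for each label, sample up to `threshold` rows where that label is 1,
# then drop rows that were picked more than once because they carry several labels
parts = [df[df[col] == 1].sample(n=min(threshold, int(df[col].sum())), random_state=0)
         for col in label_cols]
balanced = pd.concat(parts).drop_duplicates(subset='id')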
The exact output you expect is unclear, but assuming you want one random row per letter with a 1, you could reshape (dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))   # 0 -> NaN so that stack() drops the non-matches
 .stack()                    # long format: one row per (id, text, label) pair with a 1
 .groupby(level=-1)          # group on the label level
 .sample(n=1)                # one random row per label
 .reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
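As a usage note, GroupBy.sample also accepts larger samples, e.g. .sample(n=100) to match the n=100 idea in the question (or replace=True when a label has fewer than 100 positive rows); rows carrying several labels will then be drawn once per label and may need de-duplicating afterwards.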

Summary statistics for each group and transpose using pandas

I have a dataframe like the one shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
                   'time': [0, 0, 0, 1, 2, 3, 4, 4, 0, 0, 1],
                   'value': [101, 102, np.nan, 120, 143, 153, 160, 170, 96, 97, 99]})
What I would like to do is
a) Get the summary statistics for each subject at each time point (e.g. 0hr, 1hr, 2hr, etc.)
b) Note that NaN rows shouldn't be counted as separate records when computing the mean.
I was trying the following:
for i in df['person_id'].unique():
    df[df['person_id'].isin([i])].time.unique()

val_mean = df.groupby(['person_id', 'time'])['value'].mean()
val_stddev = df['value'].std()
But I couldn't get the expected output.
I expect one row for each subject at each time point (e.g. 0hr, 1hr, 2hr, 3hr, etc.), with the NaN row excluded when computing the mean.
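One way to get there (a sketch, not from the original thread): group on both keys and aggregate; pandas' mean and std skip NaN by default, so the np.nan row is not counted as an observation.
# per-subject, per-time summary; NaN values are excluded from mean/std automatically
stats = (df.groupby(['person_id', 'time'])['value']
           .agg(['mean', 'std', 'count'])
           .reset_index())

# transpose so each subject is one row and the time points become columns
wide = stats.pivot(index='person_id', columns='time')
print(stats)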

How do I efficiently generate data with random variation in a time series based on existing data points

I have a handful of data points in a csv as follows:
date value
0 8/1/2019 0.243902
1 8/17/2019 0.322581
2 9/1/2019 0.476190
3 10/6/2019 0.322581
4 10/29/2019 0.476190
5 11/10/2019 0.526316
6 11/21/2019 1.818182
7 12/8/2019 2.500000
8 12/22/2019 3.076923
9 1/5/2020 3.333333
10 1/12/2020 3.333333
11 1/19/2020 0.000000
12 2/2/2020 0.000000
I want to generate a value for every hour between the first date and the last date (assuming that each one starts at 00:00 on that date) such that the generated values create a fairly smooth curve between each existing data point. I would also like to add a small amount of random variation to the generated values if possible so that the curves are not perfectly smooth. I ultimately want to output this new dataset to a csv with the same two columns containing the original rows along with the generated values and their associated datetimes (each in its own row).
Is there a way to easily generate these points and output the result to a csv? I have thus far tried using pandas to store the data, but I can't figure out a way to make the generated data take the existing data points into account.
Let's try scipy.interpolate:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import interpolate

# make sure the dates are datetimes, then build the new hourly timestamps
df['date'] = pd.to_datetime(df['date'])
new_date = pd.date_range(df.date.min(),
                         df.date.max() + pd.to_timedelta('23h'),
                         freq='H')

# fit a cubic spline through the original points (dates as int64 nanoseconds)
tck = interpolate.splrep(df['date'].astype('int64'), df['value'], s=0)
new_values = interpolate.splev(new_date.astype('int64'), tck)

# visualize
plt.plot(df.date, df.value, label='raw')
plt.plot(new_date, new_values, label='interpolated')
plt.legend();
Output: (plot of the raw points against the interpolated hourly curve)
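The answer stops at the plot; a minimal sketch of the remaining two asks, the random variation and the csv output (the 2% noise scale and the file name are my own arbitrary choices):
import numpy as np

# add mild Gaussian jitter so the interpolated curve is not perfectly smooth
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.02 * np.abs(new_values).mean(), size=len(new_values))

# same two columns as the input, covering every hour in the range
out = pd.DataFrame({'date': new_date, 'value': new_values + noise})
out.to_csv('hourly_values.csv', index=False)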

Error in Proc Freq

I have a data set with multiple visits, two treatment arms, and a Vehicle group. I also have a variable, say "SSA", with two values, 1 and 0, where 1 stands for responder and 0 for non-responder subjects. While running PROC FREQ for the chi-square statistics I get the messages below. Here is the code I used:
PROC FREQ DATA=P&V1;
    TABLE TREATMENT*SSA / CHISQ;
    WHERE TREATMENT IN (1 &TR1);   /* &TR1 is treatment 2 or treatment 3 */
RUN;
NOTE: No statistics are computed for TREATMENT * SSA since SSA has less than 2 nonmissing levels.
WARNING: No OUTPUT data set is produced because no statistics can be computed for this table, which has a row or column variable with less than 2 nonmissing levels.
These messages appear for my last visit, where SSA is 0 for every subject in all treatment groups and in the Vehicle group.
