Summary statistics for each group and transpose using pandas - python-3.x

I have a dataframe as shown below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,12,12,12],
                   'time': [0,0,0,1,2,3,4,4,0,0,1],
                   'value': [101,102,np.nan,120,143,153,160,170,96,97,99]})
What I would like to do is
a) Get the summary statistics for each subject at each time point (e.g. 0hr, 1hr, 2hr, etc.)
b) NaN rows shouldn't be counted as separate records when computing the mean
I was trying the below
for i in df['person_id'].unique():
    df[df['person_id'].isin([i])].time.unique()
val_mean = df.groupby(['person_id', 'time'])['value'].mean()
val_stddev = df['value'].std()
But I couldn't get the expected output
I expect my output to be as shown below, with one row for each time point (e.g. 0hr, 1hr, 2hr, 3hr, etc.). Please note that NaN rows shouldn't be counted as separate records when computing the mean.
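A minimal sketch of one way to approach this (the exact layout of the expected output is an assumption, since the sample output table is not shown): group by person_id and time, aggregate, then pivot so that time points become rows. It uses the df defined in the question.

# NaN values are skipped by groupby aggregations, so the np.nan row does
# not count as a separate record when computing the mean
stats = (df.groupby(['person_id', 'time'])['value']
           .agg(['mean', 'std', 'count'])
           .reset_index())
print(stats)

# if you want one row per time point with a column block per subject,
# pivot the aggregated result
wide = stats.pivot(index='time', columns='person_id', values=['mean', 'std'])
print(wide)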

Related

How to look into previous three row values to Current Row in Python after applying Group by

How can I get the following expected output in Python?
Sample Input with Expected Output
ACTUAL_EXPECTED_OUTPUT is the expected output column.
The scenario: for each account we need to look at the prior three observations of the IS_DEFAULT column, and if a 1 appears in any of those three observations the result should be 1, otherwise 0.
Group by the account id (ordering by MONTH_SINCE_DISB if needed), then for each account id look at the prior three observations; if a 1 appears in any of them, the new column should be marked 1, otherwise 0. The same logic should be applied iteratively for every account id.
Something like this should work
import numpy as np

# Create a temp column: once the first 1 is found for an ACCT_ID,
# forward-fill the rest of that account's rows to 1
df['ISDEFAULT_TEMP'] = df.groupby('ACCT_ID')['IS_DEFAULT'].apply(
    lambda x: x.replace(to_replace=0, method='ffill'))

# Condition using that temp column: if the cumsum > 2 for an ACCT_ID,
# then True (i.e. an IS_DEFAULT=1 was seen at least 2 rows ago)
cond = df.groupby('ACCT_ID')['ISDEFAULT_TEMP'].transform('cumsum') > 2

# Define the new column from the condition
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(cond, 1, 0)
df.drop('ISDEFAULT_TEMP', axis=1, inplace=True)
df
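For reference, the "a 1 in any of the prior three observations" rule can also be written with a shifted rolling maximum per account. This is only an alternative sketch, not the answer above, and it assumes rows are ordered by MONTH_SINCE_DISB within each ACCT_ID:

import pandas as pd

# order rows so "prior" means earlier months within the same account
df = df.sort_values(['ACCT_ID', 'MONTH_SINCE_DISB'])

# shift(1) excludes the current row; rolling(3).max() checks whether any of
# the previous three IS_DEFAULT values was 1
prior3 = (df.groupby('ACCT_ID')['IS_DEFAULT']
            .transform(lambda s: s.shift(1).rolling(3, min_periods=1).max()))

df['ACTUAL_EXPECTED_OUTPUT'] = prior3.fillna(0).astype(int)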

Opening up observations when given a frequency table

So I have a table and, instead of having 12 rows of frequencies, I would like to expand the table to include all 3303 observations (the total of all frequencies).
I tried using pivot_longer, but all I am getting is the same table with an added column. I could make a data frame for each observation with the total frequency for that observation minus 1 and rbind it to the dataset, but that is 12 lines of code! Is there a simpler way? Let us say the dataset is prostateca.
dataset

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id':[10,20,30,40],'text':['some text','another text','random stuff', 'my cat is a god'],
'A':[0,0,1,1],
'B':[1,1,0,0],
'C':[0,0,0,1],
'D':[1,0,1,0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. That real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text because it is labeled as 1 there. In the same way, A is not related to the 1st and 2nd rows of text because it is labeled as 0.
What I need to do is sample this dataframe so that I have the same, or about the same, number of each feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
The best would be: I can set, for example, n=100, meaning I want to sample so that I have 100 records with each of the features.
This dataset is a multilabel training dataset and is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s per column.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column, each with about the same number of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id) only.
The approach I was trying was to look at the feature with the lowest count of 1s and then use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum value as a threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get one random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
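If you want n rows per label rather than one (the n=100 case mentioned in the question), a rough extension of the same reshape; here n is capped at the number of rows actually available for a label, which is an assumption about how you want to handle rare labels like C:

import pandas as pd

n = 100  # target number of sampled rows per label (value taken from the question)

long = (df
        .set_index(['id', 'text'])
        .replace(0, float('nan'))
        .stack()
        .rename('flag')
        .reset_index()
        .rename(columns={'level_2': 'label'}))

# sample up to n rows per label; labels with fewer than n positive rows keep all of them
sampled = (long.groupby('label', group_keys=False)
               .apply(lambda g: g.sample(n=min(n, len(g)))))
print(sampled)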

Finding average age of incidents in a datetime series

I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years; the two most relevant columns are "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that fall between two dates, but I believe I want the opposite: for each date row in df2, find the rows in df1 where that date falls between opened_at and resolved_at.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some questions about rounding: if an incident was opened last night at 3am, what is its age "today" (1 day, 0 days, 9 hours if we count from noon, etc.)? But once you've got code that works, I assume you can adjust that to taste.
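One detail worth checking before relying on this: incidents that are still outstanding presumably have a missing resolved_at, and a comparison against NaT evaluates to False, so those rows would be excluded by the mask above. A small sketch (assuming NaT marks unresolved incidents) that treats them as still open:

import pandas as pd

# treat a missing resolved_at as "not resolved yet"
resolved = df1["resolved_at"].fillna(pd.Timestamp.max)

for row_num in range(df2.shape[0]):
    d = df2.at[row_num, "date"]
    open_incs = (df1["opened_at"] < d) & (d < resolved)
    df2.at[row_num, "incs_open"] = open_incs.sum()
    df2.at[row_num, "avg_age"] = (d - df1.loc[open_incs, "opened_at"]).mean()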

pandas is not summing a numeric column

I have read an Excel spreadsheet into a DataFrame; it has column names such as Gross, Fee, Net, etc. When I invoke the sum method on the resulting DataFrame, I see that it does not sum the Fee column because several rows have string data in that column. So I first loop through each row, testing that column, and if it contains a string I replace it with 0. The DataFrame sum method still does not sum the Fee column. Yet when I write the resulting DataFrame out to a new Excel spreadsheet, read it back in, and apply the sum method to the new DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd
pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee']) #added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write out Dataframe: to an Excel spreadheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross 8677.90
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Second Spreadsheet
Unnamed: 0 277885.00
Gross 8677.90
Fee -105.47
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Try using pd.to_numeric
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())
The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())
After a quick analysis, what I can see is that you are replacing the strings with integers, so the 'Fee' column holds a mix of types and its dtype stays object. When you call pp.sum(numeric_only=True), that object column is ignored because of the numeric_only condition. Convert the column to float64, as in pp['Fee'] = pd.to_numeric(pp['Fee']), and it should work for you.
The reason it works the second time is that Excel does the data conversion for you, so when you read the file back in, the column already has a numeric dtype.
Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece: it is not sufficient to call pd.to_numeric(pp['Fee']). That returns the column converted to numeric but does not update the original DataFrame, so when I call pp.sum(), nothing in pp has been modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()
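Putting the pieces together, a short sketch that avoids the row-by-row string replacement loop entirely: errors='coerce' turns the stray strings into NaN, and fillna(0) mirrors the original replacement with 0.

import pandas as pd

pp = pd.read_excel('pp.xlsx')

# convert the column once: strings become NaN, then 0, and the dtype is numeric
pp['Fee'] = pd.to_numeric(pp['Fee'], errors='coerce').fillna(0)

print(pp.sum(numeric_only=True))  # Fee is now included in the sum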
