How to put unique values of one series as columns and count each occurrence of the unique values from the series per quarter? - python-3.x

I have a df that looks like this:
date col1
0 2020-01-09T19:25 a
1 2020-01-09T13:27 a
2 2020-01-04T13:44 b
3 2019-12-31T15:37 b
4 2019-12-23T21:47 c
I want to assign the unique values of col1 as column headers, group the dates by quarter, and count the occurrences of each col1 value per quarter.
I can groupby quarters and count like so:
df['date'] = pd.to_datetime(df['date'])
df = df.groupby(df['date'].dt.to_period('Q'))['col1'].agg(['count'])
The df now looks like this:
count
dateresponded
2019Q4 2
2020Q1 3
But I can't tell how the counts break down by the unique values.
I want the df to look like this:
a b c
dateresponded
2019Q4 1 1
2020Q1 2 1

IIUC, you want pd.crosstab
new_df = pd.crosstab(df['date'].dt.to_period('Q'), df['col1'],
                     rownames=['dateresponded'],
                     colnames=[None])
print(new_df)
We could also use groupby + DataFrame.unstack, and rename the axes using DataFrame.rename_axis.
new_df = (df.groupby([df['date'].dt.to_period('Q'), 'col1'])
            .size()
            .unstack(fill_value=0)
            .rename_axis(columns=None, index='dateresponded'))
print(new_df)
Or with value_counts:
new_df = (df.groupby(df['date'].dt.to_period('Q'))
            .col1
            .value_counts()
            .unstack(fill_value=0)
            .rename_axis(columns=None, index='dateresponded'))
print(new_df)
Output
a b c
dateresponded
2019Q4 0 1 1
2020Q1 2 1 0
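For a self-contained check, here is the crosstab route end-to-end on a reconstruction of the sample frame (column and row-label names taken from the question):

```python
import pandas as pd

# Reconstructed sample frame from the question
df = pd.DataFrame({
    'date': ['2020-01-09T19:25', '2020-01-09T13:27', '2020-01-04T13:44',
             '2019-12-31T15:37', '2019-12-23T21:47'],
    'col1': ['a', 'a', 'b', 'b', 'c'],
})
df['date'] = pd.to_datetime(df['date'])

# Cross-tabulate quarter vs. col1 value; missing combinations become 0
new_df = pd.crosstab(df['date'].dt.to_period('Q'), df['col1'],
                     rownames=['dateresponded'], colnames=[None])
print(new_df)
```

The row index is a PeriodIndex, so individual cells can be looked up with pd.Period labels.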

Related

Pandas Aggregate data other than a specific value in specific column

I have my data like this in a pandas DataFrame:
df = pd.DataFrame({
    'ID': range(1, 8),
    'Type': list('XXYYZZZ'),
    'Value': [2, 3, 2, 9, 6, 1, 4]
})
The output that I want to generate is:
How can I generate these results using pandas? I want to include all the Y values of the Type column, and do not want to aggregate them.
First filter rows by boolean indexing, aggregate, append the filtered-out rows back, and finally sort:
mask = df['Type'] == 'Y'
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID': 'first', 'Value': 'sum'})
                .append(df[mask])
                .sort_values('ID'))
print(df1)
ID Type Value
0 1 X 5
2 3 Y 2
3 4 Y 9
1 5 Z 11
If you want the ID column to run from 1 to the length of the data:
mask = df['Type'] == 'Y'
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID': 'first', 'Value': 'sum'})
                .append(df[mask])
                .sort_values('ID')
                .assign(ID=lambda x: np.arange(1, len(x) + 1)))
print(df1)
ID Type Value
0 1 X 5
2 2 Y 2
3 3 Y 9
1 4 Z 11
Another idea is to create a helper column with unique values for the Y rows only, and aggregate by both columns:
mask = df['Type'] == 'Y'
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', 'g'], as_index=False)
         .agg({'ID': 'first', 'Value': 'sum'})
         .drop('g', axis=1)[['ID', 'Type', 'Value']])
print(df1)
ID Type Value
0 1 X 5
1 3 Y 2
2 4 Y 9
3 5 Z 11
A similar alternative passes the helper array g directly, so the drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', g], as_index=False)
         .agg({'ID': 'first', 'Value': 'sum'})[['ID', 'Type', 'Value']])
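Note that DataFrame.append was removed in pandas 2.0; here is a sketch of the first approach using pd.concat instead (same toy data as the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 8),
    'Type': list('XXYYZZZ'),
    'Value': [2, 3, 2, 9, 6, 1, 4]
})

mask = df['Type'] == 'Y'
# Aggregate everything except the Y rows, then add the Y rows back untouched
agg = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID': 'first', 'Value': 'sum'}))
df1 = pd.concat([agg, df[mask]]).sort_values('ID')
print(df1)
```

The behavior is the same as append here: pd.concat aligns the two frames on column names before stacking them.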

How to get/sort descending order between dataframes using python

My goal here is to print the dataframes in descending order.
I have 5 dataframes, each with a column "quantity". I need to calculate the sum of this "quantity" column in each dataframe and print the dataframe names in descending order of that sum.
df1:
order quantity
A 1
B 4
C 3
D 2
df2:
order quantity
A 1
B 4
C 4
D 2
df3:
order quantity
A 1
B 4
C 1
D 2
df4:
order quantity
A 1
B 4
C 1
D 2
df5:
order quantity
A 1
B 4
C 1
D 1
My desired result:
descending order: df2, df1, df3, df4, df5
Here df3 and df4 have equal sums, so they can appear in either order.
Suggestions, please.
Use sorted with a custom lambda key function:
dfs = [df1, df2, df3, df4, df5]
dfs = sorted(dfs, key=lambda x: -x['quantity'].sum())
#another solution
#dfs = sorted(dfs, key=lambda x: x['quantity'].sum(), reverse=True)
print (dfs)
[ order quantity
0 A 1
1 B 4
2 C 4
3 D 2, order quantity
0 A 1
1 B 4
2 C 3
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 1]
EDIT:
dfs = {'df1':df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
dfs = [i for i, j in sorted(dfs.items(), key=lambda x: -x[1]['quantity'].sum())]
print (dfs)
['df2', 'df1', 'df3', 'df4', 'df5']
You can use the sorted built-in to sort a list of dataframes, and sum to get the total of a column:
dfs = [df2,df1,df3,df4,df5]
sorted_dfs = sorted(dfs, key=lambda df: df.quantity.sum(), reverse=True)
Edit: to print only the names of the sorted dataframes (including df5, which the question also has):
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}
sorted_dfs = sorted(df_map.items(), key=lambda kv: kv[1].quantity.sum(), reverse=True)
print(list(x[0] for x in sorted_dfs))
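The ranking can also be sketched with a pandas Series holding the totals, which keeps the sums themselves inspectable (toy frames below are assumed to match the question's quantities):

```python
import pandas as pd

# Toy frames reconstructed from the quantities in the question
frames = {
    'df1': pd.DataFrame({'quantity': [1, 4, 3, 2]}),
    'df2': pd.DataFrame({'quantity': [1, 4, 4, 2]}),
    'df3': pd.DataFrame({'quantity': [1, 4, 1, 2]}),
    'df4': pd.DataFrame({'quantity': [1, 4, 1, 2]}),
    'df5': pd.DataFrame({'quantity': [1, 4, 1, 1]}),
}

# Total each frame, sort the totals; the sorted index is the ranking
totals = pd.Series({name: f['quantity'].sum() for name, f in frames.items()})
order = totals.sort_values(ascending=False).index.tolist()
print(order)
```

As the question notes, df3 and df4 tie, so their relative order within the result is not guaranteed.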

How do I turn a pandas DataFrame's values across multiple columns into its columns?

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id col1 col2 col3
12a a b d
22b d a b
33c c a b
I am trying to convert the values spread across multiple columns into column headers, so the output would look like this:
Desired output:
id a b c d
12a 1 1 0 1
22b 1 1 0 0
33c 1 1 1 0
I have tried adding a value column where value = 1 and using a pivot table:
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index='id', values='value', columns=column)
However, the resulting dataframe has a multilevel index, and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this with pd.get_dummies plus some MultiIndex flattening and a sum:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
id a c d b
0 12a 1 0 1 1
1 22b 1 0 1 1
2 33c 1 1 0 1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
a c d b
id
12a 1 0 1 1
22b 1 0 1 1
33c 1 1 0 1
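An alternative sketch using melt plus pd.crosstab (the long-form column name letter is made up here):

```python
import pandas as pd

df = pd.DataFrame({'id': ['12a', '22b', '33c'],
                   'col1': ['a', 'd', 'c'],
                   'col2': ['b', 'a', 'a'],
                   'col3': ['d', 'b', 'b']})

# Melt to long form (one row per id/letter pair), then cross-tabulate
long_df = df.melt(id_vars='id', value_name='letter')
out = pd.crosstab(long_df['id'], long_df['letter'])
print(out)
```

One caveat: crosstab gives counts, so if a value repeated within a row you would get entries above 1; chaining .clip(upper=1) would cap them at 1. A side benefit is that crosstab sorts the columns, so the output comes out in a b c d order directly.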

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with pivot_table command? If it is not, which is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoiding the filter solution. In my real and more complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other', but when I filter I get this:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id', values=['value', 'other'],
                     aggfunc={'value': len, 'other': sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
You need sum to count the True values created by the condition x > 0:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:(x>0).sum())
print (raw)
value
id
a 2
b 1
As @Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')
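For the updated case with the extra 'other' column, a single pivot_table call with a dict aggfunc can count only the positive values while still summing all of 'other' (a sketch, not the only way):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'value': [0, 15, 20, 30, 0],
                   'other': [2, 3, 4, 5, 6]})

# Count only positive 'value' entries, but sum every 'other' entry
raw = pd.pivot_table(df, index='id',
                     aggfunc={'value': lambda x: (x > 0).sum(),
                              'other': 'sum'})
print(raw)
```

Because no rows are filtered out before pivoting, the 'other' sums stay complete (9 and 11) while the 'value' column still counts only the positive entries.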

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example, but I need a calculation that could handle many such instances (and not just inserting the row of interest manually).
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values in 'COL1' and 'COL3', this works, but is probably not as efficient as it could be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB = []
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n, col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
# .copy() avoids SettingWithCopyWarning on the assignments below
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].copy()
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
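As a possible shortcut for the generalized scenario in the update, here is a sketch with MultiIndex.from_product and reindex (using the question's column names):

```python
import pandas as pd

DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})

# Build every (COL1, COL3) pair; reindexing fills missing pairs with COL2 = 0
full = pd.MultiIndex.from_product([DF['COL1'].unique(), DF['COL3'].unique()],
                                  names=['COL1', 'COL3'])
out = (DF.set_index(['COL1', 'COL3'])
         .reindex(full, fill_value=0)
         .reset_index())
print(out)
```

This replaces the itertools.product/merge/fillna sequence with one reindex, and fill_value=0 also keeps COL2 as an integer column instead of introducing NaNs.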
