Pandas Aggregate data other than a specific value in specific column - python-3.x

I have my data in a pandas DataFrame like this:
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 8),
    'Type': list('XXYYZZZ'),
    'Value': [2, 3, 2, 9, 6, 1, 4]
})
The output that I want to generate is shown below. How can I generate these results using pandas? I want to keep all the 'Y' values of the Type column and do not want to aggregate them.
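Desired output (reconstructed from the answers below, since the original example is not shown):
   ID Type  Value
0   1    X      5
1   3    Y      2
2   4    Y      9
3   5    Z     11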

First filter values with a boolean mask, aggregate the rows that are not filtered out, concatenate the filtered-out rows back, and sort last:
mask = df['Type'] == 'Y'
# aggregate everything except the Y rows, then add the Y rows back unchanged
# (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
df1 = (pd.concat([df[~mask].groupby('Type', as_index=False)
                           .agg({'ID': 'first', 'Value': 'sum'}),
                  df[mask]])
         .sort_values('ID'))
print(df1)
   ID Type  Value
0   1    X      5
2   3    Y      2
3   4    Y      9
1   5    Z     11
If you want ID to run from 1 to the length of the data:
import numpy as np

mask = df['Type'] == 'Y'
df1 = (pd.concat([df[~mask].groupby('Type', as_index=False)
                           .agg({'ID': 'first', 'Value': 'sum'}),
                  df[mask]])
         .sort_values('ID')
         .assign(ID=lambda x: np.arange(1, len(x) + 1)))
print(df1)
   ID Type  Value
0   1    X      5
2   2    Y      2
3   3    Y      9
1   4    Z     11
Another idea is to create a helper column that is unique for the Y rows only, and aggregate by both columns:
mask = df['Type'] == 'Y'
# helper column g: unique value per Y row, 0 for all other rows
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', 'g'], as_index=False)
         .agg({'ID': 'first', 'Value': 'sum'})
         .drop('g', axis=1)[['ID', 'Type', 'Value']])
print(df1)
   ID Type  Value
0   1    X      5
1   3    Y      2
2   4    Y      9
3   5    Z     11
A similar alternative passes the array g directly to groupby, so the drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', g], as_index=False)
         .agg({'ID': 'first', 'Value': 'sum'})[['ID', 'Type', 'Value']])
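print(df1) should give the same result as the previous approach:
   ID Type  Value
0   1    X      5
1   3    Y      2
2   4    Y      9
3   5    Z     11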

Related

Split rows to create new rows in Pandas Dataframe

I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry, but only where the number of commas equals 2. My entire dataframe only has values with either 1 or 2 commas.
For example, a should become b:
In [7]: a
Out[7]:
    var1  var2 var3
0  a,b,c     1    X
1  d,e,f     2    Y
2    g,h     3    Z
In [8]: b
Out[8]:
  var1  var2 var3
0  a,c     1    X
1  b,c     1    X
2  d,f     2    Y
3  e,f     2    Y
4  g,h     3    Z
Based on your comment that var1 column has only 1 or 2 commas:
def fn(x):
    x = x.split(",")
    if len(x) == 2:
        # one comma: keep the value as-is
        return [",".join(x)]
    # two commas: pair each of the first two parts with the last part
    return ["{},{}".format(x[0], x[2]), "{},{}".format(x[1], x[2])]

df = df.assign(var1=df["var1"].apply(fn)).explode("var1").reset_index(drop=True)
print(df)
Prints:
  var1  var2 var3
0  a,c     1    X
1  b,c     1    X
2  d,f     2    Y
3  e,f     2    Y
4  g,h     3    Z
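For reference, a minimal frame to reproduce this, built from the example a in the question:

import pandas as pd

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f', 'g,h'],
                   'var2': [1, 2, 3],
                   'var3': ['X', 'Y', 'Z']})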
I have taken the approach that you want combinations of the constituent parts, and that there is specifically one combination you want to exclude. An additional column is used purely for transparency of the solution:
import io
import itertools
import pandas as pd

df = pd.read_csv(io.StringIO("""  var1  var2 var3
0  a,b,c     1    X
1  d,e,f     2    Y
2    g,h     3    Z"""), sep=r"\s+")
df["var1_2"] = df["var1"].str.split(",").apply(
    lambda x: [",".join(list(c))
               for c in itertools.combinations(x, 2)
               if len(x) <= 2 or list(c) != x[:2]])
df.explode("var1_2")
    var1  var2 var3 var1_2
0  a,b,c     1    X    a,c
0  a,b,c     1    X    b,c
1  d,e,f     2    Y    d,f
1  d,e,f     2    Y    e,f
2    g,h     3    Z    g,h
I'm doing this in two steps: first, transform the first column where there are two commas, producing a tuple of strings (this is done by applying func to the first column; each s is a cell's string contents). Then use explode to turn those tuples into a pair of rows.
def func(s):
    t = s.split(',')
    return s if len(t) == 2 else (f'{t[0]},{t[2]}', f'{t[1]},{t[2]}')

df.var1 = df.var1.apply(func)
df = df.explode('var1').reset_index(drop=True)
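Applied to the example a, this should reproduce b:
  var1  var2 var3
0  a,c     1    X
1  b,c     1    X
2  d,f     2    Y
3  e,f     2    Y
4  g,h     3    Z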
Here is another way:
df2 = df.loc[df['var1'].str.count(',').eq(2)]
s = (df2.assign(var1=df2['var1'].str.split(','))
        .explode('var1').groupby(level=0)
        .agg(one=('var1', lambda x: x.iloc[0] + ',' + x.iloc[-1]),
             two=('var1', lambda x: x.iloc[1] + ',' + x.iloc[-1]))
        .stack().droplevel(1))
df2 = pd.concat([df.loc[s.index].assign(var1=s.to_numpy()),
                 df.loc[df['var1'].str.count(',').eq(1)]], ignore_index=True)
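The resulting df2 should again match b from the question: the two-comma rows are expanded first, then the untouched one-comma rows are appended.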

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where, for every few rows, 6 columns remain the same and only the 7th changes. I would like to merge all these rows into one row and combine the values of the 7th column into a list.
So if I have this dataframe:
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
3  c  7  6
I would like to convert it to this:
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
Since the values of columns A and C are the same in rows 2 and 3, those rows get collapsed into a single row and the values of B are combined into a list. Melt, explode, and pivot don't seem to have such functionality. How can I achieve this using pandas?
Use GroupBy.agg with a custom lambda function, and finally DataFrame.reindex to restore the original column order:
# keep a scalar for singleton groups, a list otherwise
f = lambda x: x.tolist() if len(x) > 1 else x.iloc[0]
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
# every column that does not change forms the grouping key
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x.iloc[0]
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print(df)
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
If lists are acceptable in the column for every row, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print(df)
   A       B  C
0  a     [1]  2
1  b     [3]  4
2  c  [5, 7]  6
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A', aggfunc=list)
   .applymap(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())

   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
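Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map, so on recent versions the same idea would be written as:

(df.pivot_table(index='A', aggfunc=list)
   .map(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())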

How to get/sort descending order between dataframes using python

My goal here is to print the dataframes in descending order. I have 5 dataframes, each with a column "quantity". I need to calculate the sum of this "quantity" column in each dataframe and want to print the dataframe names in descending order of that sum.
df1:
order quantity
A 1
B 4
C 3
D 2
df2:
order quantity
A 1
B 4
C 4
D 2
df3:
order quantity
A 1
B 4
C 1
D 2
df4:
order quantity
A 1
B 4
C 1
D 2
df5:
order quantity
A 1
B 4
C 1
D 1
My desired result, in descending order:
df2, df1, df3, df4, df5
Here df3 and df4 are equal, so their relative order can be either way. Suggestions please.
Use sorted with a custom key lambda function:
dfs = [df1, df2, df3, df4, df5]
dfs = sorted(dfs, key=lambda x: -x['quantity'].sum())
# another solution
# dfs = sorted(dfs, key=lambda x: x['quantity'].sum(), reverse=True)
print(dfs)
[  order  quantity
0     A         1
1     B         4
2     C         4
3     D         2,   order  quantity
0     A         1
1     B         4
2     C         3
3     D         2,   order  quantity
0     A         1
1     B         4
2     C         1
3     D         2,   order  quantity
0     A         1
1     B         4
2     C         1
3     D         2,   order  quantity
0     A         1
1     B         4
2     C         1
3     D         1]
EDIT:
dfs = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
dfs = [i for i, j in sorted(dfs.items(), key=lambda x: -x[1]['quantity'].sum())]
print(dfs)
['df2', 'df1', 'df3', 'df4', 'df5']
You can use the sorted builtin to sort a list of dataframes and sum to get the total of a column:
dfs = [df2, df1, df3, df4, df5]
sorted_dfs = sorted(dfs, key=lambda df: df.quantity.sum(), reverse=True)
Edit: to print only the names of the sorted dataframes:
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}
sorted_dfs = sorted(df_map.items(), key=lambda kv: kv[1].quantity.sum(), reverse=True)
print(list(x[0] for x in sorted_dfs))
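A compact variant of the same idea (a sketch; df_map as above):

import pandas as pd

# total quantity per dataframe name, as a Series
sums = pd.Series({name: d['quantity'].sum() for name, d in df_map.items()})
print(sums.sort_values(ascending=False).index.tolist())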

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from an existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind']=='N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind']=='Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition the MarketValue is calculated correctly, but for the elif condition it's not giving the correct value. Can anyone point out what I am doing wrong in this code?
I think you need numpy.select. The if/elif tests call .any() on the whole column, so a single branch ends up applied to every row, while np.select chooses per row. apply can also be removed and the columns multiplied with mul:
import numpy as np

m1 = df['Sell_Ind'] == 'N'
m2 = df['Sell_Ind'] == 'Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a, b], default=0)
Sample:
df = pd.DataFrame({'Sold_price': [7, 8, 9, 4, 2, 3],
                   'SharesUnits': [1, 3, 5, 7, 1, 0],
                   'CurrentPrice': [5, 3, 6, 9, 2, 4],
                   'Sell_Ind': list('NNYYTT')})

m1 = df['Sell_Ind'] == 'N'
m2 = df['Sell_Ind'] == 'Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a, b], default=0)
print(df)
   CurrentPrice Sell_Ind  SharesUnits  Sold_price  MarketValue
0             5        N            1           7          5.0
1             3        N            3           8          9.0
2             6        Y            5           9         45.0
3             9        Y            7           4         28.0
4             2        T            1           2          0.0
5             4        T            0           3          0.0

Python/Pandas return column and row index of found string

I've searched previous answers relating to this, but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe'), where the full sentence is 'Timeframe for wave in ____', and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding a hard-coded position for dynamic reasons. Is there a way to return the [34,0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To check the index you need str.contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A': ['Timeframe for wave in ____', 'a', 'c']})
print(df)
                            A
0  Timeframe for wave in ____
1                           a
2                           c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print(check('Timeframe'))
0
print(check('a'))
[0, 1]
print(check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 'Timeframe'],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)
   A  B          C  D  E  F
0  a  4          7  1  5  a
1  b  5          8  3  3  a
2  c  4          9  5  6  a
3  d  5          4  7  9  b
4  e  5          2  1  2  b
5  f  4  Timeframe  0  4  b

a = np.where(df.values == 'Timeframe')
print(a)
(array([5], dtype=int64), array([2], dtype=int64))

b = [x[0] for x in a]
print(b)
[5, 2]
In case you have multiple columns to look into, you can use the following code example:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], ["a", "b", "Timeframe for wave in____", "d"], [5, 6, 7, 8]])
# build a boolean mask per column, then stack them side by side
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask == True)
result = [find_result[0][0], find_result[1][0]]
Then the output for df and result would be:
>>> df
   0  1                          2  3
0  1  2                          3  4
1  a  b  Timeframe for wave in____  d
2  5  6                          7  8
>>> result
[1, 2]
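Either way, once you have the positional indices you can recover the matching cell with .iloc, e.g.:

# result == [1, 2] from the snippet above
print(df.iloc[result[0], result[1]])  # Timeframe for wave in____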
