How to sort dataframes in descending order by column sum using python - python-3.x

My goal is to print my dataframes in descending order.
I have 5 dataframes and each has a column "quantity". I need to calculate the sum of the "quantity" column in each dataframe and print the dataframes in descending order of that sum.
df1:
order quantity
A 1
B 4
C 3
D 2
df2:
order quantity
A 1
B 4
C 4
D 2
df3:
order quantity
A 1
B 4
C 1
D 2
df4:
order quantity
A 1
B 4
C 1
D 2
df5:
order quantity
A 1
B 4
C 1
D 1
My desired result:
descending order:
df2,df1,df3,df4,df5
Here df3 and df4 are equal, so they can appear in either order.
Any suggestions, please?

Use sorted with a custom lambda function as the key:
dfs = [df1, df2, df3, df4, df5]
dfs = sorted(dfs, key=lambda x: -x['quantity'].sum())
#another solution
#dfs = sorted(dfs, key=lambda x: x['quantity'].sum(), reverse=True)
print (dfs)
[ order quantity
0 A 1
1 B 4
2 C 4
3 D 2, order quantity
0 A 1
1 B 4
2 C 3
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 2, order quantity
0 A 1
1 B 4
2 C 1
3 D 1]
EDIT: To print only the names of the dataframes, store them in a dictionary and sort its items:
dfs = {'df1':df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
dfs = [i for i, j in sorted(dfs.items(), key=lambda x: -x[1]['quantity'].sum())]
print (dfs)
['df2', 'df1', 'df3', 'df4', 'df5']

You can use the built-in sorted function to sort a list of dataframes, and sum to get the total of a column:
dfs = [df2,df1,df3,df4,df5]
sorted_dfs = sorted(dfs, key=lambda df: df.quantity.sum(), reverse=True)
Edit: to print only the names of the sorted dataframes:
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}
sorted_dfs = sorted(df_map.items(), key=lambda kv: kv[1].quantity.sum(), reverse=True)
print(list(x[0] for x in sorted_dfs))
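For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The `frames`, `totals`, and `ranked` names are only illustrative, and the data is copied from the question. It relies on Python's sort being stable, so dataframes with equal sums (df3 and df4) keep their original relative order.

```python
import pandas as pd

orders = list("ABCD")

# The five dataframes from the question, keyed by name (names are illustrative)
frames = {
    "df1": pd.DataFrame({"order": orders, "quantity": [1, 4, 3, 2]}),
    "df2": pd.DataFrame({"order": orders, "quantity": [1, 4, 4, 2]}),
    "df3": pd.DataFrame({"order": orders, "quantity": [1, 4, 1, 2]}),
    "df4": pd.DataFrame({"order": orders, "quantity": [1, 4, 1, 2]}),
    "df5": pd.DataFrame({"order": orders, "quantity": [1, 4, 1, 1]}),
}

# Sum the "quantity" column once per dataframe
totals = {name: int(df["quantity"].sum()) for name, df in frames.items()}

# Sort the names by their totals, largest first
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked)  # ['df2', 'df1', 'df3', 'df4', 'df5']
```

Computing the sums once up front also avoids re-summing the column if the totals are needed again later.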

Related

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting from row 2. I also want to use iterrows to do this. I tried the code below, but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
To meet your requirement, you may try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
To stick to iterrows(), adding each row's value to the previous row's running total:
i=0
df2['cumsum']=0
col=list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b']+df2.iloc[i, col]
    i+=1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2
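If iterrows is not a hard requirement, the expected output from the question can probably be reached without a loop. This is a different technique from the answers above: a sketch that takes the cumulative sum of the whole column and then blanks out the first row with Series.where.

```python
import pandas as pd

df2 = pd.DataFrame({"b": [1, 1, 1]})

# Running total over the whole column: 1, 2, 3
running = df2["b"].cumsum()

# Keep the running total from the second row onwards; the first row becomes NaN
df2["cumsum"] = running.where(df2.index >= 1)
print(df2)
```

Because of the NaN in the first row, the new column is stored as float.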

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where for every few rows, 6 columns remain same and only the 7th changes. I would like to merge all these rows into one row, and combine the value of the 7th column into a list.
So if I have this dataframe:
A B C
0 a 1 2
1 b 3 4
2 c 5 6
3 c 7 6
I would like to convert it to this:
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
Since the values of column A and C were same in row 2 and 3, they would get collapsed into a single row and the values of B will be combined into a list.
Melt, explode, and pivot don't seem to have such functionality. How can I achieve this using Pandas?
Use GroupBy.agg with a custom lambda function, then add DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
If every value in the column should be a list, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a [1] 2
1 b [3] 4
2 c [5, 7] 6
Here is another approach using pivot_table and applymap (in pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map):
(df.pivot_table(index='A',aggfunc=list)
   .applymap(lambda x: x[0] if len(set(x))==1 else x)
   .reset_index())
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
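Putting the groupby approach together as a runnable sketch on the sample data from the question: aggregate the changing column into lists, then unwrap the single-element lists afterwards. The `keys` and `out` names are only illustrative, and the unwrapping step is one possible way to get scalars back, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("abcc"),
    "B": [1, 3, 5, 7],
    "C": [2, 4, 6, 6],
})

changes = ["B"]                                       # column(s) whose values vary
keys = [c for c in df.columns if c not in changes]    # columns that must match

# Collect the varying column into a list per group
out = df.groupby(keys)[changes].agg(list).reset_index()

# Unwrap lists that hold a single value so those rows keep a plain scalar
out["B"] = out["B"].map(lambda v: v[0] if len(v) == 1 else v)

# Restore the original column order
out = out[df.columns]
print(out)
```

Mixing scalars and lists in one column makes it an object column, so keeping everything as lists may be easier to work with downstream.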

Pandas Aggregate data other than a specific value in specific column

I have my data like this in a pandas dataframe:
df = pd.DataFrame({
    'ID':range(1, 8),
    'Type':list('XXYYZZZ'),
    'Value':[2,3,2,9,6,1,4]
})
The output that I want to generate is
How can I generate these results using a pandas dataframe? I want to include all the Y values of the Type column, and do not want to aggregate them.
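First filter the rows with boolean indexing, aggregate the rest, concatenate the filtered-out rows back, and finally sort (pd.concat is used because DataFrame.append was removed in pandas 2.0):
mask = df['Type'] == 'Y'
df1 = (pd.concat([df[~mask].groupby('Type', as_index=False)
                           .agg({'ID':'first', 'Value':'sum'}),
                  df[mask]])
         .sort_values('ID')[['ID','Type','Value']])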
print (df1)
ID Type Value
0 1 X 5
2 3 Y 2
3 4 Y 9
1 5 Z 11
If you want the ID column to run from 1 to the length of the data:
mask = df['Type'] == 'Y'
df1 = (pd.concat([df[~mask].groupby('Type', as_index=False)
                           .agg({'ID':'first', 'Value':'sum'}),
                  df[mask]])
         .sort_values('ID')[['ID','Type','Value']]
         .assign(ID = lambda x: np.arange(1, len(x) + 1)))
print (df1)
ID Type Value
0 1 X 5
2 2 Y 2
3 3 Y 9
1 4 Z 11
Another idea is to create a helper column that is unique only for the Y rows, and aggregate by both columns:
mask = df['Type'] == 'Y'
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type','g'], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})
         .drop('g', axis=1)[['ID','Type','Value']])
print (df1)
ID Type Value
0 1 X 5
1 3 Y 2
2 4 Y 9
3 5 Z 11
A similar alternative keeps g as a separate array instead of a column, so the drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type',g], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})[['ID','Type','Value']])
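For reference, here is a self-contained sketch of the helper-key idea with its imports, written with named aggregation. The `out` name is illustrative, and the helper key is added with assign so that the original df is left unchanged. Each Y row gets its own key, while all other rows share key 0 within their Type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 8),
    "Type": list("XXYYZZZ"),
    "Value": [2, 3, 2, 9, 6, 1, 4],
})

mask = df["Type"] == "Y"

out = (
    # Helper key: 1, 2, ... for successive Y rows, 0 for every other row
    df.assign(g=np.where(mask, mask.cumsum(), 0))
      .groupby(["Type", "g"], as_index=False)
      .agg(ID=("ID", "first"), Value=("Value", "sum"))
      .drop(columns="g")[["ID", "Type", "Value"]]
)
print(out)
```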

Index order of a shuffle dataframe

I have two DataFrames, namely A and B. B is generated by shuffling the rows of A. For each row of B, I would like to know the index of the same row in A.
Example:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A
a b c
0 1 1 1
1 2 2 2
2 3 3 3
B
a b c
0 2 2 2
1 3 3 3
2 1 1 1
The answer should be [1,2,0], because B equals A.loc[[1,2,0]]. I am wondering how to do this efficiently, since my A and B are large.
I came up with a possible solution using DataFrame.merge:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A['index_a'] = A.index
B['index_b'] = B.index
merge_df= pd.merge(A, B, left_on=['a', 'b', 'c'], right_on=['a', 'b', 'c'])
where merge_df is:
a b c index_a index_b
0 1 1 1 0 2
1 2 2 2 1 0
2 3 3 3 2 1
Now you can look up the matching rows in either dataframe.
For example, the row at index 0 in A is at index 2 in B.
NOTE: Rows that do not have a match in both dataframes will not appear in merge_df.
IIUC, use merge:
pd.merge(B.reset_index(), A.reset_index(),
         left_on = A.columns.tolist(),
         right_on = B.columns.tolist()).iloc[:,-1].values
array([1, 2, 0], dtype=int64)
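A slightly more explicit variant of the merge idea, as a sketch: name the index column of A before merging and use a left merge, which keeps the row order of B. The `lookup`, `matched`, and `positions` names are illustrative. This assumes the rows of A are unique; duplicated rows in A would produce more than one match per row of B.

```python
import pandas as pd

A = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})
B = pd.DataFrame({"a": [2, 3, 1], "b": [2, 3, 1], "c": [2, 3, 1]})

cols = A.columns.tolist()

# Carry A's index along as an ordinary column
lookup = A.reset_index().rename(columns={"index": "index_a"})

# A left merge preserves the order of the rows in B
matched = B.merge(lookup, on=cols, how="left")

positions = matched["index_a"].tolist()
print(positions)  # [1, 2, 0]
```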

Selecting data from multiple dataframes

My workbook Rule.xlsx has the following data.
sheet1:
group ordercode quantity
0 A 1
B 3
1 C 1
E 2
D 1
Sheet 2:
group ordercode quantity
0 x 1
y 3
1 x 1
y 2
z 1
I have created the dataframes using the method below.
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is a sequence built from these two dataframes.
df3:
group ordercode quantity
0 A 1
B 3
0 x 1
y 3
1 C 1
E 2
D 1
1 x 1
y 2
z 1
That is, for each group, the rows from df1 followed by the rows from df2.
I would also like to know how I can print the data by selecting a group number (e.g. group(0), group(1), etc.).
Any suggestions?
After some discussion in the comments, the solution is:
#create dict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
#ordering of DataFrames
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
#in a loop, look up each DataFrame, forward fill NaNs and create helper column
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
#last join together, sort and remove helper column
df = pd.concat(L).sort_values(['group','g']).drop('g', axis=1)
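To try the idea without the workbook, here is a sketch that uses two in-memory dataframes as stand-ins for the sheets in the question. The `sheets` dict, the `pos` helper column, and the assumption that the merged "group" cells arrive as NaN are all illustrative. The extra `pos` column records the original row order so the result does not depend on how ties are broken during sorting.

```python
import numpy as np
import pandas as pd

# Stand-ins for the two sheets; merged "group" cells are assumed to read as NaN
sheets = {
    "sheet1": pd.DataFrame({"group": [0, np.nan, 1, np.nan, np.nan],
                            "ordercode": list("ABCED"),
                            "quantity": [1, 3, 1, 2, 1]}),
    "sheet2": pd.DataFrame({"group": [0, np.nan, 1, np.nan, np.nan],
                            "ordercode": list("xyxyz"),
                            "quantity": [1, 3, 1, 2, 1]}),
}
order = ["sheet1", "sheet2"]

# Forward fill the group numbers, tag each sheet (g) and each row position (pos)
parts = [
    sheets[name].ffill().assign(g=i, pos=lambda d: np.arange(len(d)))
    for i, name in enumerate(order)
]

# Interleave by group, then by sheet, then by original row order
df3 = (pd.concat(parts, ignore_index=True)
         .sort_values(["group", "g", "pos"])
         .drop(columns=["g", "pos"]))

# Select the rows of a single group
group0 = df3[df3["group"] == 0]
print(group0)
```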
