Selecting data from multiple dataframes - python-3.x

my workbook Rule.xlsx has following data.
sheet1:
group  ordercode  quantity
0      A          1
       B          3
1      C          1
       E          2
       D          1
sheet2:
group  ordercode  quantity
0      x          1
       y          3
1      x          1
       y          2
       z          1
I have created the dataframes using the method below (data is the ExcelFile for the workbook).
data = pd.ExcelFile('Rule.xlsx')
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is a single dataframe built as a sequence from these two dataframes:
df3:
group  ordercode  quantity
0      A          1
       B          3
0      x          1
       y          3
1      C          1
       E          2
       D          1
1      x          1
       y          2
       z          1
one group block from df1, then the corresponding block from df2.
I also wish to know how I can print the data by selecting a group number (e.g. group 0, group 1, etc.).
Any suggestions?

After some comments, the solution is:
# create a dict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
# ordering of the DataFrames
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
# in a loop, look up each sheet, forward-fill the NaNs and add a helper column
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
# finally join together, sort, and remove the helper column
df = pd.concat(L).sort_values(['group', 'g']).drop('g', axis=1)
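Since the sheet names in the accepted answer come from a different workbook, here is a minimal sketch of the same technique applied to the question's own sheet1/sheet2 data (the frames are recreated in memory as an assumption), including how to select one group afterwards:

```python
import pandas as pd

# Recreate the two sheets in memory (an assumption; the real data comes
# from Rule.xlsx). The blank 'group' cells become NaN, as with merged cells.
dfs = {
    'sheet1': pd.DataFrame({'group': [0, None, 1, None, None],
                            'ordercode': list('ABCED'),
                            'quantity': [1, 3, 1, 2, 1]}),
    'sheet2': pd.DataFrame({'group': [0, None, 1, None, None],
                            'ordercode': list('xyxyz'),
                            'quantity': [1, 3, 1, 2, 1]}),
}
order = ['sheet1', 'sheet2']
# forward-fill the merged 'group' cells and tag each frame with its position
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
df = (pd.concat(L)
        .sort_values(['group', 'g'], kind='stable')
        .drop('g', axis=1)
        .reset_index(drop=True))

# selecting by group number, as the question asks
print(df[df['group'] == 0])
```

The stable sort keeps df1's rows ahead of df2's within each group, which produces exactly the interleaved df3 shown above.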

Related

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement but only get error statements. I think it might need nesting but don't have the experience to do so.
index  A  B  C  D
0      X  d  4  2
1      Y  b  5  2
Column D gives the number of times each row should be repeated.
The desired output is:
index  A  B  C
0      X  d  4
1      X  d  4
2      Y  b  5
3      Y  b  5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
# recreate the sample dataframe
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [3, 2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
   A  B  C
0  X  d  4
1  X  d  4
2  X  d  4
3  Y  b  5
4  Y  b  5

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where, for every few rows, 6 columns stay the same and only the 7th changes. I would like to merge each such run of rows into one row, combining the values of the 7th column into a list.
So if I have this dataframe:
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
3  c  7  6
I would like to convert it to this:
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
Since the values of column A and C were same in row 2 and 3, they would get collapsed into a single row and the values of B will be combined into a list.
Melt, explode, and pivot don't seem to have this functionality. How can I achieve this using Pandas?
Use GroupBy.agg with a custom lambda function, then DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
If every value in the column should be a list, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
   A       B  C
0  a     [1]  2
1  b     [3]  4
2  c  [5, 7]  6
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A', aggfunc=list)
   .applymap(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6

How to join several data frames containing different pieces of one data into one?

I have several - let's say three - data frames that contain different rows (they can sometimes overlap) of another data frame. The columns are the same for all three dfs. I now want to create a final data frame containing all the rows from the three dataframes. Moreover, the final df needs a column recording which of the three dfs each row came from.
Example below
Original data frame:
original_df = pd.DataFrame(np.array([[1,1],[2,2],[3,3],[4,4],[5,5],[6,6]]), columns = ['label1','label2'])
Three dfs containing different pieces of the original df:
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
I want to get the following data frame:
final_df = pd.DataFrame(np.array([[1,1,'a'],[2,2,'a'],[3,3,'b'],[4,4,'c'],\
[5,5,'c'],[6,6,'c']]), columns = ['label1','label2', 'from which df this row'])
or simply use integers to mark from which df the row is:
final_df = pd.DataFrame(np.array([[1,1,1],[2,2,1],[3,3,2],[4,4,3],\
[5,5,3],[6,6,3]]), columns = ['label1','label2', 'from which df this row'])
Thank you in advance!
See this related post
IIUC, you can use pd.concat with the keys and names arguments
pd.concat(
    [a, b, c], keys=['a', 'b', 'c'],
    names=['from which df this row']
).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
However, I'd recommend that you store those dataframe pieces in a dictionary.
parts = {
    'a': original_df.loc[0:1],
    'b': original_df.loc[2:2],
    'c': original_df.loc[3:]
}
pd.concat(parts, names=['from which df this row']).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
And as long as it is stored as a dictionary, you can also use assign like this
pd.concat(d.assign(**{'from which df this row': k}) for k, d in parts.items())
   label1  label2 from which df this row
0       1       1                      a
1       2       2                      a
2       3       3                      b
3       4       4                      c
4       5       5                      c
5       6       6                      c
Keep in mind that I used the double-splat ** because you have a column name with spaces. If you had a column name without spaces, we could do
pd.concat(d.assign(WhichDF=k) for k, d in parts.items())
   label1  label2 WhichDF
0       1       1       a
1       2       2       a
2       3       3       b
3       4       4       c
4       5       5       c
5       6       6       c
Just create a list and concatenate at the end:
list_df = []
list_df.append(df1)
list_df.append(df2)
list_df.append(df3)
df = pd.concat(list_df)
Perhaps this can work / add value for you :)
import pandas as pd
# from your post
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
# create new column to label the datasets
a['label'] = 'a'
b['label'] = 'b'
c['label'] = 'c'
# add each df to a list
combined_l = []
combined_l.append(a)
combined_l.append(b)
combined_l.append(c)
# concat all dfs into 1
df = pd.concat(combined_l)

How to sort dataframes in descending order using python

My goal here is to print the dataframes in descending order.
I have 5 dataframes, each with a column "quantity". I need to calculate the sum of this "quantity" column in each dataframe and print the dataframes in descending order of that sum.
df1:
order  quantity
A      1
B      4
C      3
D      2
df2:
order  quantity
A      1
B      4
C      4
D      2
df3:
order  quantity
A      1
B      4
C      1
D      2
df4:
order  quantity
A      1
B      4
C      1
D      2
df5:
order  quantity
A      1
B      4
C      1
D      1
My desired result:
descending order:
df2, df1, df3, df4, df5
Here df3 and df4 are tied and can appear in either order.
Any suggestions?
Use sorted with a custom key function:
dfs = [df1, df2, df3, df4, df5]
dfs = sorted(dfs, key=lambda x: -x['quantity'].sum())
#another solution
#dfs = sorted(dfs, key=lambda x: x['quantity'].sum(), reverse=True)
print (dfs)
[  order  quantity
0      A         1
1      B         4
2      C         4
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         3
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         1]
EDIT:
dfs = {'df1':df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
dfs = [i for i, j in sorted(dfs.items(), key=lambda x: -x[1]['quantity'].sum())]
print (dfs)
['df2', 'df1', 'df3', 'df4', 'df5']
You can use the sorted builtin to sort a list of dataframes, with sum to total a column:
dfs = [df2,df1,df3,df4,df5]
sorted_dfs = sorted(dfs, key=lambda df: df.quantity.sum(), reverse=True)
Edit: to print only the names of the sorted dataframes
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}
sorted_dfs = sorted(df_map.items(), key=lambda kv: kv[1].quantity.sum(), reverse=True)
print(list(x[0] for x in sorted_dfs))
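The answers above assume df1..df5 already exist. Here is a self-contained sketch of the same idea, recreating the five frames from the question's tables (an assumption):

```python
import pandas as pd

# recreate df1..df5 from the question's tables (an assumption)
quantities = {'df1': [1, 4, 3, 2], 'df2': [1, 4, 4, 2], 'df3': [1, 4, 1, 2],
              'df4': [1, 4, 1, 2], 'df5': [1, 4, 1, 1]}
dfs = {name: pd.DataFrame({'order': list('ABCD'), 'quantity': q})
       for name, q in quantities.items()}

# sorted() is stable, so the tied frames (df3, df4) keep their original order
ranked = sorted(dfs, key=lambda name: dfs[name]['quantity'].sum(), reverse=True)
print(','.join(ranked))  # → df2,df1,df3,df4,df5
```

Keeping the frames in a dictionary from the start makes it trivial to recover the names after sorting, which is the part the original answers had to bolt on in their edits.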

How do I turn a dataframe's values across multiple columns into its columns?

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id   col1  col2  col3
12a  a     b     d
22b  d     a     b
33c  c     a     b
I am trying to convert the values spread across multiple columns into indicator columns, so the output would look like this:
Desired output:
id   a  b  c  d
12a  1  1  0  1
22b  1  1  0  0
33c  1  1  1  0
I have tried adding a value column (value = 1) and using a pivot table:
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index = 'id', values = 'value', columns = column)
However, the resulting data frame is a multilevel index and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this, with pd.get_dummies plus a MultiIndex flatten and sum:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
    id  a  c  d  b
0  12a  1  0  1  1
1  22b  1  0  1  1
2  33c  1  1  0  1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
     a  c  d  b
id
12a  1  0  1  1
22b  1  0  1  1
33c  1  1  0  1
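Another sketch that avoids the MultiIndex step entirely (not from the answers above; the question's dataframe is recreated here as an assumption): melt to long form and cross-tabulate id against the letter values.

```python
import pandas as pd

# recreate the question's dataframe (an assumption)
pandaDf = pd.DataFrame({'id': ['12a', '22b', '33c'],
                        'col1': ['a', 'd', 'c'],
                        'col2': ['b', 'a', 'a'],
                        'col3': ['d', 'b', 'b']})

# reshape to long form, then count occurrences of each value per id
long_df = pandaDf.melt(id_vars='id', value_name='val')
out = pd.crosstab(long_df['id'], long_df['val'])
print(out)
```

pd.crosstab sorts the resulting columns, so this also sidesteps the arbitrary column order of the get_dummies approaches.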
