Selecting data from multiple dataframes - python-3.x

my workbook Rule.xlsx has following data.
sheet1:
group  ordercode  quantity
0      A          1
       B          3
1      C          1
       E          2
       D          1
sheet2:
group  ordercode  quantity
0      x          1
       y          3
1      x          1
       y          2
       z          1
I have created the dataframes using the method below (data is the ExcelFile for the workbook).
data = pd.ExcelFile('Rule.xlsx')
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is a single dataframe built as a sequence from these two dataframes:
df3:
group  ordercode  quantity
0      A          1
       B          3
0      x          1
       y          3
1      C          1
       E          2
       D          1
1      x          1
       y          2
       z          1
one group block from df1, then the corresponding block from df2.
I also wish to know how I can print the data by selecting a group number (e.g. group 0, group 1, etc.).
Any suggestions?

After some comments, the solution is:
# create a dict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
# ordering of the DataFrames
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
# in a loop, look up each sheet, forward-fill the NaNs and add a helper column
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
# finally join together, sort, and remove the helper column
df = pd.concat(L).sort_values(['group', 'g']).drop('g', axis=1)
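Since the sheet names in the accepted answer come from a different workbook, here is a minimal sketch of the same technique applied to the question's own sheet1/sheet2 data (the frames are recreated in memory as an assumption), including how to select one group afterwards:

```python
import pandas as pd

# Recreate the two sheets in memory (an assumption; the real data comes
# from Rule.xlsx). The blank 'group' cells become NaN, as with merged cells.
dfs = {
    'sheet1': pd.DataFrame({'group': [0, None, 1, None, None],
                            'ordercode': list('ABCED'),
                            'quantity': [1, 3, 1, 2, 1]}),
    'sheet2': pd.DataFrame({'group': [0, None, 1, None, None],
                            'ordercode': list('xyxyz'),
                            'quantity': [1, 3, 1, 2, 1]}),
}
order = ['sheet1', 'sheet2']
# forward-fill the merged 'group' cells and tag each frame with its position
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
df = (pd.concat(L)
        .sort_values(['group', 'g'], kind='stable')
        .drop('g', axis=1)
        .reset_index(drop=True))

# selecting by group number, as the question asks
print(df[df['group'] == 0])
```

The stable sort keeps df1's rows ahead of df2's within each group, which produces exactly the interleaved df3 shown above.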

Related

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement but only get error statements. I think it might need nesting but don't have the experience to do so.
index  A  B  C  D
0      X  d  4  2
1      Y  b  5  2
Column D gives the number of times each row should be repeated.
The desired output is:
index  A  B  C
0      X  d  4
1      X  d  4
2      Y  b  5
3      Y  b  5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
# recreate the sample dataframe
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [3, 2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
   A  B  C
0  X  d  4
1  X  d  4
2  X  d  4
3  Y  b  5
4  Y  b  5

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where, for every few rows, 6 columns stay the same and only the 7th changes. I would like to merge each such run of rows into one row, combining the values of the 7th column into a list.
So if I have this dataframe:
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
3  c  7  6
I would like to convert it to this:
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
Since the values of column A and C were same in row 2 and 3, they would get collapsed into a single row and the values of B will be combined into a list.
Melt, explode, and pivot don't seem to have this functionality. How can I achieve this using Pandas?
Use GroupBy.agg with a custom lambda function, then DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6
If every value in the column should be a list, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
   A       B  C
0  a     [1]  2
1  b     [3]  4
2  c  [5, 7]  6
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A', aggfunc=list)
   .applymap(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())
   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6

How to join several data frames containing different pieces of one data into one?

I have several - let's say three - data frames that contain different rows (they can sometimes overlap) of another data frame. The columns are the same for all three dfs. I now want to create a final data frame containing all the rows from the three dataframes. Moreover, the final df needs a column recording which of the three dfs each row came from.
Example below
Original data frame:
original_df = pd.DataFrame(np.array([[1,1],[2,2],[3,3],[4,4],[5,5],[6,6]]), columns = ['label1','label2'])
Three dfs containing different pieces of the original df:
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
I want to get the following data frame:
final_df = pd.DataFrame(np.array([[1,1,'a'],[2,2,'a'],[3,3,'b'],[4,4,'c'],\
[5,5,'c'],[6,6,'c']]), columns = ['label1','label2', 'from which df this row'])
or simply use integers to mark from which df the row is:
final_df = pd.DataFrame(np.array([[1,1,1],[2,2,1],[3,3,2],[4,4,3],\
[5,5,3],[6,6,3]]), columns = ['label1','label2', 'from which df this row'])
Thank you in advance!
See this related post
IIUC, you can use pd.concat with the keys and names arguments
pd.concat(
    [a, b, c], keys=['a', 'b', 'c'],
    names=['from which df this row']
).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
However, I'd recommend that you store those dataframe pieces in a dictionary.
parts = {
    'a': original_df.loc[0:1],
    'b': original_df.loc[2:2],
    'c': original_df.loc[3:]
}
pd.concat(parts, names=['from which df this row']).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
And as long as it is stored as a dictionary, you can also use assign like this
pd.concat(d.assign(**{'from which df this row': k}) for k, d in parts.items())
   label1  label2 from which df this row
0       1       1                      a
1       2       2                      a
2       3       3                      b
3       4       4                      c
4       5       5                      c
5       6       6                      c
Keep in mind that I used the double-splat ** because you have a column name with spaces. If you had a column name without spaces, we could do
pd.concat(d.assign(WhichDF=k) for k, d in parts.items())
   label1  label2 WhichDF
0       1       1       a
1       2       2       a
2       3       3       b
3       4       4       c
4       5       5       c
5       6       6       c
Just create a list and concatenate at the end:
list_df = []
list_df.append(df1)
list_df.append(df2)
list_df.append(df3)
df = pd.concat(list_df)
Perhaps this can work / add value for you :)
import pandas as pd
# from your post
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
# create new column to label the datasets
a['label'] = 'a'
b['label'] = 'b'
c['label'] = 'c'
# add each df to a list
combined_l = []
combined_l.append(a)
combined_l.append(b)
combined_l.append(c)
# concat all dfs into 1
df = pd.concat(combined_l)

How to sort dataframes in descending order using python

My goal here is to print the dataframes in descending order.
I have 5 dataframes, each with a column "quantity". I need to calculate the sum of this "quantity" column in each dataframe and print the dataframes in descending order of that sum.
df1:
order  quantity
A      1
B      4
C      3
D      2
df2:
order  quantity
A      1
B      4
C      4
D      2
df3:
order  quantity
A      1
B      4
C      1
D      2
df4:
order  quantity
A      1
B      4
C      1
D      2
df5:
order  quantity
A      1
B      4
C      1
D      1
My desired result:
descending order:
df2, df1, df3, df4, df5
Here df3 and df4 are tied and can appear in either order.
Any suggestions?
Use sorted with a custom key function:
dfs = [df1, df2, df3, df4, df5]
dfs = sorted(dfs, key=lambda x: -x['quantity'].sum())
#another solution
#dfs = sorted(dfs, key=lambda x: x['quantity'].sum(), reverse=True)
print (dfs)
[  order  quantity
0      A         1
1      B         4
2      C         4
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         3
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         2,    order  quantity
0      A         1
1      B         4
2      C         1
3      D         1]
EDIT:
dfs = {'df1':df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
dfs = [i for i, j in sorted(dfs.items(), key=lambda x: -x[1]['quantity'].sum())]
print (dfs)
['df2', 'df1', 'df3', 'df4', 'df5']
You can use the sorted builtin to sort a list of dataframes, with sum to total a column:
dfs = [df2,df1,df3,df4,df5]
sorted_dfs = sorted(dfs, key=lambda df: df.quantity.sum(), reverse=True)
Edit: to print only the names of the sorted dataframes
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}
sorted_dfs = sorted(df_map.items(), key=lambda kv: kv[1].quantity.sum(), reverse=True)
print(list(x[0] for x in sorted_dfs))
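The answers above assume df1..df5 already exist. Here is a self-contained sketch of the same idea, recreating the five frames from the question's tables (an assumption):

```python
import pandas as pd

# recreate df1..df5 from the question's tables (an assumption)
quantities = {'df1': [1, 4, 3, 2], 'df2': [1, 4, 4, 2], 'df3': [1, 4, 1, 2],
              'df4': [1, 4, 1, 2], 'df5': [1, 4, 1, 1]}
dfs = {name: pd.DataFrame({'order': list('ABCD'), 'quantity': q})
       for name, q in quantities.items()}

# sorted() is stable, so the tied frames (df3, df4) keep their original order
ranked = sorted(dfs, key=lambda name: dfs[name]['quantity'].sum(), reverse=True)
print(','.join(ranked))  # → df2,df1,df3,df4,df5
```

Keeping the frames in a dictionary from the start makes it trivial to recover the names after sorting, which is the part the original answers had to bolt on in their edits.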

How do I turn a dataframe's values across multiple columns into its columns?

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id   col1  col2  col3
12a  a     b     d
22b  d     a     b
33c  c     a     b
I am trying to convert the values spread across multiple columns into indicator columns, so the output would look like this:
Desired output:
id   a  b  c  d
12a  1  1  0  1
22b  1  1  0  0
33c  1  1  1  0
I have tried adding a value column (value = 1) and using a pivot table:
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index = 'id', values = 'value', columns = column)
However, the resulting data frame is a multilevel index and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this, with pd.get_dummies plus a MultiIndex flatten and sum:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
    id  a  c  d  b
0  12a  1  0  1  1
1  22b  1  0  1  1
2  33c  1  1  0  1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
     a  c  d  b
id
12a  1  0  1  1
22b  1  0  1  1
33c  1  1  0  1
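Another sketch that avoids the MultiIndex step entirely (not from the answers above; the question's dataframe is recreated here as an assumption): melt to long form and cross-tabulate id against the letter values.

```python
import pandas as pd

# recreate the question's dataframe (an assumption)
pandaDf = pd.DataFrame({'id': ['12a', '22b', '33c'],
                        'col1': ['a', 'd', 'c'],
                        'col2': ['b', 'a', 'a'],
                        'col3': ['d', 'b', 'b']})

# reshape to long form, then count occurrences of each value per id
long_df = pandaDf.melt(id_vars='id', value_name='val')
out = pd.crosstab(long_df['id'], long_df['val'])
print(out)
```

pd.crosstab sorts the resulting columns, so this also sidesteps the arbitrary column order of the get_dummies approaches.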
