Working with two data frames with different size in python - python-3.x

I am working with two data frames.
The sample data is as follow:
DF = ['A','B','C','D','E','A','C','B','B']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
i want to add a new column to DF1 as follow:
Team Rating
A 1
B 2
C 3
D 4
E 5
A 1
C 3
B 2
B 2
How can I add a new column?
I used
DF1['Rating']= np.where(DF1['Team']== DF2['Team'],DF2['Rating'],0)
Error : ValueError: Can only compare identically-labeled Series objects
Thanks
ZEP

I think need map by Series created with set_index and if not match get NaNs, so fillna was added for replace to 0:
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
DF = ['A','B','C','D','E','A','C','B','B', 'G']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 A 1.0
6 C 3.0
7 B 2.0
8 B 2.0
9 G 0.0 <- G not in DF2['Team']
Detail:
print (DF1['Team'].map(DF2.set_index('Team')['Rating']))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 1.0
6 3.0
7 2.0
8 2.0
9 NaN
Name: Team, dtype: float64

You can use:
In [54]: DF1['new_col'] = DF1.Team.map(DF2.set_index('Team').Rating)
In [55]: DF1
Out[55]:
Team new_col
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2

i think you can use pd.merge
DF1=pd.merge(DF1,DF2,how='left',on='Team')
DF1
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2

Related

Split value of a row with equal values based on the unique entry

I have a dataframe which looks like this:
Name T1 T2 alpha
A 10 3 30
A 11 5 Nan
A 13 5 Nan
B 5 2 7
B 3 1 Nan
However need to divide the alpha column in equal parts to replace Nan values for each unique names in Name column like this: eg 30 for A becomes 10 each for corresponding row where A is present
Name T1 T2 alpha
A 10 3 10
A 11 5 10
A 13 5 10
B 5 2 3.5
B 3 1 3.5
I tried using explode but it is not working as I want it to look like, any idea here would help
With groupby and transform
df['alpha'] = pd.to_numeric(df['alpha'],errors = 'coerce').fillna(0).groupby(df['Name']).transform('mean')
df
Out[50]:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5
This'll get the job done:
df['alpha'] = (df['alpha'] / df['T2']).ffill()
Output:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with scoer1 less than or equal to 6 in each Class along with count of non NaN rows (Class wise)
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
tried below code
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
'count_non_nan': ('Score1', 'count'),
'score1_less_than_six': ('s', 'sum'),
'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
Total_Number=("Class", "count"),
Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.NAN
Another option would be to assign a temp column to both frames df_item with True where the value is maximal, and df_margin is True everywhere then merge outer and drop the temp column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop('t', 1)
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and Pandas , can someone help me with below report.
I want to report difference of N columns and create new columns with difference value, is it possible to make it dynamic as I have more than 30 columns. (Columns are fixed numbers, rows values can change)
A and B can be Alpha numeric
Use join with sub for difference of DataFrames:
#if columns are strings, first cast it
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
i[v.argpartition(k, 0)[-k:]],
np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
print(your_dataframe.sort_values(ascending=False)[0:4])

Resources