Split value of a row with equal values based on the unique entry - python-3.x

I have a dataframe which looks like this:
Name T1 T2 alpha
A 10 3 30
A 11 5 Nan
A 13 5 Nan
B 5 2 7
B 3 1 Nan
However, I need to divide the alpha value of each unique Name equally among that name's rows, replacing the NaN values. For example, 30 for A becomes 10 in each row where A is present:
Name T1 T2 alpha
A 10 3 10
A 11 5 10
A 13 5 10
B 5 2 3.5
B 3 1 3.5
I tried using explode, but it does not give the result I want. Any ideas would help.

With groupby and transform
df['alpha'] = pd.to_numeric(df['alpha'],errors = 'coerce').fillna(0).groupby(df['Name']).transform('mean')
df
Out[50]:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5
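The one-liner above works because filling the missing entries with 0 leaves each group's sum unchanged, so the group mean equals the single known value divided by the group size. A minimal sketch of the same idea written as sum divided by size, rebuilding the sample frame from the question (storing the missing values as the string 'Nan' is an assumption about the asker's data):

```python
import pandas as pd

# Sample frame from the question; 'Nan' strings are an assumed representation
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B'],
    'T1': [10, 11, 13, 5, 3],
    'T2': [3, 5, 5, 2, 1],
    'alpha': [30, 'Nan', 'Nan', 7, 'Nan'],
})

# Turn anything non-numeric into a real NaN
alpha = pd.to_numeric(df['alpha'], errors='coerce')

# sum skips NaN, size counts every row, so the ratio is the equal share per row
grouped = alpha.groupby(df['Name'])
df['alpha'] = grouped.transform('sum') / grouped.transform('size')
print(df)
```

Unlike the mean-based version, this form makes the "total divided by number of rows" intent explicit; both give the same result when each group has exactly one known value.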

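This works for the sample data, where T2 in the first row of each group happens to equal the group's row count (and assuming alpha is already numeric with real NaN values); it is not a general solution: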
df['alpha'] = (df['alpha'] / df['T2']).ffill()
Output:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5

Related

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with Score1 less than or equal to 6 in each Class, along with the count of non-NaN rows (Class-wise).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
'count_non_nan': ('Score1', 'count'),
'score1_less_than_six': ('s', 'sum'),
'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
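The answer above relies on the difference between 'size' (all rows in the group) and 'count' (non-NaN rows only), and on the fact that comparing NaN with le(6) yields False. A small sketch on a made-up frame (the data and the names total, non_nan and le6 are illustrative assumptions, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'B'],
    'Score1': [9, np.nan, 7, 5, np.nan],
})

# 'size' counts every row; 'count' ignores NaN
out = df.groupby('Class')['Score1'].agg(total='size', non_nan='count')

# NaN <= 6 evaluates to False, so missing scores never count as "<= 6"
le6 = df['Score1'].le(6).groupby(df['Class']).sum()

print(out)
print(le6)
```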
Try:
x = df.groupby("Class", as_index=False).agg(
Total_Number=("Class", "count"),
Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0

For and if loop combination takes lot of time in Pandas (Data manipulation)

I have two datasets, each with about half a million observations. The code below never seems to finish executing. I would like to know if there is a better way of doing it. I'd appreciate any input.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values , meaning all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values. The 'tid' values and consequently the 'rid' values (which are a combination of 'sid' and 'tid' values) may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches with the 'rid' in 'df1', the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in 'df2' will be the median 'tv' value for the matching 'sid' subset in 'df1'.
In fact my original task includes creating a few more similar columns like 'tv' in df2 (based on their values in 'df1' ; these columns exist in 'df1').
I believe it takes forever to execute because my code combines a for loop with an if/else statement and multiple value assignments.
df1
sid tid rid tv
0 0 0 0-0 9
1 0 1 0-1 8
2 0 3 0-3 4
3 1 5 1-5 2
4 1 7 1-7 3
5 1 9 1-9 14
6 1 10 1-10 24
7 1 11 1-11 13
8 2 14 2-14 2
9 2 16 2-16 5
10 3 17 3-17 6
11 3 18 3-18 8
12 3 20 3-20 5
13 3 21 3-21 11
14 4 23 4-23 6
df2
sid tid rid
0 0 0 0-0
1 0 2 0-2
2 1 3 1-3
3 1 6 1-6
4 1 9 1-9
5 2 10 2-10
6 2 12 2-12
7 3 1 3-1
8 3 15 3-15
9 3 1 3-1
10 4 19 4-19
11 4 22 4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s,t = r.split('-')
    if [s,t] in rids:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.rid == r,'tv']
    else:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.sid == int(s),'tv'].median()
The expected df2 shall be as follows:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
You can left merge df2 with the 'rid' and 'tv' columns of df1 on 'rid', then fill the unmatched rows with the per-'sid' median (this assumes df2 has no 'tv' column yet, as in the sample above):
out=df2.merge(df1[['rid','tv']],on='rid',how='left')
out['tv']=out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left merge them on ['sid','rid'] and then fill the missing 'tv' values with the per-'sid' median of df1's 'tv' column, mapped onto df2 with the map() method (again assuming df2 has no 'tv' column yet):
out=df2.merge(df1[['sid','rid','tv']],on=['sid','rid'],how='left')
out['tv']=out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
output of out:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
Here is a suggestion without any loops, based on dictionaries:
matched = df2['rid'].isin(df1['rid'])
# rid -> tv, and sid -> median tv, both built from df1
matching_values = dict(zip(df1['rid'], df1['tv']))
median_values = df1.groupby('sid')['tv'].median().to_dict()
# matched rows look up tv by rid; the remaining rows look up the median by sid
df2.loc[matched, 'tv'] = df2.loc[matched, 'rid'].map(matching_values)
df2.loc[~matched, 'tv'] = df2.loc[~matched, 'sid'].map(median_values)
This should do the trick. The logic is that we first build two dictionaries: one keyed by "rid" holding the matching "tv" values, and one keyed by "sid" holding the median "tv" values from df1. Rows of df2 whose "rid" appears in df1 then look up their value by "rid", and the remaining rows look up the median by "sid", both via .map(). Assigning through .loc with a boolean mask matters here: a chained assignment such as df2[mask]['tv'] = ... writes to a temporary copy and leaves df2 unchanged.
Don't use for loops in pandas, that is known to be slow. That way you don't get to benefit from all the internal optimizations that have been made.
Try to use the split-apply-combine pattern:
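1. group df1 by sid to calculate the median: df1.groupby('sid')['tv'].median()
2. join df2 on the tv column of df1: df2.join(df1.set_index('rid')['tv'], on='rid') (joining the whole of df1 would fail on the overlapping sid and tid columns unless a suffix is given)
3. fill the NaN values with the median calculated in step 1.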
(Haven't tested the code).
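The three steps above can be sketched end to end on the sample frames from the question. This is a minimal sketch, not a benchmarked solution; the variable names sid_median and out are illustrative assumptions:

```python
import pandas as pd

# Sample data from the question; rid is rebuilt as "sid-tid"
df1 = pd.DataFrame({
    'sid': [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    'tid': [0, 1, 3, 5, 7, 9, 10, 11, 14, 16, 17, 18, 20, 21, 23],
    'tv':  [9, 8, 4, 2, 3, 14, 24, 13, 2, 5, 6, 8, 5, 11, 6],
})
df1['rid'] = df1['sid'].astype(str) + '-' + df1['tid'].astype(str)

df2 = pd.DataFrame({
    'sid': [0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'tid': [0, 2, 3, 6, 9, 10, 12, 1, 15, 1, 19, 22],
})
df2['rid'] = df2['sid'].astype(str) + '-' + df2['tid'].astype(str)

# Step 1: median tv per sid, computed once from df1
sid_median = df1.groupby('sid')['tv'].median()

# Step 2: bring in tv by rid; rids missing from df1 become NaN
out = df2.join(df1.set_index('rid')['tv'], on='rid')

# Step 3: fill the gaps with the median of the matching sid
out['tv'] = out['tv'].fillna(out['sid'].map(sid_median))
print(out)
```

Each step is a single vectorized call, so no Python-level loop over the rows of df2 is needed.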

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is joined only to the row with the maximum sequence number for each item. The desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (this needs import numpy as np):
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option is to add a temporary column t to both frames (True in df_item only where sequence is the group maximum, True everywhere in df_margin), then outer merge and drop the temporary column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')

pandas groupby extract top percent n data(descending)

I have some data like this
df = pd.DataFrame({'class':['a','a','b','b','a','a','b','c','c'],'score':[3,5,6,7,8,9,10,11,14]})
df
class score
0 a 3
1 a 5
2 b 6
3 b 7
4 a 8
5 a 9
6 b 10
7 c 11
8 c 14
I want to use groupby to extract the top n% of rows in each group (by descending score). I know nlargest can do this, but every group has a different number of rows, so I don't know how to apply it.
I tried this function
top_n = 0.5
g = df.groupby(['class'])['score'].apply(lambda x:x.nlargest(int(round(top_n*len(x))),keep='all')).reset_index()
g
class level_1 score
0 a 5 9
1 a 4 8
2 b 6 10
3 b 3 7
4 c 8 14
However, it is very slow on large data (more than 10 million rows). How can I speed it up? Thank you!
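One possible way to avoid calling a Python lambda per group is to compute the per-group cutoff and the per-group rank with built-in groupby operations and then filter once. This is a sketch under the assumption that the goal is the same selection as the nlargest call above (ties at the cutoff kept, as with keep='all'); it has not been timed on 10 million rows, and the names k, rank and top are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'c', 'c'],
    'score': [3, 5, 6, 7, 8, 9, 10, 11, 14],
})
top_n = 0.5

grp = df.groupby('class')['score']

# Rows to keep per group, broadcast to every row of that group
k = (grp.transform('size') * top_n).round().astype(int)

# Rank 1 is the largest score; method='min' gives tied scores the same rank,
# so ties at the cutoff are all kept
rank = grp.rank(method='min', ascending=False)

top = df[rank <= k].sort_values(['class', 'score'], ascending=[True, False])
print(top)
```

On the sample data this selects the same rows as the nlargest version (9 and 8 for a, 10 and 7 for b, 14 for c), while keeping the original row index instead of adding a level_1 column.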

Working with two data frames with different size in python

I am working with two data frames.
The sample data is as follows:
DF = ['A','B','C','D','E','A','C','B','B']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
I want to add a new column to DF1 as follows:
Team Rating
A 1
B 2
C 3
D 4
E 5
A 1
C 3
B 2
B 2
How can I add a new column?
I used
DF1['Rating']= np.where(DF1['Team']== DF2['Team'],DF2['Rating'],0)
Error : ValueError: Can only compare identically-labeled Series objects
Thanks
ZEP
I think you need map with a Series created by set_index. Teams with no match get NaN, so fillna is added to replace those with 0:
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
DF = ['A','B','C','D','E','A','C','B','B', 'G']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 A 1.0
6 C 3.0
7 B 2.0
8 B 2.0
9 G 0.0 <- G not in DF2['Team']
Detail:
print (DF1['Team'].map(DF2.set_index('Team')['Rating']))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 1.0
6 3.0
7 2.0
8 2.0
9 NaN
Name: Team, dtype: float64
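The detail above shows why the result turns into floats: map looks up each Team in the index of the Rating Series, and the unmatched G becomes NaN, which forces a float dtype. A minimal sketch, assuming integer ratings are wanted back after filling (the trailing astype(int) and the name lookup are additions for illustration, not part of the answer):

```python
import pandas as pd

DF1 = pd.DataFrame({'Team': ['A', 'B', 'C', 'D', 'E', 'A', 'C', 'B', 'B', 'G']})
DF2 = pd.DataFrame({'Team': ['A', 'B', 'C', 'D', 'E'], 'Rating': [1, 2, 3, 4, 5]})

# Series indexed by Team: acts as a lookup table for map
lookup = DF2.set_index('Team')['Rating']

# Unmatched teams become NaN, are filled with 0, then cast back to int
DF1['Rating'] = DF1['Team'].map(lookup).fillna(0).astype(int)
print(DF1)
```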
You can use:
In [54]: DF1['new_col'] = DF1.Team.map(DF2.set_index('Team').Rating)
In [55]: DF1
Out[55]:
Team new_col
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
I think you can use pd.merge:
DF1=pd.merge(DF1,DF2,how='left',on='Team')
DF1
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
