I am working on movie sentiment analysis, but I'm facing issues in my code related to processing words (NLP)

My dataset has 42,000 rows. This is the code I use to clean my text before vectorizing it. The problem is that it has a nested for loop, which I guess makes it very slow, and I can't use it with more than 1,500 rows. Can someone please suggest a better way to do this?
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# df is the 42,000-row dataset with a 'text' column
corpus = []
filtered = []
for i in range(2):
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)

I've used line_profiler to measure the speed of the code you posted.
The measurement results are as follows.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 435819.0 435819.0 0.3 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 247.0 247.0 0.0 reviews = df['review'][:4000]
13 1 0.0 0.0 0.0 corpus = []
14 4001 216341.0 54.1 0.1 for i in range(len(reviews)):
15 4000 221885.0 55.5 0.2 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
16 4000 3878.0 1.0 0.0 rev = rev.lower()
17 4000 30209.0 7.6 0.0 rev = rev.split()
18 4000 1097.0 0.3 0.0 filtered = []
19 950808 235589.0 0.2 0.2 for word in rev:
20 946808 115658060.0 122.2 78.2 if word not in stopwords.words("english"):
21 486614 30898223.0 63.5 20.9 word = PorterStemmer().stem(word)
22 486614 149604.0 0.3 0.1 filtered.append(word)
23 4000 11290.0 2.8 0.0 filtered = " ".join(filtered)
24 4000 1429.0 0.4 0.0 corpus.append(filtered)
As @parsa-abbasi pointed out, checking against the stopword list accounts for about 80% of the total time.
The measurement results for the modified script are as follows. The stopword check has been reduced to about 1/100th of its previous processing time.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 441467.0 441467.0 1.4 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 335.0 335.0 0.0 reviews = df['review'][:4000]
13 1 1.0 1.0 0.0 corpus = []
14 1 2696.0 2696.0 0.0 stopwords_set = stopwords.words('english')
15 4001 59013.0 14.7 0.2 for i in range(len(reviews)):
16 4000 186393.0 46.6 0.6 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
17 4000 3657.0 0.9 0.0 rev = rev.lower()
18 4000 27357.0 6.8 0.1 rev = rev.split()
19 4000 999.0 0.2 0.0 filtered = []
20 950808 220673.0 0.2 0.7 for word in rev:
21 # if word not in stopwords.words("english"):
22 946808 1201271.0 1.3 3.8 if word not in stopwords_set:
23 486614 29479712.0 60.6 92.8 word = PorterStemmer().stem(word)
24 486614 141242.0 0.3 0.4 filtered.append(word)
25 4000 10412.0 2.6 0.0 filtered = " ".join(filtered)
26 4000 1329.0 0.3 0.0 corpus.append(filtered)
I hope this is helpful.
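With the stopword check fixed, the profile above shows that PorterStemmer().stem() now dominates (about 93% of the time), and the code constructs a new PorterStemmer for every word. A further sketch of my own, not part of the measured runs above: reuse a single stemmer instance and cache repeated stems, since words repeat heavily across the reviews.

from functools import lru_cache
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

@lru_cache(maxsize=None)
def cached_stem(word):
    # Words repeat heavily across reviews, so caching avoids re-stemming duplicates.
    return stemmer.stem(word)

Inside the loop, word = cached_stem(word) would then replace PorterStemmer().stem(word).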

The most time-consuming part of the code is the stopword check.
It calls the library to build the list of stopwords on every iteration of the inner loop.
It is therefore better to fetch the stopword set once and reuse it at each iteration.
I rewrote the code as follows (the other differences are made only for readability):
corpus = []
texts = df['text']
stopwords_set = stopwords.words("english")
stemmer = PorterStemmer()
for i in range(len(texts)):
    rev = re.sub('[^a-zA-Z]', ' ', texts[i])
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)
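One further tweak worth noting (an assumption on my part, not something measured in the answers above): stopwords.words("english") returns a plain list, so each membership test scans it linearly. Wrapping it in a set makes every lookup O(1):

stopwords_set = set(stopwords.words("english"))  # set lookup instead of a linear list scan

The rest of the loop stays exactly the same.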

Related

Python - Improve Speed on value_counts

I am wrangling some raw data and wish to count the instances of certain metrics (column 'Stat') in a chain (identified by a unique identifier 'c' in column 'chain_id'); each count is saved to a dict and then mapped to a new column (not shown below).
I wish, however, to:
Improve the speed of the loop, which I have brought to ~10 it/s over the 34k rows, from an initial ~3.
Improve the structure of the many try/except statements, noting that a chain will not always have, for example, a 'Kick' or 'Mark' in the value_counts() output, so these counts need to default to 0.
I've searched for other approaches on SO, but none of the existing answers suit.
import pandas as pd
from tqdm.notebook import tqdm
s = ['Hitout', 'Kick', 'Disposal', 'Centre Clearance', 'Tackle', 'Hitout',
'Hitout To Advantage', 'Free Against', 'Contested Possession', 'Free For',
'Handball', 'Disposal', 'Effective Disposal', 'Stoppage Clearance',
'Uncontested Possession', 'Kick', 'Effective Kick', 'Disposal', 'Effective Disposal',
'Mark', 'Uncontested Possession', 'F 50 Mark', 'Mark On Lead', 'Kick', 'Disposal',
'Shot At Goal', 'Behind', 'Kick In', 'One Percenter', 'Kick', 'Effective Kick',
'Disposal', 'Effective Disposal', 'Rebound 50', 'Spoil', 'One Percenter']
x = ['Hitout', 'RI-1', 'RI-1', 'RI-1', 'RI-1', 'Hitout', 'Hitout', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'CA-27', 'CA-27',
'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27']
df = pd.DataFrame({'chain_id':x,'Stat':s})
# chains, the count dicts (chain_count, hb_count, ...) and time() are defined earlier (not shown).
for c in tqdm(chains):
    if c == 'Hitout':
        chain_count[c] = 0
        hb_count[c] = 0
        ki_count[c] = 0
        m_count[c] = 0
        goal_count[c] = 0
        behind_count[c] = 0
        cp_count[c] = 0
        up_count[c] = 0
        t_count[c] = 0
        chain_time[c] = 0
    else:
        temp = df[df['chain_id']==c]['Stat'].value_counts()
        try:
            chain_count[c] = temp['Disposal']
        except:
            chain_count[c] = 0
        try:
            ki_count[c] = temp['Kick']
        except:
            ki_count[c] = 0
        try:
            hb_count[c] = temp['Handball']
        except:
            hb_count[c] = 0
        try:
            m_count[c] = temp['Mark']
        except:
            m_count[c] = 0
        try:
            goal_count[c] = temp['Goal']
        except:
            goal_count[c] = 0
        try:
            behind_count[c] = temp['Behind']
        except:
            behind_count[c] = 0
        try:
            cp_count[c] = temp['Contested Possession']
        except:
            cp_count[c] = 0
        try:
            up_count[c] = temp['Uncontested Possession']
        except:
            up_count[c] = 0
        try:
            t_count[c] = temp['Tackle']
        except:
            t_count[c] = 0
        chain_time[c] = time(c)
df['chain_length'] = df['chain_id'].map(chain_count)
df['chain_hb'] = df['chain_id'].map(hb_count)
df['chain_ki'] = df['chain_id'].map(ki_count)
df['chain_m'] = df['chain_id'].map(m_count)
df['chain_goal'] = df['chain_id'].map(goal_count)
df['chain_behind'] = df['chain_id'].map(behind_count)
df['chain_cp'] = df['chain_id'].map(cp_count)
df['chain_up'] = df['chain_id'].map(up_count)
df['chain_t'] = df['chain_id'].map(t_count)
df['chain_time'] = df['chain_id'].map(chain_time)
Edited: Included a sample, with output showing how it currently works below
Creating a status column which isolates Disposal, Handball and other acts
df['status']=np.select([df['Stat'].eq('Handball'),df['Stat'].eq('Disposal')],['chain_length','chain_hb'],'Notimportant')
Find the frequency of occurrence of each of the acts
s=df.join(df.groupby(['chain_id','Stat']).apply(lambda x: pd.get_dummies(x['status'])).fillna(0)).drop(columns=['status','Notimportant'])
Use Transform to sum and cascade the totals
s[['chain_hb','chain_length']]=s.groupby('chain_id')[['chain_hb','chain_length']].transform('sum')
Outcome
chain_id Stat chain_hb chain_length
0 Hitout Hitout 0.0 0.0
1 RI-1 Kick 1.0 0.0
2 RI-1 Disposal 1.0 0.0
3 RI-1 Centre Clearance 1.0 0.0
4 RI-1 Tackle 1.0 0.0
5 Hitout Hitout 0.0 0.0
6 Hitout Hitout To Advantage 0.0 0.0
7 RI-7 Free Against 3.0 1.0
8 RI-7 Contested Possession 3.0 1.0
9 RI-7 Free For 3.0 1.0
10 RI-7 Handball 3.0 1.0
11 RI-7 Disposal 3.0 1.0
12 RI-7 Effective Disposal 3.0 1.0
13 RI-7 Stoppage Clearance 3.0 1.0
14 RI-7 Uncontested Possession 3.0 1.0
15 RI-7 Kick 3.0 1.0
16 RI-7 Effective Kick 3.0 1.0
17 RI-7 Disposal 3.0 1.0
18 RI-7 Effective Disposal 3.0 1.0
19 RI-7 Mark 3.0 1.0
20 RI-7 Uncontested Possession 3.0 1.0
21 RI-7 F 50 Mark 3.0 1.0
22 RI-7 Mark On Lead 3.0 1.0
23 RI-7 Kick 3.0 1.0
24 RI-7 Disposal 3.0 1.0
25 RI-7 Shot At Goal 3.0 1.0
26 RI-7 Behind 3.0 1.0
27 CA-27 Kick In 1.0 0.0
28 CA-27 One Percenter 1.0 0.0
29 CA-27 Kick 1.0 0.0
30 CA-27 Effective Kick 1.0 0.0
31 CA-27 Disposal 1.0 0.0
32 CA-27 Effective Disposal 1.0 0.0
33 CA-27 Rebound 50 1.0 0.0
34 CA-27 Spoil 1.0 0.0
35 CA-27 One Percenter 1.0 0.0
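A fully vectorized alternative that removes both the per-chain loop and the try/except blocks is sketched below. It is not taken from the answer above; it reuses the df built in the question (with 'chain_id' and 'Stat' columns) and leaves chain_time aside since time() is not shown.

# Build one chain-by-stat count table; stats missing from a chain become 0,
# so no try/except is needed.
stats = ['Disposal', 'Kick', 'Handball', 'Mark', 'Goal', 'Behind',
         'Contested Possession', 'Uncontested Possession', 'Tackle']
counts = pd.crosstab(df['chain_id'], df['Stat']).reindex(columns=stats, fill_value=0)

# The original loop forces every count for the 'Hitout' chain to zero; mirror that here.
if 'Hitout' in counts.index:
    counts.loc['Hitout'] = 0

# Map each per-chain count back onto the rows.
out_cols = ['chain_length', 'chain_ki', 'chain_hb', 'chain_m', 'chain_goal',
            'chain_behind', 'chain_cp', 'chain_up', 'chain_t']
for out_col, stat in zip(out_cols, stats):
    df[out_col] = df['chain_id'].map(counts[stat])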

Improving speed when using a for loop for each user group

Suppose we have following dataset with the output window_num:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
The user number is represented by user1. The output is window_num, which is generated from the different_months and org_different_months columns. The different_months column is the difference in days between date[n] and date[n+1].
Previously, I was using groupby.apply to output window_num; however, it became extremely slow as the dataset grew. The code was improved considerably by using shift on the entire dataset to calculate the different_months and org_different_months columns, and by sorting the entire dataset once, as seen below:
data = data.sort_values(by=['user','ContractInceptionDateClean'], ascending=[True,True])
#data['user1'] =data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)-data['ContractInceptionDateClean'] ).dt.days)).fillna(0)
data.different_months[data['different_months'] < 91] =0
data['shift_different_months']=data['different_months'].shift(1)
data['org_different_months']=data['different_months']
data.loc[((data['different_months'] == 0) | (data['shift_different_months'] == 0)),'different_months']=0
data = salesswindow_cal(data,list(data.user.unique()))
The code that I am currently struggling to improve the speed on is shown below:
def salesswindow_cal(data_, users):
    temp = pd.DataFrame()
    for u in range(0, len(users)):
        df = data_[data_['user'] == users[u]]
        df['different_months'].values[0] = df['org_different_months'].values[0]
        df['window_num'] = (df['different_months'].diff() != 0).cumsum()
        temp = pd.concat([df, temp], axis=0)
    return pd.DataFrame(temp)
A rule of thumb is not to loop through the users and extract df = data_[data_['user']==user]. Instead, do a groupby:
for u, df in data_.groupby('user'):
    do_some_stuff
Another issue: do not concatenate data iteratively inside the loop. Collect the pieces and concatenate once at the end:
data_out = []
for user, df in data.groupby('user'):
    do_some_stuff
    data_out.append(sub_data)
out = pd.concat(data_out)
In your case, you can write a function and use groupby().apply(), and pandas will concatenate the data for you.
def group_func(df):
    d = df.copy()
    d['different_months'].values[0] = d['org_different_months'].values[0]
    d['window_num'] = d['different_months'].diff().ne(0).cumsum()
    return d

data.groupby('user').apply(group_func)
Update:
Let's try this vectorized approach, which modifies your data in place:
# update the first `different_months`
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']
groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = groups['diff'].cumsum()
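For reference, here is a minimal self-contained sketch of that vectorized update. The tiny frame and its values are made up purely to illustrate the mechanics, not taken from the question's data.

import pandas as pd

data = pd.DataFrame({
    'user':                 [1, 1, 1, 1, 2, 2],
    'different_months':     [0.0, 0.0, 0.0, 300.0, 0.0, 427.0],
    'org_different_months': [243.0, 0.0, 0.0, 300.0, 55.0, 427.0],
})

# First row of each user takes its value from org_different_months.
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']

# A window counter per user: it increments whenever different_months changes.
data['diff'] = data.groupby('user')['different_months'].diff().ne(0)
data['window_num'] = data.groupby('user')['diff'].cumsum()
print(data[['user', 'different_months', 'window_num']])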

How to alter a column of a dataframe with different values by various conditions in Python?

I have a data frame where I want to alter the column "conf" with different values according to which condition is satisfied.
df=pd.DataFrame({"conf":[100,100,100,100],
"i":[-2,3,-3,10],
"o":[12,13,14,16],
"n":[6,4,6,1],
"id":[1.4,2,1.3,1.7],
"od":[2,3,2.5,8],
"nd":[2,3,2.4,-0.9],
"iwr":[60,60,45,65],
"owr":[65,88,90,78],
"nwr":[67,63,60,60]})
df
I want to do this, but I am doing it the wrong way because I am passing a Series to the if condition, which is not the right way.
def column_alter(df):
    if ((df["id"]>1.5) & (df["iwr"]>1.5)):
        if (((df["od"]>1.5) & (df["nd"]>1.5)) & ((df["owr"]>60) & (df["nwr"]>60))):
            df["conf"] = df["conf"]
        else:
            df["conf"] = df["conf"]*0.5
    else:
        if (((df["od"]>1.5) & (df["nd"]>1.5)) & ((df["owr"]>60) & (df["nwr"]>60))):
            df["conf"] = df["conf"]
        else:
            df["conf"] = df["conf"]*0.25
    return df
Required output: I want to return the whole dataframe with the modified conf values, i.e. [100, 100, 25, 50].
For the else branches, it is possible to invert each mask with ~, chain the masks with &, and multiply in place with DataFrame.loc:
m0 = (df["id"]>1.5) & (df["iwr"]>1.5)
m1 = (df["od"]>1.5) & (df["nd"]>1.5) & (df["owr"]>60) & (df["nwr"]>60)
df.loc[m0 & ~m1, "conf"] *= 0.5
df.loc[~m0 & ~m1, "conf"] *= 0.25
print (df)
conf i o n id od nd iwr owr nwr
0 100.0 -2 12 6 1.4 2.0 2.0 60 65 67
1 100.0 3 13 4 2.0 3.0 3.0 60 88 63
2 25.0 -3 14 6 1.3 2.5 2.4 45 90 60
3 50.0 10 16 1 1.7 8.0 -0.9 65 78 60
Another solution with numpy.select:
m0 = (df["id"]>1.5) & (df["iwr"]>1.5)
m1 = (df["od"]>1.5) & (df["nd"]>1.5) & (df["owr"]>60) & (df["nwr"]>60)
df['conf'] *= np.select([m0 & ~m1, ~m0 & ~m1], [0.5, 0.25], default=1)
#long alternative
#df['conf'] = np.select([m0 & ~m1, ~m0 & ~m1],
#                        [df['conf'] * 0.5, df['conf'] * 0.25], default=df['conf'])
print (df)
conf i o n id od nd iwr owr nwr
0 100.0 -2 12 6 1.4 2.0 2.0 60 65 67
1 100.0 3 13 4 2.0 3.0 3.0 60 88 63
2 25.0 -3 14 6 1.3 2.5 2.4 45 90 60
3 50.0 10 16 1 1.7 8.0 -0.9 65 78 60
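For completeness, the same logic can be written with nested numpy.where. This is a sketch of my own, not part of the answers above, and it reuses df and the masks m0 and m1 defined earlier:

# keep conf where m1 holds, halve it where only m0 holds, otherwise quarter it
df['conf'] = np.where(m1, df['conf'],
                      np.where(m0, df['conf'] * 0.5, df['conf'] * 0.25))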

Python Pandas: How to insert a new column which is a sum of next 'n' (can be a fraction also) values of another column?

I've got a DataFrame, let's say its name is 'test', storing data as below:
Week Stock(In Number of Weeks) Demand (In Units)
0 W01 2.4 37
1 W02 3.6 33
2 W03 2.0 46
3 W04 5.8 45
4 W05 4.6 56
5 W06 3.0 38
6 W07 5.0 45
7 W08 7.5 54
8 W09 4.3 35
9 W10 2.2 38
10 W11 2.0 50
11 W12 6.0 37
I want to insert a new column in this dataframe which, for every row, is the sum of the next 'Stock(In Number of Weeks)' rows of the column "Demand (In Units)".
That is, in the case of this dataframe,
for 0th row that new column should be the sum of 2.4 rows of column "Demand(In Units)" which would be 37+33+ 0.4*46
for 1st row, the value should be 33+46+45+ 0.6*56
for 2nd row, it should be 46+45
.
.
.
for 7th row, it should be 54+35+38+50+37 (since number of rows left are smaller than the value 7.5, all the remaining rows get summed up)
.
.
.
and so on.
Effectively, I want my dataframe to have a new column as follows:
Week Stock(In Number of Weeks) Demand (In Units) Stock (In Units)
0 W01 2.4 37 88.4
1 W02 3.6 33 157.6
2 W03 2.0 46 91.0
3 W04 5.8 45 266.0
4 W05 4.6 56 214.0
5 W06 3.0 38 137.0
6 W07 5.0 45 222.0
7 W08 7.5 54 214.0
8 W09 4.3 35 160.0
9 W10 2.2 38 95.4
10 W11 2.0 50 87.0
11 W12 6.0 37 37.0
Can somebody suggest some way to achieve this?
I can achieve it by iterating over each row, but that would be very slow for the millions of rows I want to process at a time.
The code which I am using right now is:
for i in range(len(test)):
    if int(np.floor(test.loc[i, 'Stock(In Number of Weeks)'])) >= len(test[i:]):
        number_of_full_rows = len(test[i:])
        fraction_of_last_row = 0
        y = 0
    else:
        number_of_full_rows = int(np.floor(test.loc[i, 'Stock(In Number of Weeks)']))
        fraction_of_last_row = test.loc[i, 'Stock(In Number of Weeks)'] - number_of_full_rows
        y = test.loc[i+number_of_full_rows, 'Demand (In Units)'] * fraction_of_last_row
    x = np.sum(test[i:i+number_of_full_rows]['Demand (In Units)'])
    test.loc[i, 'Stock (In Units)'] = x+y
I tried with some test data:
def func(r, col):
    n = int(r['Stock(In Number of Weeks)'])        # number of full weeks
    f = float(r['Stock(In Number of Weeks)'] - n)  # fractional part of the last week
    i = r.name                                     # row index value
    z = np.zeros(len(df))                          # initialize all zeros
    v = np.hstack((np.ones(n), np.array(f)))       # vector of ones and the fraction part
    e = min(len(v), len(z[i:]))
    z[i:i+e] = v[:e]                               # place the weights starting at this row
    r['Stock (In Units)'] = z @ col                # compute the scalar product with the demand column
    return r

df = df.apply(lambda r: func(r, df['Demand (In Units)'].values), axis=1)
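For millions of rows, a fully vectorized cumulative-sum variant avoids the per-row work entirely. This is a sketch of my own, not taken from the answers above; it assumes the frame is named test with the two columns shown in the question.

import numpy as np

demand = test['Demand (In Units)'].to_numpy(dtype=float)
weeks = test['Stock(In Number of Weeks)'].to_numpy(dtype=float)
n = len(test)

csum = np.concatenate(([0.0], np.cumsum(demand)))  # csum[k] = sum of the first k demands
idx = np.arange(n)
full = np.floor(weeks).astype(int)                 # number of full weeks per row
frac = weeks - full                                # fractional part of the last week

end_full = np.minimum(idx + full, n)               # end of the full-week span, clipped to the frame
full_sum = csum[end_full] - csum[idx]              # sum over the full weeks

part_idx = idx + full                              # row holding the fractional week
partial = np.where(part_idx < n,
                   frac * demand[np.minimum(part_idx, n - 1)],
                   0.0)

test['Stock (In Units)'] = full_sum + partial

On the sample above this should reproduce the expected Stock (In Units) column (88.4, 157.6, ..., 37.0).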

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame that have the same value in a specified column col. However, I'm stuck on assigning a row of the pandas DataFrame.
Here's my code:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape=(rows, len(data.columns))), columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the next rows. I have checked the value of e and confirmed it to be good, so there is a problem with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e, but I get "ValueError: Incompatible indexer with Series". Is there an alternate way to do this? It seems very straightforward and I'm baffled why it doesn't work...
E.g:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what i get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
The following works for me:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i,:] = e
    return res
col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
But groupby with aggregate mean gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question I solved it by adding a .values to e.
e = data[data[col] == v].mean().to_frame().transpose().values
However it turns out that what I wanted to do is already done by Pandas. Thanks MaxU!
df.groupby(col).mean()
