Python - Improve Speed on value_counts - python-3.x

I am wrangling some raw data and wish to count the instances of certain metrics (column 'Stat') in a chain (identified by a unique identifier 'c' in column 'chain_id'). Each count is saved to a dict and then mapped to a new column (not shown below).
I wish however to:
Improve the speed of the loop, which I have managed to get to ~10 it/s over the 34k rows, from an initial ~3.
Improve the structure of the many try/except statements, noting that a given chain will not always have, for example, a 'Kick' or a 'Mark' in its value_counts() output, in which case the count needs to default to 0.
I've searched for other approaches on SO, but none of the existing answers suit.
import pandas as pd
from tqdm.notebook import tqdm
s = ['Hitout', 'Kick', 'Disposal', 'Centre Clearance', 'Tackle', 'Hitout',
'Hitout To Advantage', 'Free Against', 'Contested Possession', 'Free For',
'Handball', 'Disposal', 'Effective Disposal', 'Stoppage Clearance',
'Uncontested Possession', 'Kick', 'Effective Kick', 'Disposal', 'Effective Disposal',
'Mark', 'Uncontested Possession', 'F 50 Mark', 'Mark On Lead', 'Kick', 'Disposal',
'Shot At Goal', 'Behind', 'Kick In', 'One Percenter', 'Kick', 'Effective Kick',
'Disposal', 'Effective Disposal', 'Rebound 50', 'Spoil', 'One Percenter']
x = ['Hitout', 'RI-1', 'RI-1', 'RI-1', 'RI-1', 'Hitout', 'Hitout', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'CA-27', 'CA-27',
'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27']
df = pd.DataFrame({'chain_id':x,'Stat':s})
# Set-up assumed by the loop below: the list of chain ids and one dict per metric.
# time() is the asker's own helper for the chain duration (not shown).
chains = df['chain_id'].unique()
chain_count, hb_count, ki_count, m_count, goal_count = {}, {}, {}, {}, {}
behind_count, cp_count, up_count, t_count, chain_time = {}, {}, {}, {}, {}

for c in tqdm(chains):
    if c == 'Hitout':
        chain_count[c] = 0
        hb_count[c] = 0
        ki_count[c] = 0
        m_count[c] = 0
        goal_count[c] = 0
        behind_count[c] = 0
        cp_count[c] = 0
        up_count[c] = 0
        t_count[c] = 0
        chain_time[c] = 0
    else:
        temp = df[df['chain_id'] == c]['Stat'].value_counts()
        try:
            chain_count[c] = temp['Disposal']
        except KeyError:
            chain_count[c] = 0
        try:
            ki_count[c] = temp['Kick']
        except KeyError:
            ki_count[c] = 0
        try:
            hb_count[c] = temp['Handball']
        except KeyError:
            hb_count[c] = 0
        try:
            m_count[c] = temp['Mark']
        except KeyError:
            m_count[c] = 0
        try:
            goal_count[c] = temp['Goal']
        except KeyError:
            goal_count[c] = 0
        try:
            behind_count[c] = temp['Behind']
        except KeyError:
            behind_count[c] = 0
        try:
            cp_count[c] = temp['Contested Possession']
        except KeyError:
            cp_count[c] = 0
        try:
            up_count[c] = temp['Uncontested Possession']
        except KeyError:
            up_count[c] = 0
        try:
            t_count[c] = temp['Tackle']
        except KeyError:
            t_count[c] = 0
        chain_time[c] = time(c)

df['chain_length'] = df['chain_id'].map(chain_count)
df['chain_hb'] = df['chain_id'].map(hb_count)
df['chain_ki'] = df['chain_id'].map(ki_count)
df['chain_m'] = df['chain_id'].map(m_count)
df['chain_goal'] = df['chain_id'].map(goal_count)
df['chain_behind'] = df['chain_id'].map(behind_count)
df['chain_cp'] = df['chain_id'].map(cp_count)
df['chain_up'] = df['chain_id'].map(up_count)
df['chain_t'] = df['chain_id'].map(t_count)
df['chain_time'] = df['chain_id'].map(chain_time)
Edit: I've included a sample dataset above; the approach that currently works, and its output, are shown below.

Create a status column which isolates the Handball and Disposal acts (np.select needs import numpy as np; note the choice labels here are given in swapped order, so in the output below chain_hb ends up holding the Disposal count and chain_length the Handball count):
import numpy as np
df['status'] = np.select([df['Stat'].eq('Handball'), df['Stat'].eq('Disposal')], ['chain_length', 'chain_hb'], 'Notimportant')
Find the frequency of occurrence of each of the acts:
s = df.join(df.groupby(['chain_id', 'Stat']).apply(lambda x: pd.get_dummies(x['status'])).fillna(0)).drop(columns=['status', 'Notimportant'])
Use transform to sum and broadcast the totals across each chain:
s[['chain_hb', 'chain_length']] = s.groupby('chain_id')[['chain_hb', 'chain_length']].transform('sum')
Outcome
chain_id Stat chain_hb chain_length
0 Hitout Hitout 0.0 0.0
1 RI-1 Kick 1.0 0.0
2 RI-1 Disposal 1.0 0.0
3 RI-1 Centre Clearance 1.0 0.0
4 RI-1 Tackle 1.0 0.0
5 Hitout Hitout 0.0 0.0
6 Hitout Hitout To Advantage 0.0 0.0
7 RI-7 Free Against 3.0 1.0
8 RI-7 Contested Possession 3.0 1.0
9 RI-7 Free For 3.0 1.0
10 RI-7 Handball 3.0 1.0
11 RI-7 Disposal 3.0 1.0
12 RI-7 Effective Disposal 3.0 1.0
13 RI-7 Stoppage Clearance 3.0 1.0
14 RI-7 Uncontested Possession 3.0 1.0
15 RI-7 Kick 3.0 1.0
16 RI-7 Effective Kick 3.0 1.0
17 RI-7 Disposal 3.0 1.0
18 RI-7 Effective Disposal 3.0 1.0
19 RI-7 Mark 3.0 1.0
20 RI-7 Uncontested Possession 3.0 1.0
21 RI-7 F 50 Mark 3.0 1.0
22 RI-7 Mark On Lead 3.0 1.0
23 RI-7 Kick 3.0 1.0
24 RI-7 Disposal 3.0 1.0
25 RI-7 Shot At Goal 3.0 1.0
26 RI-7 Behind 3.0 1.0
27 CA-27 Kick In 1.0 0.0
28 CA-27 One Percenter 1.0 0.0
29 CA-27 Kick 1.0 0.0
30 CA-27 Effective Kick 1.0 0.0
31 CA-27 Disposal 1.0 0.0
32 CA-27 Effective Disposal 1.0 0.0
33 CA-27 Rebound 50 1.0 0.0
34 CA-27 Spoil 1.0 0.0
35 CA-27 One Percenter 1.0 0.0
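For comparison, here is a minimal vectorized sketch (my addition, not code from the question or the approach above) that produces the same per-chain counts without the per-chain loop or the try/except blocks, using pd.crosstab on the sample df; reindex fills 0 for any stat a chain never records:
import pandas as pd

# One pass over the frame: rows are chain ids, columns are stats, values are counts.
counts = pd.crosstab(df['chain_id'], df['Stat'])

# Keep only the stats of interest; chains missing a stat get 0 instead of a KeyError.
wanted = ['Disposal', 'Kick', 'Handball', 'Mark', 'Goal', 'Behind',
          'Contested Possession', 'Uncontested Possession', 'Tackle']
counts = counts.reindex(columns=wanted, fill_value=0)

# Mirror the question's special case of zeroing out the 'Hitout' chain.
counts.loc['Hitout'] = 0

# Map each per-chain count back onto the rows, as the dict-based version did.
df['chain_length'] = df['chain_id'].map(counts['Disposal'])
df['chain_ki'] = df['chain_id'].map(counts['Kick'])
df['chain_hb'] = df['chain_id'].map(counts['Handball'])
# ...and so on for the remaining stats.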

Related

I was working on movie sentiment analysis, but I'm facing issues in my code related to processing the words.

My data set has 42,000 rows. This is the code I use to clean the text before vectorizing it. The problem is that it has a nested for loop, which I suspect makes it very slow, and I can't use it with more than 1,500 rows. Can someone suggest a better way to do this?
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus = []      # collects the cleaned reviews; df is the 42,000-row dataset with a 'text' column
filtered = []
for i in range(2):   # as posted; the full run would iterate over every row
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)
I've used line_profiler to measure the speed of the code you posted.
The measurement results are as follows.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 #profile
9 def profile_nltk():
10 1 435819.0 435819.0 0.3 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 247.0 247.0 0.0 reviews = df['review'][:4000]
13 1 0.0 0.0 0.0 corpus = []
14 4001 216341.0 54.1 0.1 for i in range(len(reviews)):
15 4000 221885.0 55.5 0.2 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
16 4000 3878.0 1.0 0.0 rev = rev.lower()
17 4000 30209.0 7.6 0.0 rev = rev.split()
18 4000 1097.0 0.3 0.0 filtered = []
19 950808 235589.0 0.2 0.2 for word in rev:
20 946808 115658060.0 122.2 78.2 if word not in stopwords.words("english"):
21 486614 30898223.0 63.5 20.9 word = PorterStemmer().stem(word)
22 486614 149604.0 0.3 0.1 filtered.append(word)
23 4000 11290.0 2.8 0.0 filtered = " ".join(filtered)
24 4000 1429.0 0.4 0.0 corpus.append(filtered)
As @parsa-abbasi pointed out, checking against the stopword list accounts for about 80% of the total time.
The measurement results for the modified script are as follows. The same step now takes about 1/100th of the processing time.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 #profile
9 def profile_nltk():
10 1 441467.0 441467.0 1.4 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 335.0 335.0 0.0 reviews = df['review'][:4000]
13 1 1.0 1.0 0.0 corpus = []
14 1 2696.0 2696.0 0.0 stopwords_set = stopwords.words('english')
15 4001 59013.0 14.7 0.2 for i in range(len(reviews)):
16 4000 186393.0 46.6 0.6 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
17 4000 3657.0 0.9 0.0 rev = rev.lower()
18 4000 27357.0 6.8 0.1 rev = rev.split()
19 4000 999.0 0.2 0.0 filtered = []
20 950808 220673.0 0.2 0.7 for word in rev:
21 # if word not in stopwords.words("english"):
22 946808 1201271.0 1.3 3.8 if word not in stopwords_set:
23 486614 29479712.0 60.6 92.8 word = PorterStemmer().stem(word)
24 486614 141242.0 0.3 0.4 filtered.append(word)
25 4000 10412.0 2.6 0.0 filtered = " ".join(filtered)
26 4000 1329.0 0.3 0.0 corpus.append(filtered)
I hope this is helpful.
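For anyone reproducing these tables, here is a hedged sketch of how such a line_profiler report can be produced from code (profile_nltk is the function from the listings above; everything else is an assumption). The command-line equivalent is decorating the function with @profile and running kernprof -l -v script.py.
from line_profiler import LineProfiler

lp = LineProfiler()
wrapped = lp(profile_nltk)   # wrap the function whose lines should be timed
wrapped()                    # run it once under the profiler
lp.print_stats()             # prints a Line #/Hits/Time table like the ones shown here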
The most time-consuming part of the code is the stopwords check.
The original code fetches the stopword list from the library on every loop iteration.
It is therefore better to get the stopwords once and reuse the same collection in each iteration.
I rewrote the code as follows (the other changes are just for readability):
corpus = []
texts = df['text']
stopwords_set = stopwords.words("english")
stemmer = PorterStemmer()
for i in range(len(texts)):
    rev = re.sub('[^a-zA-Z]', ' ', texts[i])
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)
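One further micro-optimisation, a suggestion of mine rather than part of the answer above: stopwords.words() returns a list, so converting it to a set makes each membership test O(1) instead of a linear scan of the list.
stopwords_set = set(stopwords.words("english"))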

Improving speed when using a for loop for each user groups

Suppose we have the following dataset, with window_num as the desired output:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
The user number is represented by user1. The output is window_num, which is generated from the different_months and org_different_months columns. different_months is the difference between date[n] and date[n+1].
Previously I used groupby.apply to produce window_num, but it became extremely slow as the dataset grew. The code was improved considerably by using shift on the entire dataset to calculate the different_months and org_different_months columns, and by sorting the entire dataset once, as seen below:
data = data.sort_values(by=['user','ContractInceptionDateClean'], ascending=[True,True])
#data['user1'] =data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)-data['ContractInceptionDateClean'] ).dt.days)).fillna(0)
data.different_months[data['different_months'] < 91] =0
data['shift_different_months']=data['different_months'].shift(1)
data['org_different_months']=data['different_months']
data.loc[((data['different_months'] == 0) | (data['shift_different_months'] == 0)),'different_months']=0
data = salesswindow_cal(data,list(data.user.unique()))
The code that I am currently struggling to improve the speed on is shown below:
def salesswindow_cal(data_, users):
    temp = pd.DataFrame()
    for u in range(0, len(users)):
        df = data_[data_['user'] == users[u]]
        df['different_months'].values[0] = df['org_different_months'].values[0]
        df['window_num'] = (df['different_months'].diff() != 0).cumsum()
        temp = pd.concat([df, temp], axis=0)
    return pd.DataFrame(temp)
A rule of thumb is not to loop over the users and extract df = data_[data_['user'] == user] yourself. Instead use groupby:
for u, df in data_.groupby('user'):
    do_some_stuff
Another issue is concatenating data iteratively; collect the pieces and concatenate once at the end:
data_out = []
for user, df in data.groupby('user'):
    sub_data = do_some_stuff(df)
    data_out.append(sub_data)
out = pd.concat(data_out)
In your case, you can write a function and use groupby().apply(), and pandas will concatenate the data for you.
def group_func(df):
    d = df.copy()
    d['different_months'].values[0] = d['org_different_months'].values[0]
    d['window_num'] = d['different_months'].diff().ne(0).cumsum()
    return d
data.groupby('user').apply(group_func)
Update:
Let's try this vectorized approach, which modifies your data in place:
# update the first `different_months`
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']
groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = groups['diff'].cumsum()
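A small optional cleanup after the vectorized version (my addition, using the column names above): drop the intermediate flag column and keep window_num as plain integers.
data['window_num'] = data['window_num'].astype(int)  # cumsum of the boolean flags, cast to int
data = data.drop(columns='diff')                     # 'diff' was only an intermediate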

Creating python data frame from list of dictionary

I have the following data:
sentences = [{'mary':'N', 'jane':'N', 'can':'M', 'see':'V','will':'N'},
{'spot':'N','will':'M','see':'V','mary':'N'},
{'will':'M','jane':'N','spot':'V','mary':'N'},
{'mary':'N','will':'M','pat':'V','spot':'N'}]
I want to create a data frame whose row index is made up of the keys (the words) and whose columns are the values (the tags 'N', 'M' and 'V') from the pairs above; each cell should count how many times that word occurs with that tag.
The expected result should be:
df = pd.DataFrame([(4,0,0),
(2,0,0),
(0,1,0),
(0,0,2),
(1,3,0),
(2,0,1),
(0,0,1)],
index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
columns=('N','M','V'))
Build a DataFrame from the list of dicts, then use value_counts per column via DataFrame.apply, replace missing values, convert to integers and finally transpose with DataFrame.T:
df = pd.DataFrame(sentences)
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
M N V
mary 0 3 1
jane 0 2 0
can 1 0 0
see 0 0 2
will 3 1 0
spot 0 2 1
pat 0 0 1
Or use DataFrame.stack with SeriesGroupBy.value_counts and Series.unstack:
df = df.stack().groupby(level=1).value_counts().unstack(fill_value=0)
print (df)
M N V
can 1 0 0
jane 0 2 0
mary 0 3 1
pat 0 0 1
see 0 0 2
spot 0 2 1
will 3 1 0
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0)
M N V
can 1.0 0.0 0.0
jane 0.0 2.0 0.0
mary 0.0 3.0 1.0
pat 0.0 0.0 1.0
see 0.0 0.0 2.0
spot 0.0 2.0 1.0
will 3.0 1.0 0.0
Cast to int if needed:
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0).astype(int)

Replace a column value with number using pandas

For the following dataset, I can replace column 1 with the numeric value easily.
df['1'].replace(['A', 'B', 'C', 'D'], [0, 1, 2, 3], inplace=True)
But if a column has 3600 or more distinct values, how can I replace them with numeric codes without writing out every single value?
I don't understand how to do that; if anybody has a solution, please share it.
Thanks in advance.
import pandas as pd
df = pd.DataFrame({1:['A','B','C','C','D','A'],
2:[0.6,0.9,5,4,7,1,],
3:[0.3,1,0.7,8,2,4]})
print(df)
1 2 3
0 A 0.6 0.3
1 B 0.9 1.0
2 C 5.0 0.7
3 C 4.0 8.0
4 D 7.0 2.0
5 A 1.0 4.0
np.where makes it easy.
import numpy as np
df[1] = np.where(df[1]=="A", "0",
np.where(df[1]=="B", "1",
np.where(df[1]=="C","2",
np.where(df[1]=="D","3",np.nan))))
print(df)
1 2 3
0 0 0.6 0.3
1 1 0.9 1.0
2 2 5.0 0.7
3 2 4.0 8.0
4 3 7.0 2.0
5 0 1.0 4.0
But if you have a lot of categories, you might want to think about other ways.
import string
upper=list(string.ascii_uppercase)
a=pd.DataFrame({'Alp':upper})
print(a)
Alp
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
.
.
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z
for k in np.arange(0, 26):
    a = a.replace(to_replace=upper[k], value=k)
print(a)
Alp
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
.
.
.
21 21
22 22
23 23
24 24
25 25
If there are many values to replace, you can use factorize:
df[1] = pd.factorize(df[1])[0] + 1
print (df)
1 2 3
0 1 0.6 0.3
1 2 0.9 1.0
2 3 5.0 0.7
3 3 4.0 8.0
4 4 7.0 2.0
5 1 1.0 4.0
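A small usage note of my own: pd.factorize also returns the array of unique labels, so the numeric codes can be mapped back to the original values later (starting again from the original string column):
codes, uniques = pd.factorize(df[1])   # codes are 0-based ints, uniques is array(['A', 'B', 'C', 'D'])
df[1] = codes + 1                      # same result as the answer above
restored = uniques[df[1] - 1]          # recovers 'A', 'B', 'C', ...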
You could do something like
df.loc[df['1'] == 'A','1'] = 0
df.loc[df['1'] == 'B','1'] = 1
### Or
keys = df['1'].unique().tolist()
i = 0
for key in keys:
    df.loc[df['1'] == key, '1'] = i
    i = i + 1
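Equivalently, a sketch of my own rather than part of the answers above: build the mapping once with a dict comprehension and apply it with Series.map, which avoids a Python-level loop even with thousands of categories.
mapping = {value: code for code, value in enumerate(df['1'].unique())}
df['1'] = df['1'].map(mapping)   # first-seen value gets 0, the next 1, and so on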

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame which have the same value on a specified column col. However I'm stuck at assigning a row of the pandas DataFrame.
Here's my code:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape=(rows, len(data.columns))), columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the following rows. I have checked the value of e and confirmed it to be good, so the problem is with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e, but I get "ValueError: Incompatible indexer with Series". Is there an alternate way to do this? It seems very straightforward and I'm baffled as to why it doesn't work.
E.g:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what I get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
This works for me:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i, :] = e
    return res

col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
but groupby with mean as the aggregation gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question I solved it by adding .values to e:
e = data[data[col] == v].mean().to_frame().transpose().values
(.values strips the index, so the assignment no longer tries to align e's index with the target row of res, which is what was producing the NaN rows.)
However, it turns out that what I wanted to do is already built into pandas. Thanks MaxU!
df.groupby(col).mean()
