Improving speed when using a for loop for each user group - python-3.x

Suppose we have following dataset with the output window_num:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
The user number is represented by user1. The output is window_num, which is generated from the different_months and org_different_months columns. The different_months column is the difference in days between date[n] and date[n+1].
Previously I was using groupby.apply to produce window_num, but it became extremely slow as the dataset grew. The code was sped up considerably by using shift on the entire dataset to calculate the different_months and org_different_months columns, and by sorting the entire dataset once, as seen below:
data = data.sort_values(by=['user','ContractInceptionDateClean'], ascending=[True,True])
#data['user1'] =data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)-data['ContractInceptionDateClean'] ).dt.days)).fillna(0)
data.loc[data['different_months'] < 91, 'different_months'] = 0
data['shift_different_months']=data['different_months'].shift(1)
data['org_different_months']=data['different_months']
data.loc[((data['different_months'] == 0) | (data['shift_different_months'] == 0)),'different_months']=0
data = salesswindow_cal(data,list(data.user.unique()))
The code that I am currently struggling to improve the speed on is shown below:
def salesswindow_cal(data_, users):
    temp = pd.DataFrame()
    for u in range(0, len(users)):
        df = data_[data_['user'] == users[u]]
        df['different_months'].values[0] = df['org_different_months'].values[0]
        df['window_num'] = (df['different_months'].diff() != 0).cumsum()
        temp = pd.concat([df, temp], axis=0)
    return pd.DataFrame(temp)

A rule of thumb is not to loop through the users and extract df = data_[data_['user'] == user]. Instead, use groupby:
for u, df in data_.groupby('user'):
    do_some_stuff
Another rule is not to concatenate data iteratively; collect the pieces in a list and concatenate once at the end:
data_out = []
for user, df in data.groupby('user'):
    sub_data = do_some_stuff(df)
    data_out.append(sub_data)
out = pd.concat(data_out)
In your case, you can write a function and use groupby().apply(), and pandas will concatenate the data for you.
def group_func(df):
    d = df.copy()
    d['different_months'].values[0] = d['org_different_months'].values[0]
    d['window_num'] = d['different_months'].diff().ne(0).cumsum()
    return d

data.groupby('user').apply(group_func)
Update:
Let's try this vectorized approach, which modifies your data in place:
# update the first `different_months` of each user
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']

groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = data.groupby('user')['diff'].cumsum()
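To see the idea end to end, here is a minimal, self-contained sketch on a hypothetical toy frame (the values are made up; only the column names follow the question):
import pandas as pd

toy = pd.DataFrame({
    'user': [1, 1, 1, 1, 2],
    'different_months': [0.0, 0.0, 0.0, 300.0, 0.0],
    'org_different_months': [0.0, 0.0, 0.0, 300.0, 120.0],
})

# the first row of each user keeps its original value
mask = ~toy['user'].duplicated()
toy.loc[mask, 'different_months'] = toy.loc[mask, 'org_different_months']

# a new window starts whenever different_months changes within a user
toy['diff'] = toy.groupby('user')['different_months'].diff().ne(0)
toy['window_num'] = toy.groupby('user')['diff'].cumsum()
# user 1 gets window_num 1, 1, 1, 2; user 2 starts again at 1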

Related

I was working on a movie sentiment analysis, but I'm facing issues in my code related to processing words

My data set has 42,000 rows. This is the code I used to clean my text before vectorizing it. However, it has a nested for loop, which I guess makes it very slow, and I'm not able to use it with more than 1500 rows. Can someone please suggest a better way to do this?
corpus = []
filtered = []
for i in range(2):
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)
I've used line_profiler to measure the speed of the code you posted.
The measurement results are as follows.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 435819.0 435819.0 0.3 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 247.0 247.0 0.0 reviews = df['review'][:4000]
13 1 0.0 0.0 0.0 corpus = []
14 4001 216341.0 54.1 0.1 for i in range(len(reviews)):
15 4000 221885.0 55.5 0.2 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
16 4000 3878.0 1.0 0.0 rev = rev.lower()
17 4000 30209.0 7.6 0.0 rev = rev.split()
18 4000 1097.0 0.3 0.0 filtered = []
19 950808 235589.0 0.2 0.2 for word in rev:
20 946808 115658060.0 122.2 78.2 if word not in stopwords.words("english"):
21 486614 30898223.0 63.5 20.9 word = PorterStemmer().stem(word)
22 486614 149604.0 0.3 0.1 filtered.append(word)
23 4000 11290.0 2.8 0.0 filtered = " ".join(filtered)
24 4000 1429.0 0.4 0.0 corpus.append(filtered)
As @parsa-abbasi pointed out, the stopword check accounts for about 80% of the total time.
The measurement results for the modified script are as follows. The same step has been reduced to about 1/100th of its processing time.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 441467.0 441467.0 1.4 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 335.0 335.0 0.0 reviews = df['review'][:4000]
13 1 1.0 1.0 0.0 corpus = []
14 1 2696.0 2696.0 0.0 stopwords_set = stopwords.words('english')
15 4001 59013.0 14.7 0.2 for i in range(len(reviews)):
16 4000 186393.0 46.6 0.6 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
17 4000 3657.0 0.9 0.0 rev = rev.lower()
18 4000 27357.0 6.8 0.1 rev = rev.split()
19 4000 999.0 0.2 0.0 filtered = []
20 950808 220673.0 0.2 0.7 for word in rev:
21 # if word not in stopwords.words("english"):
22 946808 1201271.0 1.3 3.8 if word not in stopwords_set:
23 486614 29479712.0 60.6 92.8 word = PorterStemmer().stem(word)
24 486614 141242.0 0.3 0.4 filtered.append(word)
25 4000 10412.0 2.6 0.0 filtered = " ".join(filtered)
26 4000 1329.0 0.3 0.0 corpus.append(filtered)
I hope this is helpful.
The most time-consuming part of the code is the stopwords check.
It calls the library to fetch the list of stopwords on every iteration of the inner loop.
It is therefore better to fetch the stopwords once and reuse the same collection at each iteration.
I rewrote the code as follows (the other differences are made just for the sake of readability):
corpus = []
texts = df['text']
stopwords_set = stopwords.words("english")
stemmer = PorterStemmer()
for i in range(len(texts)):
    rev = re.sub('[^a-zA-Z]', ' ', texts[i])
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)
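One further optional tweak (my addition, not part of the original answer): stopwords.words("english") returns a plain list, so every in test scans it linearly. Converting it to a set makes each membership check effectively constant-time:
stopwords_set = set(stopwords.words("english"))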

Creating a Python data frame from a list of dictionaries

I have the following data:
sentences = [{'mary':'N', 'jane':'N', 'can':'M', 'see':'V','will':'N'},
{'spot':'N','will':'M','see':'V','mary':'N'},
{'will':'M','jane':'N','spot':'V','mary':'N'},
{'mary':'N','will':'M','pat':'V','spot':'N'}]
I want to create a data frame where each key (from the pairs above) becomes a row label and each value (N/M/V) becomes a column name. The cells of the data frame should count how many times each key occurs with each value.
The expected result should be:
df = pd.DataFrame([(4,0,0),
(2,0,0),
(0,1,0),
(0,0,2),
(1,3,0),
(2,0,1),
(0,0,1)],
index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
columns=('N','M','V'))
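The answers below assume the list of dictionaries has first been loaded into a frame, presumably along these lines (an assumption on my part, since that step is not shown):
import pandas as pd

# rows are sentences, columns are words, values are the N/M/V tags (NaN where a word is absent)
df = pd.DataFrame(sentences)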
Use value_counts per column via DataFrame.apply, replace missing values, convert to integers and finally transpose with DataFrame.T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
M N V
mary 0 3 1
jane 0 2 0
can 1 0 0
see 0 0 2
will 3 1 0
spot 0 2 1
pat 0 0 1
Or use DataFrame.stack with SeriesGroupBy.value_counts and Series.unstack:
df = df.stack().groupby(level=1).value_counts().unstack(fill_value=0)
print (df)
M N V
can 1 0 0
jane 0 2 0
mary 0 3 1
pat 0 0 1
see 0 0 2
spot 0 2 1
will 3 1 0
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0)
M N V
can 1.0 0.0 0.0
jane 0.0 2.0 0.0
mary 0.0 3.0 1.0
pat 0.0 0.0 1.0
see 0.0 0.0 2.0
spot 0.0 2.0 1.0
will 3.0 1.0 0.0
Cast to int if needed:
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0).astype(int)

Pandas deposits and withdrawals over a time period with n-number of people

I'm trying to dynamically build a format in which to display the number of deposits compared to withdrawals in a timeline chart. Whenever a deposit is made the graph goes up, and whenever a withdrawal is made the graph goes down.
This is how far I've gotten:
df.head()
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Peter 2019-03-08 2019-03-19
Peter 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Peter 2019-03-14 2019-04-05
Here is the data manipulation that shows the net movements for one person, Peter.
x = pd.Series(df.groupby('Deposits').size())
y = pd.Series(df.groupby('Withdrawals').size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(Peter=balance.net_mov.cumsum())
print(balance)
net_mov Peter
2019-03-07 1 1
2019-03-08 1 2
2019-03-11 -1 1
2019-03-12 2 3
2019-03-14 1 4
This works perfectly fine, and this is the format that I want. Now let's say I want to extend this and not just list Peter's deposits and withdrawals, but add an arbitrary number of people. Let's assume that my dataframe looks like this:
df2.head()
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Anna 2019-03-08 2019-03-19
Anna 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Simon 2019-03-14 2019-04-05
The format I'm aiming for is shown below. I don't know beforehand which names or how many columns there will be, so I can't hardcode names or the number of columns. It has to be generated dynamically.
net_mov1 Peter net_mov2 Anna net_mov3 Simon
2019-03-07 1 1 1 1 2 2
2019-03-08 1 2 2 3 -1 1
2019-03-11 -1 1 0 3 2 3
2019-03-12 2 3 -2 1 4 7
2019-03-14 1 4 3 4 -1 6
UPDATE:
First off, thanks for the help. I'm getting closer to my goal. This is the progress:
x = pd.Series(df.groupby(['Created', 'name']).size())
y = pd.Series(df.groupby(['Finished', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())
balance_byname = balance.groupby('name')
balance_byname.get_group("Peter")
Output:
net_mov balance
name Created Finished
Peter 2017-07-03 2017-07-06 1 1
2017-07-10 1 2
2017-07-13 0 2
2017-07-14 1 3
... ... ...
2020-07-29 2020-07-15 0 4581
2020-07-17 0 4581
2020-07-20 0 4581
2020-07-21 -1 4580
[399750 rows x 2 columns]
This is of course far too many rows; the dataset I'm working with has around 2,500 rows.
I've tried to unstack it, but that creates problems of its own.
Given df:
name Deposits Withdrawals
Peter 2019-03-07 2019-03-11
Anna 2019-03-08 2019-03-19
Anna 2019-03-12 2019-05-22
Peter 2019-03-12 2019-10-31
Simon 2019-03-14 2019-04-05
You can melt the dataframe, encode deposits as 1 and withdrawals as -1, and then pivot:
df = pd.DataFrame(\
{'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
'Deposits': {0: '2019-03-07',
1: '2019-03-08',
2: '2019-03-12',
3: '2019-03-12',
4: '2019-03-14'},
'Withdrawals': {0: '2019-03-11',
1: '2019-03-19',
2: '2019-05-22',
3: '2019-10-31',
4: '2019-04-05'}})
df2 = (
    df.melt('name')
      .assign(variable=lambda x: x.variable.map({'Deposits': 1, 'Withdrawals': -1}))
      # use pivot_table with a sum aggregate rather than .pivot('value', 'name', 'variable'),
      # because there may be duplicate (date, name) pairs in the data
      .pivot_table('variable', 'value', 'name', aggfunc='sum').fillna(0)
      .rename(columns=lambda c: f'{c} netmov')
)
The above gives the net change in balance:
name Anna netmov Peter netmov Simon netmov
value
2019-03-07 0.0 1.0 0.0
2019-03-08 1.0 0.0 0.0
2019-03-11 0.0 -1.0 0.0
2019-03-12 1.0 1.0 0.0
2019-03-14 0.0 0.0 1.0
2019-03-19 -1.0 0.0 0.0
2019-04-05 0.0 0.0 -1.0
2019-05-22 -1.0 0.0 0.0
2019-10-31 0.0 -1.0 0.0
Finally, calculate the balance with a cumulative sum and concatenate it with the previously calculated net changes:
df2 = pd.concat([df2,df2.cumsum().rename(columns = lambda c: c.split()[0] + ' balance')], axis = 1)\
.sort_index(axis=1)
result:
name Anna balance Anna netmov ... Simon balance Simon netmov
value ...
2019-03-07 0.0 0.0 ... 0.0 0.0
2019-03-08 1.0 1.0 ... 0.0 0.0
2019-03-11 1.0 0.0 ... 0.0 0.0
2019-03-12 2.0 1.0 ... 0.0 0.0
2019-03-14 2.0 0.0 ... 1.0 1.0
2019-03-19 1.0 -1.0 ... 1.0 0.0
2019-04-05 1.0 0.0 ... 0.0 -1.0
2019-05-22 0.0 -1.0 ... 0.0 0.0
2019-10-31 0.0 0.0 ... 0.0 0.0
[9 rows x 6 columns]
Try making use of a pandas MultiIndex. This is almost the same code copied from your question, but with two changes: (1) including the column name in the groupby argument, and (2) adding a .groupby('name') call in the last line.
With the code:
x = pd.Series(df.groupby(['Deposits', 'name']).size())
y = pd.Series(df.groupby(['Withdrawals', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())
The groupby in the last line effectively tells pandas to treat each name as a separate dataframe before applying cumsum, so movements are kept within each account.
Now you can keep it in this shape, with only two columns and the name as a second level of the row MultiIndex. You can create a groupby object by calling
balance_byname = balance.groupby('name') # notice there is no aggregation nor transformation
to be used whenever you need to access a single account with .get_group(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.get_group.html#pandas.core.groupby.GroupBy.get_group
Or, you can also add a new line at the end:
balance = balance.unstack('name')
which will give a shape similar to the one you ask for in the expected output. This will, however, possibly create a number of NaN values by combining every date with every name. That can drastically increase memory usage if there are many dates and many names, with each name having movements on only a few dates.
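If the wide layout is worth those NaNs, one way to fill them afterwards might look like the sketch below. This is my own addition, assuming balance ends up indexed by (date, name) with the net_mov and balance columns from above:
wide = balance.unstack('name')
# net movements on dates with no activity are simply zero;
# running balances carry forward from the previous known value
wide = pd.concat(
    [wide['net_mov'].fillna(0), wide['balance'].ffill().fillna(0)],
    axis=1,
    keys=['net_mov', 'balance'],
)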

Python/Pandas Subtract Only if Value is not 0

I'm starting with data that looks something like this, but with a lot more rows:
Location Sample a b c d e f g h i
1 w 14.6 0 0 0 0 0 0 0 16.8
2 x 0 13.6 0 0 0 0 0 0 16.5
3 y 0 0 15.5 0 0 0 0 0 16.9
4 z 0 0 0 0 14.3 0 0 0 15.7
...
The data is indexed by the first two columns. I need to subtract the values in column i from each of the values in columns a - h, adding a new column to the right of the data frame for each original column. However, if the original value is zero, I want it to stay zero instead of having i subtracted from it. For example, if my code worked I would have the following columns added to the right of the data frame:
Location Sample ... a2 b2 c2 d2 e2 f2 g2 h2
1 w ... -2.2 0 0 0 0 0 0 0
2 x ... 0 -2.9 0 0 0 0 0 0
3 y ... 0 0 -1.4 0 0 0 0 0
4 z ... 0 0 0 0 -1.4 0 0 0
...
I'm trying to use where in pandas to subtract the value in column i only if the value in the current column is not zero, using the following code:
import pandas as pd

normalizer = "i"
columns = list(df.columns.values)
for column in columns:
    if column == normalizer:
        continue
    newcol = column + "2"
    df[newcol] = df.where(df[column] == 0,
                          df[column] - df[normalizer], axis=0)
I'm using a for loop because the number of columns will not always be the same, and the column being subtracted will have a different name in different data sets.
I'm getting this error: "ValueError: Wrong number of items passed 9, placement implies 1".
I think the subtraction is causing the issue, but I can't figure out how to change it to make it work. Any assistance would be greatly appreciated.
Thanks in advance.
Method 1 (pretty fast: roughly 3 times faster than method 2)
1. Select the columns that are relevant.
2. Do the subtraction.
3. Multiply elementwise by a 0/1 matrix constructed before the subtraction: each element of (subdf > 0) is 0 if the value was originally 0 and 1 otherwise.
ith_col = df["i"]
subdf = df.iloc[:, 2:-1] # a - h columns
df_temp = subdf.sub(ith_col, axis=0).multiply(subdf > 0).add(0)
df_temp.columns = ['a2', 'b2', 'c2', 'd2', 'e2', 'f2', 'g2', 'h2'] # rename columns
df_desired = pd.concat([df, df_temp], axis=1)
Note that in this method the zeros come out as negative zero (-0.0), hence the extra add(0) at the end. Yes, a zero can be negative. :P
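A quick illustration of that negative-zero point (my own addition):
print(-2.2 * 0)      # -0.0
print(-2.2 * 0 + 0)  # 0.0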
Method 2 (more readable)
1. Find the greater-than-0 part with a condition.
2. Select the columns that are relevant.
3. Subtract.
4. Fill in 0.
ith_col = df["i"]
df[df > 0].iloc[:,2:-1].sub(ith_col, axis=0).fillna(0)
The second method is pretty similar to @Wen's answer. Credits to him :P
Speed comparison of two methods (tested on Python 3 and pandas 0.20)
%timeit subdf.sub(ith_col, axis=0).multiply(subdf > 0).add(0)
688 µs ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df[df > 0].iloc[:,2:-1].sub(ith_col, axis=0).fillna(0)
2.97 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reference:
DataFrame.multiply performs elementwise multiplication with another data frame.
Using mask + fillna
df.iloc[:, 2:-1] = df.iloc[:, 2:-1].mask(df.iloc[:, 2:-1] == 0).sub(df['i'], axis=0).fillna(0)
df
Out[116]:
Location Sample a b c d e f g h i
0 1 w -2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.8
1 2 x 0.0 -2.9 0.0 0.0 0.0 0.0 0.0 0.0 16.5
2 3 y 0.0 0.0 -1.4 0.0 0.0 0.0 0.0 0.0 16.9
3 4 z 0.0 0.0 0.0 0.0 -1.4 0.0 0.0 0.0 15.7
Update
normalizer = ['i', 'Location', 'Sample']
cols = ~df.columns.isin(normalizer)
df.loc[:, cols] = df.loc[:, cols].mask(df.loc[:, cols] == 0).sub(df['i'], axis=0).fillna(0)
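For completeness, here is a sketch of the column-wise DataFrame.where pattern the original question was reaching for. It is my own addition and assumes, as the question states, that Location and Sample are in the index and the normalizer column is named i:
# columns a-h: every data column except the normalizer 'i'
value_cols = [c for c in df.columns if c != 'i']

# where() keeps values where the condition is True (the zeros)
# and takes the subtracted values everywhere else
subtracted = df[value_cols].where(df[value_cols] == 0,
                                  df[value_cols].sub(df['i'], axis=0))

# append the results as new columns with a '2' suffix
df = df.join(subtracted.add_suffix('2'))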

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame that have the same value in a specified column col. However, I'm stuck on assigning a row of the pandas DataFrame.
Here's my code:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape=(rows, len(data.columns))), columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the next rows. I have checked the value of e and confirmed it to be good, so there is a problem with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e, but I get "ValueError: Incompatible indexer with Series". Is there an alternative way to do this? It seems very straightforward and I'm baffled as to why it doesn't work...
E.g:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what i get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
This works for me:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i, :] = e
    return res
col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
but groupby with aggregate mean gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question I solved it by adding a .values to e.
e = data[data[col] == v].mean().to_frame().transpose().values
However, it turns out that what I wanted to do is already built into pandas. Thanks MaxU!
df.groupby(col).mean()
