pandas: Using built-in and customized aggregation function together? - python-3.x

I used the following code:
s=df.groupby('version').agg({'duration':['mean','std'],'ts':['min','max']}).reset_index()
s.columns=s.columns.map("_".join)
This works fine.
Then I tried to add one more aggregation function, quantile(.25):
s=df.groupby('version').agg({'duration':['mean','std', quantile(.25)],'ts':['min','max']}).reset_index()
s.columns=s.columns.map("_".join)
Then I got the following error:
NameError Traceback (most recent call last)
<ipython-input-22-d4857cf7740e> in <module>()
----> 1 s=df.groupby('version').agg({'duration':['mean','std', quantile(.25)],'ts':['min','max']}).reset_index()
2 s.columns=s.columns.map("_".join)
3 s
NameError: name 'quantile' is not defined
What would be the proper way to achieve this? Thanks!

You can wrap the quantile operation in a lambda function:
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'Duration': np.random.rand(10),
                     'Version': [1,1,2,2,3,3,4,4,4,4]})
> df
Duration Version
0 0.843479 1
1 0.028724 1
2 0.605053 2
3 0.548231 2
4 0.223244 3
5 0.883418 3
6 0.772413 4
7 0.100166 4
8 0.865734 4
9 0.865839 4
> df.groupby('Version').agg({'Duration' : [min, max, lambda x: x.quantile(.25)]})
Duration
min max <lambda>
Version
1 0.028724 0.843479 0.232413
2 0.548231 0.605053 0.562437
3 0.223244 0.883418 0.388287
4 0.100166 0.865839 0.604351
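If you want a more readable column label than <lambda>, one option (a sketch using the same demo frame as above) is to pass a named function, since pandas uses the function's __name__ as the column label:
def q25(x):
    # the function name becomes the column label instead of <lambda>
    return x.quantile(.25)

df.groupby('Version').agg({'Duration': [min, max, q25]})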

Related

How to use multiprocessing pool for Pandas apply function

I want to use a multiprocessing pool with pandas DataFrames.
I tried the following, but I get the error below.
Can't I use a pool with a Series?
from multiprocessing import pool
split = np.array_split(split,4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()
The error message is as follows.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str
Try:
import pandas as pd
import numpy as np
import multiprocessing as mp
def test(x):
    return x * 2

if __name__ == '__main__':
    # Demo dataframe
    df = pd.DataFrame({'Test': range(100)})
    # Extract the Series and split into chunks
    split = np.array_split(df['Test'], 4)
    # Parallel processing
    with mp.Pool(4) as pool:
        data = pool.map(test, split)
    # Concatenate results
    out = pd.concat(data)
Output:
>>> df
Test
0 0
1 1
2 2
3 3
4 4
.. ...
95 95
96 96
97 97
98 98
99 99
[100 rows x 1 columns]
>>> out
0 0
1 2
2 4
3 6
4 8
...
95 190
96 192
97 194
98 196
99 198
Name: Test, Length: 100, dtype: int64
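If the real workload has to run element-wise rather than on whole chunks, one variation (a sketch reusing test, split, and mp from the answer above; the worker must be a module-level function so it can be pickled) is to have each worker call Series.apply on its chunk:
def apply_test(chunk):
    # each worker applies test to every element of its own chunk
    return chunk.apply(test)

with mp.Pool(4) as pool:
    out = pd.concat(pool.map(apply_test, split))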

Operands could not be broadcast together with shapes Error for Pandas Dataframe

I have looked through the other answers for operand errors and none seem to fit this example.
The equation itself works, whether I hard-code the X values or import them from the DataFrame.
Using the same equation inside an np.where expression causes the operand error.
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
data= pd.read_csv('miniDF.csv')
df=pd.DataFrame(data, columns=['X','Z'])
df['y']=df['Z']*0.01
df['y']=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X']))
print(df)
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X']),8,9))
print(df)
The values in my DataFrame, the output from the first print(df), and the error are as follows:
X Z y
0 1.4 1 999.999293
1 2.0 2000 380.275104
2 3.0 3 159.114194
3 4.0 4 91.481930
4 5.0 5 69.767368
5 6.0 6 63.030212
6 7.0 70 59.591631
7 8.0 8 56.422723
8 9.0 9 54.673108
9 10.0 10 55.946732
Traceback (most recent call last):
File "/Users/willhutchins/Desktop/minitest.py", line 17, in <module>
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1229, in wrapper
res = na_op(values, other)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1115, in na_op
result = method(y)
ValueError: operands could not be broadcast together with shapes (10,) (3,)
The answer was simply a misplaced parenthesis, as answered here:
https://stackoverflow.com/questions/63046213/complex-curve-equation-giving-error-in-np-where-usage
In the broken version, the closing parenthesis landed after ,8,9, so df['Z'] was compared against a 3-tuple (the long expression, 8, 9) instead of 8 and 9 being passed to np.where as its second and third arguments; hence the shapes (10,) and (3,) in the error. The corrected code is:
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X'])),8,9)
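The same pattern in isolation, with a stand-in threshold and made-up numbers (not the question's equation): the comparison closes before the comma, so 8 and 9 reach np.where as its own arguments.
import numpy as np
import pandas as pd

d = pd.DataFrame({'X': [1.4, 2.0, 3.0], 'Z': [1, 2000, 3]})
threshold = d['X'] * 100                      # stand-in for the long expression
d['y'] = np.where(d['Z'] >= threshold, 8, 9)  # 8 and 9 belong to np.where, not the condition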

Confusion Matrix : RecursionError

I had been trying to replicate an online tutorial for plotting a confusion matrix but got a RecursionError; raising the recursion limit did not help. The code is below:
log = LogisticRegression()
log.fit(x_train,y_train)
pred_log = log.predict(x_train)
confusion_matrix(y_train,pred_log)
The error I got is:
---------------------------------------------------------------------------
RecursionError Traceback (most recent call last)
<ipython-input-57-4b8fbe47e72d> in <module>
----> 1 (confusion_matrix(y_train,pred_log))
<ipython-input-48-92d5242f8580> in confusion_matrix(test_data, pred_data)
1 def confusion_matrix(test_data,pred_data):
----> 2 c_mat = confusion_matrix(test_data,pred_data)
3 return pd.DataFrame(c_mat)
... last 1 frames repeated, from the frame below ...
<ipython-input-48-92d5242f8580> in confusion_matrix(test_data, pred_data)
1 def confusion_matrix(test_data,pred_data):
----> 2 c_mat = confusion_matrix(test_data,pred_data)
3 return pd.DataFrame(c_mat)
RecursionError: maximum recursion depth exceeded
The shape of the train and test data is as below
x_train.shape,y_train.shape,x_test.shape,y_test.shape
# ((712, 7), (712,), (179, 7), (179,))
I tried sys.setrecursionlimit(1500), but it made no difference.
It looks like you are recursively calling the same function: your wrapper confusion_matrix shadows the library function (presumably sklearn.metrics.confusion_matrix) and calls itself. Try renaming the outer function.
def confusion_matrix(test_data,pred_data):
    c_mat = confusion_matrix(test_data,pred_data)
    return pd.DataFrame(c_mat)
To
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import pandas as pd

def confusion_matrix_pd_convertor(test_data,pred_data):
    c_mat = confusion_matrix(test_data,pred_data)  # now resolves to sklearn's function
    return pd.DataFrame(c_mat)

log = LogisticRegression()
log.fit(x_train,y_train)
pred_log = log.predict(x_train)
confusion_matrix_pd_convertor(y_train,pred_log)
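Since the original goal was plotting the matrix, a possible follow-up (a sketch assuming seaborn and matplotlib are installed) is to render the labelled DataFrame as a heatmap:
import seaborn as sns
import matplotlib.pyplot as plt

cm_df = confusion_matrix_pd_convertor(y_train, pred_log)
sns.heatmap(cm_df, annot=True, fmt='d')  # write each cell's count into the plot
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()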

Python pandas search in DataFrame MultiIndex

This is what I do:
import pandas as pd
t = pd.DataFrame(data={'i1': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i2': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'x': [1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)
t.index.values.searchsorted( (1,1) )
This is the error I get:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: '<' not supported between instances of 'tuple' and 'int'
Please help me understand what I am doing wrong.
The index values are tuples: type(t.index.values[0]) correctly gives <class 'tuple'>, and I pass a tuple to searchsorted. So where does the tuple-to-int comparison come from?
>>> print(t)
x
i1 i2
0 0 1.0
1 2.0
2 3.0
3 4.0
1 0 5.0
1 6.0
2 7.0
3 8.0
2 0 9.0
1 10.0
2 11.0
3 12.0
searchsorted doesn't work with tuples. There's an open issue on GitHub for it: "Multiarray searchsorted fails".
On that issue, one of the participants suggests using get_indexer.
With your code:
t.index.get_indexer([(1,1)])[0]
# outputs:
5
I have found a solution:
>>> t.index.get_loc( (1,1) )
5
This solution is ~200 times faster than using t.index.get_indexer:
>>> import time
>>> time.clock()
168.56
>>> for i in range(10000): a = t.index.get_indexer([(1,1)])[0]
...
>>> time.clock()
176.76
>>> (176.76 - 168.56) / 10000
0.0008199999999999989 # 820e-6 sec per call
>>> time.clock()
176.76
>>> for i in range(1000000): a = t.index.get_loc( (1,1) )
...
>>> time.clock()
180.94
>>> (180.94-176.76)/1000000
4.1800000000000066e-06 # 4.2e-6 sec per call
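As a side note, time.clock() was removed in Python 3.8; a sketch of the same comparison with timeit (reusing t from above) avoids the manual bookkeeping:
import timeit

n = 10000
per_indexer = timeit.timeit(lambda: t.index.get_indexer([(1, 1)])[0], number=n) / n
per_get_loc = timeit.timeit(lambda: t.index.get_loc((1, 1)), number=n) / n
print(per_indexer, per_get_loc)  # seconds per call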

Preprocessing string data in pandas dataframe

I have a user review dataset. I have loaded it, and now I want to preprocess the reviews (remove stopwords and punctuation, convert to lowercase, strip salutations, etc.) before fitting a classifier, but I am getting an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in stopwords.words('english')]
    return clean

dataset['reviewText']=dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I am getting this error:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
Please suggest corrections in this function for my data or suggest a new function for data cleaning.
Here is my data:
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...
Starting from your df above, to convert the reviewText column to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation (note that \w keeps letters, digits, and underscores, so numbers are retained):
import re
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : " ".join(re.findall(r'[\w]+', x)))
To remove stopwords, you can either install the stop_words package or create your own stopword list and use it with a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')

def remove_stopWords(s):
    '''For removing stop words'''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s

df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))
