How to use multiprocessing pool for Pandas apply function - python-3.x

I want to use a multiprocessing Pool with pandas DataFrames.
I tried the following, but the error below occurs.
Can't I use a Pool with a Series?
from multiprocessing import Pool
split = np.array_split(df, 4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()
The error message is as follows.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str

Try:
import pandas as pd
import numpy as np
import multiprocessing as mp
def test(x):
    return x * 2

if __name__ == '__main__':
    # Demo dataframe
    df = pd.DataFrame({'Test': range(100)})
    # Extract the Series and split into chunks
    split = np.array_split(df['Test'], 4)
    # Parallel processing
    with mp.Pool(4) as pool:
        data = pool.map(test, split)
    # Concatenate results
    out = pd.concat(data)
Output:
>>> df
Test
0 0
1 1
2 2
3 3
4 4
.. ...
95 95
96 96
97 97
98 98
99 99
[100 rows x 1 columns]
>>> out
0 0
1 2
2 4
3 6
4 8
...
95 190
96 192
97 194
98 196
99 198
Name: Test, Length: 100, dtype: int64
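If the function you want to run is not vectorised (so it cannot be applied to a whole chunk at once the way test can), each worker can call .apply on its own chunk instead. A minimal sketch reusing the names above; the apply_chunk helper is introduced here only for illustration:
def apply_chunk(chunk):
    # Each worker applies test element-wise to its own chunk of the Series
    return chunk.apply(test)

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        out = pd.concat(pool.map(apply_chunk, split))
As in the answer above, the worker function must be defined at module level so it can be pickled, and on Windows the Pool must be created under the if __name__ == '__main__': guard.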

Related

TypeError: 'float' object cannot be interpreted as an integer on linspace

TypeError Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
-->1 plot_freq(signal, sample_rate)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
2 def plot_freq(signal, sample_rate, fft_size=512):
3 xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4 freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
5 xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
6 plt.figure(figsize=(20, 5))
File <__array_function__ internals>:5, in linspace(*args, **kwargs)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
23 #array_function_dispatch(_linspace_dispatcher)
24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
25 axis=0):
26 """
27 Return evenly spaced numbers over a specified interval.
28
(...)
118
119 """
--> 120 num = operator.index(num)
121 if num < 0:
122 raise ValueError("Number of samples, %s, must be non-negative." % num)
TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
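The error occurs because np.linspace requires its num argument to be an integer, while in Python 3 fft_size/2 + 1 evaluates to a float. Integer division (or an explicit int()) fixes it; a minimal sketch, with sample_rate assumed only for illustration:
import numpy as np

fft_size = 512
sample_rate = 16000  # assumed value for illustration
# fft_size // 2 + 1 is an int (257), so linspace no longer raises TypeError
freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)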

Divide pandas data frame into test and train based on unique ID

I want to split it into two data frames (train and test) using the values in the id column. The split should be such that the first data frame has 70% of the (unique) ids and the second data frame has 30% of the ids. The ids should be randomly split.
I have multiple values corresponding to one id.
This is the script I was trying:
Training_data, Test_data = sklearn.model_selection.train_test_split(data, data['ID_sample'].unique(), train_size=0.30, test_size=0.70, random_state=5)
I sorted the issue out in the following way:
samplelist = data["ID_sample"].unique()
training_samp, test_samp = sklearn.model_selection.train_test_split(samplelist, train_size=0.7, test_size=0.3, random_state=5, shuffle=True)
training_data = data[data['ID_sample'].isin(training_samp)]
test_data = data[data['ID_sample'].isin(test_samp)]
I am not an sklearn expert, but I know a little about it, and this is a question newcomers have been asking ever since it came into existence.
Anyway, here is how you can solve it: import train_test_split from sklearn.model_selection and apply it. I have just created some random data and applied the same approach.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(100, 2))
>>> df
0 1
0 -1.214487 0.455726
1 -0.898623 0.268778
2 0.262315 -0.009964
3 0.612664 0.786531
4 1.249646 -1.020366
.. ... ...
95 -0.171218 1.083018
96 0.122685 -2.214143
97 -1.420504 0.469372
98 0.061177 0.465881
99 -0.262667 -0.406031
[100 rows x 2 columns]
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.3)
Here is your first dataframe, train:
>>> train
0 1
26 -2.651343 -0.864565
17 0.106959 -0.763388
78 -0.398269 -0.501073
25 1.452795 1.290227
47 0.158705 -1.123697
.. ... ...
29 -1.909144 -0.732514
7 0.641331 -1.336896
43 0.769139 2.816528
59 -0.683185 0.442875
11 -0.543988 -0.183677
[70 rows x 2 columns]
Here is the second dataframe, test:
>>> test
0 1
30 -1.562427 -1.448936
24 0.638780 1.868500
70 -0.572035 1.615093
72 0.660455 -0.331350
82 0.295644 -0.403386
22 0.942676 -0.814718
15 -0.208757 -0.112564
45 1.069752 -1.894040
18 0.600265 0.599571
93 -0.853163 1.646843
91 -1.172471 -1.488513
10 0.728550 1.637609
36 -0.040357 2.050128
4 1.249646 -1.020366
60 -0.907925 -0.290945
34 0.029384 0.452658
38 1.566204 -1.171910
33 -1.009491 0.105400
62 0.930651 -0.124938
42 0.401900 -0.472175
80 1.266980 -0.221378
95 -0.171218 1.083018
74 -0.160058 -1.383118
28 1.257940 0.604513
87 -0.136468 -0.109718
27 1.909935 -0.712136
81 -1.449828 -1.823526
61 0.176301 -0.885127
53 -0.593061 1.547997
57 -0.527212 0.781028
In your case it should work as below; note that you don't need to define train_size if you are defining test_size, or vice versa.
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3)
OR
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3, random_state=5)
OR
This returns arrays of the unique ids ...
>>> train, test = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
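That last call only splits the unique ids, so the ids still have to be mapped back to the rows of the original dataframe. A minimal sketch of that final step, mirroring the isin approach shown above (train_ids and test_ids are illustrative names):
train_ids, test_ids = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
Training_data = data[data['ID_sample'].isin(train_ids)]
Test_data = data[data['ID_sample'].isin(test_ids)]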

Create linear model to check correlation tokenize error

I have data like the sample below, which has 4 continuous columns [x0 to x3] and a binary column y. y has two values, 1.0 and 0.0. I'm trying to check for correlation between the binary column y and one of the continuous columns, x0, using the CatConCor function below, but I'm getting the error message below. The function creates a linear regression model and calculates the p-value for the residuals with and without the categorical variable. If anyone can point out the issue or how to fix it, it would be very much appreciated.
Data:
x_r x0 x1 x2 x3 y
0 0 0.466726 0.030126 0.998330 0.892770 0.0
1 1 0.173168 0.525810 -0.079341 -0.112151 0.0
2 2 -0.854467 0.770712 0.929614 -0.224779 0.0
3 3 -0.370574 0.568183 -0.928269 0.843253 0.0
4 4 -0.659431 -0.948491 -0.091534 0.706157 0.0
Code:
import numpy as np
import pandas as pd
from time import time
import scipy.stats as stats
from IPython.display import display # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
###########################################
# Suppress matplotlib user warnings
# Necessary for newer version of matplotlib
import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")
#
# Display inline matplotlib plots with IPython
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
###########################################
import matplotlib.pyplot as plt
import matplotlib.cm as cm
# correlation between categorical variable and continuous variable
def CatConCor(df, catVar, conVar):
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    # subsetting data for one categorical column and one continuous column
    data2 = df.copy()[[catVar, conVar]]
    data2[catVar] = data2[catVar].astype('category')
    mod = ols(conVar + '~' + catVar, data=data2).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    if aov_table['PR(>F)'][0] < 0.05:
        print('Correlated p=' + str(aov_table['PR(>F)'][0]))
    else:
        print('Uncorrelated p=' + str(aov_table['PR(>F)'][0]))

# checking for correlation between categorical and continuous variables
CatConCor(df=train_df, catVar='y', conVar='x0')
Error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-6-80f83b8c8e14> in <module>()
1 # checking for correlation between categorical and continuous variables
2
----> 3 CatConCor(df=train_df,catVar='y',conVar='x0')
<ipython-input-2-35404ba1d697> in CatConCor(df, catVar, conVar)
103
104 mod = ols(conVar+'~'+catVar,
--> 105 data=data2).fit()
106
107 aov_table = sm.stats.anova_lm(mod, typ=2)
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
153
154 tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155 missing=missing)
156 ((endog, exog), missing_idx, design_info) = tmp
157
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
63 if data_util._is_using_pandas(Y, None):
64 result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65 NA_action=na_action)
66 else:
67 result = dmatrices(formula, Y, depth, return_type='dataframe',
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310 NA_action, return_type)
311 if lhs.shape[1] == 0:
312 raise PatsyError("model is missing required outcome variables")
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
60 "ascii-only, or else upgrade to Python 3.")
61 if isinstance(formula_like, str):
---> 62 formula_like = ModelDesc.from_formula(formula_like)
63 # fallthrough
64 if isinstance(formula_like, ModelDesc):
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/desc.py in from_formula(cls, tree_or_string)
162 tree = tree_or_string
163 else:
--> 164 tree = parse_formula(tree_or_string)
165 value = Evaluator().eval(tree, require_evalexpr=False)
166 assert isinstance(value, cls)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in parse_formula(code, extra_operators)
146 tree = infix_parse(_tokenize_formula(code, operator_strings),
147 operators,
--> 148 _atomic_token_types)
149 if not isinstance(tree, ParseNode) or tree.type != "~":
150 tree = ParseNode("~", None, [tree], tree.origin)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/infix_parser.py in infix_parse(tokens, operators, atomic_types, trace)
208
209 want_noun = True
--> 210 for token in token_source:
211 if c.trace:
212 print("Reading next token (want_noun=%r)" % (want_noun,))
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _tokenize_formula(code, operator_strings)
92 else:
93 it.push_back((pytype, token_string, origin))
---> 94 yield _read_python_expr(it, end_tokens)
95
96 def test__tokenize_formula():
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _read_python_expr(it, end_tokens)
42 origins = []
43 bracket_level = 0
---> 44 for pytype, token_string, origin in it:
45 assert bracket_level >= 0
46 if bracket_level == 0 and token_string in end_tokens:
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/util.py in next(self)
330 else:
331 # May raise StopIteration
--> 332 return six.advance_iterator(self._it)
333 __next__ = next
334
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/tokens.py in python_tokenize(code)
33 break
34 origin = Origin(code, start, end)
---> 35 assert pytype not in (tokenize.NL, tokenize.NEWLINE)
36 if pytype == tokenize.ERRORTOKEN:
37 raise PatsyError("error tokenizing input "
AssertionError:
Upgrading patsy to 0.5.1 fixed the issue. I found the tip here:
https://github.com/statsmodels/statsmodels/issues/5343
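After upgrading (for example with pip install --upgrade patsy), you can confirm the installed version from Python with a minimal check:
import patsy
print(patsy.__version__)  # should print 0.5.1 or later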

Applying function to pandas dataframe

I have a pandas dataframe called 'tourdata' consisting of 676k rows of data. Two of the columns are latitude and longitude.
Using the reverse_geocode package, I want to convert these coordinates to country data.
When I call:
import reverse_geocode as rg
tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
I get the error:
ValueErrorTraceback (most recent call last)
in ()
1 coordinates = (tourdata['latitude'],tourdata['longitude']),
----> 2 tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in search(coordinates)
114 """
115 gd = GeocodeData()
--> 116 return gd.query(coordinates)
117
118
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in query(self, coordinates)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
---> 48 raise e
49 else:
50 results = [self.locations[index] for index in indices]
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in query(self, coordinates)
43 """
44 try:
---> 45 distances, indices = self.tree.query(coordinates, k=1)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.query()
ValueError: x must consist of vectors of length 2 but has shape (2, 676701)
To test that the package is working:
coordinates = (tourdata['latitude'][0],tourdata['longitude'][0]),
results = (rg.search(coordinates))
print(results)
Outputs :
[{'country_code': 'AT', 'city': 'Wartmannstetten', 'country': 'Austria'}]
Any help with this is appreciated. Ideally I'd like to access the resulting dictionary and apply only the country code to the Country column.
The search method expects a list of coordinates. To obtain a single data point you can use the "get" method.
Try:
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
It works fine for me:
import pandas as pd
import reverse_geocode as rg
tourdata = pd.DataFrame({'latitude':[0.3, 2, 0.6], 'longitude':[12, 5, 0.8]})
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
tourdata['country']
Output :
0 {'country': 'Gabon', 'city': 'Booué', 'country...
1 {'country': 'Sao Tome and Principe', 'city': '...
2 {'country': 'Ghana', 'city': 'Mumford', 'count...
Name: country, dtype: object
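Since the goal is to store only the country code, another option is to query all coordinates in a single call and extract country_code from each result dictionary. This is a sketch; it assumes rg.search accepts the full list of (latitude, longitude) pairs, as the single-pair test in the question suggests:
import reverse_geocode as rg

coords = list(zip(tourdata['latitude'], tourdata['longitude']))  # list of (lat, lon) pairs
results = rg.search(coords)                                      # one dict per coordinate pair
tourdata['Country'] = [r['country_code'] for r in results]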

Preprocessing string data in pandas dataframe

I have a user review dataset. I have loaded this dataset and now I want to preprocess the user reviews (i.e. removing stopwords and punctuation, converting to lower case, removing salutations, etc.) before fitting them to a classifier, but I am getting an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean

dataset['reviewText'] = dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I am getting this error:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
Please suggest corrections in this function for my data or suggest a new function for data cleaning.
Here is my data:
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...
To convert to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation (the regex keeps word characters, including digits):
import re
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : " ".join(re.findall(r'[\w]+', x)))
To remove stopwords, you can either install the stop_words package or create your own stopword list and use it with a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')
def remove_stopWords(s):
    '''For removing stop words
    '''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s

df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))
