How to use multiprocessing pool for Pandas apply function - python-3.x

I want to use a multiprocessing Pool with pandas DataFrames.
I tried the following, but the error below occurs.
Can't I use a Pool with a Series?
from multiprocessing import Pool
split = np.array_split(df, 4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()
The error message is as follows.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str

Try:
import pandas as pd
import numpy as np
import multiprocessing as mp
def test(x):
    return x * 2

if __name__ == '__main__':
    # Demo dataframe
    df = pd.DataFrame({'Test': range(100)})
    # Extract the Series and split into chunks
    split = np.array_split(df['Test'], 4)
    # Parallel processing
    with mp.Pool(4) as pool:
        data = pool.map(test, split)
    # Concatenate results
    out = pd.concat(data)
Output:
>>> df
Test
0 0
1 1
2 2
3 3
4 4
.. ...
95 95
96 96
97 97
98 98
99 99
[100 rows x 1 columns]
>>> out
0 0
1 2
2 4
3 6
4 8
...
95 190
96 192
97 194
98 196
99 198
Name: Test, Length: 100, dtype: int64
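If the function you want to run is not vectorised (so it cannot be applied to a whole chunk at once the way test can), each worker can call .apply on its own chunk instead. A minimal sketch reusing the names above; the apply_chunk helper is introduced here only for illustration:
def apply_chunk(chunk):
    # Each worker applies test element-wise to its own chunk of the Series
    return chunk.apply(test)

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        out = pd.concat(pool.map(apply_chunk, split))
As in the answer above, the worker function must be defined at module level so it can be pickled, and on Windows the Pool must be created under the if __name__ == '__main__': guard.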

Related

TypeError: 'float' object cannot be interpreted as an integer on linspace

TypeError Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
-->1 plot_freq(signal, sample_rate)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
2 def plot_freq(signal, sample_rate, fft_size=512):
3 xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4 freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
5 xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
6 plt.figure(figsize=(20, 5))
File <__array_function__ internals>:5, in linspace(*args, **kwargs)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
23 #array_function_dispatch(_linspace_dispatcher)
24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
25 axis=0):
26 """
27 Return evenly spaced numbers over a specified interval.
28
(...)
118
119 """
--> 120 num = operator.index(num)
121 if num < 0:
122 raise ValueError("Number of samples, %s, must be non-negative." % num)
TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
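The error occurs because np.linspace requires its num argument to be an integer, while in Python 3 fft_size/2 + 1 evaluates to a float. Integer division (or an explicit int()) fixes it; a minimal sketch, with sample_rate assumed only for illustration:
import numpy as np

fft_size = 512
sample_rate = 16000  # assumed value for illustration
# fft_size // 2 + 1 is an int (257), so linspace no longer raises TypeError
freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)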

Divide pandas data frame into test and train based on unique ID

I want to split it into two data frames (train and test) using the values in the id column. The split should be such that the first data frame has 70% of the (unique) ids and the second data frame has 30% of the ids. The ids should be randomly split.
I have multiple values corresponding to one id.
This is the script I was trying:
Training_data, Test_data = sklearn.model_selection.train_test_split(data, data['ID_sample'].unique(), train_size=0.30, test_size=0.70, random_state=5)
I sorted the issue out in the following way:
samplelist = data["ID_sample"].unique()
training_samp, test_samp = sklearn.model_selection.train_test_split(samplelist, train_size=0.7, test_size=0.3, random_state=5, shuffle=True)
training_data = data[data['ID_sample'].isin(training_samp)]
test_data = data[data['ID_sample'].isin(test_samp)]
I am not an sklearn expert, but I know a little about it, and this is a question newcomers have been asking ever since it came into existence.
Anyway, here is how you can solve it: import train_test_split from sklearn.model_selection and apply it. I have just created some random data and applied the same approach.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(100, 2))
>>> df
0 1
0 -1.214487 0.455726
1 -0.898623 0.268778
2 0.262315 -0.009964
3 0.612664 0.786531
4 1.249646 -1.020366
.. ... ...
95 -0.171218 1.083018
96 0.122685 -2.214143
97 -1.420504 0.469372
98 0.061177 0.465881
99 -0.262667 -0.406031
[100 rows x 2 columns]
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.3)
Here is your first dataframe, train:
>>> train
0 1
26 -2.651343 -0.864565
17 0.106959 -0.763388
78 -0.398269 -0.501073
25 1.452795 1.290227
47 0.158705 -1.123697
.. ... ...
29 -1.909144 -0.732514
7 0.641331 -1.336896
43 0.769139 2.816528
59 -0.683185 0.442875
11 -0.543988 -0.183677
[70 rows x 2 columns]
Here is the second dataframe, test:
>>> test
0 1
30 -1.562427 -1.448936
24 0.638780 1.868500
70 -0.572035 1.615093
72 0.660455 -0.331350
82 0.295644 -0.403386
22 0.942676 -0.814718
15 -0.208757 -0.112564
45 1.069752 -1.894040
18 0.600265 0.599571
93 -0.853163 1.646843
91 -1.172471 -1.488513
10 0.728550 1.637609
36 -0.040357 2.050128
4 1.249646 -1.020366
60 -0.907925 -0.290945
34 0.029384 0.452658
38 1.566204 -1.171910
33 -1.009491 0.105400
62 0.930651 -0.124938
42 0.401900 -0.472175
80 1.266980 -0.221378
95 -0.171218 1.083018
74 -0.160058 -1.383118
28 1.257940 0.604513
87 -0.136468 -0.109718
27 1.909935 -0.712136
81 -1.449828 -1.823526
61 0.176301 -0.885127
53 -0.593061 1.547997
57 -0.527212 0.781028
In your case it should work as below; note that you don't need to define train_size if you are defining test_size, or vice versa.
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3)
OR
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3, random_state=5)
OR
This returns arrays of the unique ids ...
>>> train, test = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
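That last call only splits the unique ids, so the ids still have to be mapped back to the rows of the original dataframe. A minimal sketch of that final step, mirroring the isin approach shown above (train_ids and test_ids are illustrative names):
train_ids, test_ids = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
Training_data = data[data['ID_sample'].isin(train_ids)]
Test_data = data[data['ID_sample'].isin(test_ids)]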

Create linear model to check correlation tokenize error

I have data like the sample below, which has 4 continuous columns [x0 to x3] and a binary column y. y has two values, 1.0 and 0.0. I'm trying to check for correlation between the binary column y and one of the continuous columns, x0, using the CatConCor function below, but I'm getting the error message below. The function creates a linear regression model and calculates the p-value for the residuals with and without the categorical variable. If anyone can point out the issue or how to fix it, it would be very much appreciated.
Data:
x_r x0 x1 x2 x3 y
0 0 0.466726 0.030126 0.998330 0.892770 0.0
1 1 0.173168 0.525810 -0.079341 -0.112151 0.0
2 2 -0.854467 0.770712 0.929614 -0.224779 0.0
3 3 -0.370574 0.568183 -0.928269 0.843253 0.0
4 4 -0.659431 -0.948491 -0.091534 0.706157 0.0
Code:
import numpy as np
import pandas as pd
from time import time
import scipy.stats as stats
from IPython.display import display # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
###########################################
# Suppress matplotlib user warnings
# Necessary for newer version of matplotlib
import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")
#
# Display inline matplotlib plots with IPython
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
###########################################
import matplotlib.pyplot as plt
import matplotlib.cm as cm
# correlation between categorical variable and continuous variable
def CatConCor(df, catVar, conVar):
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    # subsetting data for one categorical column and one continuous column
    data2 = df.copy()[[catVar, conVar]]
    data2[catVar] = data2[catVar].astype('category')
    mod = ols(conVar + '~' + catVar, data=data2).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    if aov_table['PR(>F)'][0] < 0.05:
        print('Correlated p=' + str(aov_table['PR(>F)'][0]))
    else:
        print('Uncorrelated p=' + str(aov_table['PR(>F)'][0]))

# checking for correlation between categorical and continuous variables
CatConCor(df=train_df, catVar='y', conVar='x0')
Error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-6-80f83b8c8e14> in <module>()
1 # checking for correlation between categorical and continuous variables
2
----> 3 CatConCor(df=train_df,catVar='y',conVar='x0')
<ipython-input-2-35404ba1d697> in CatConCor(df, catVar, conVar)
103
104 mod = ols(conVar+'~'+catVar,
--> 105 data=data2).fit()
106
107 aov_table = sm.stats.anova_lm(mod, typ=2)
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
153
154 tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155 missing=missing)
156 ((endog, exog), missing_idx, design_info) = tmp
157
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
63 if data_util._is_using_pandas(Y, None):
64 result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65 NA_action=na_action)
66 else:
67 result = dmatrices(formula, Y, depth, return_type='dataframe',
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310 NA_action, return_type)
311 if lhs.shape[1] == 0:
312 raise PatsyError("model is missing required outcome variables")
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
60 "ascii-only, or else upgrade to Python 3.")
61 if isinstance(formula_like, str):
---> 62 formula_like = ModelDesc.from_formula(formula_like)
63 # fallthrough
64 if isinstance(formula_like, ModelDesc):
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/desc.py in from_formula(cls, tree_or_string)
162 tree = tree_or_string
163 else:
--> 164 tree = parse_formula(tree_or_string)
165 value = Evaluator().eval(tree, require_evalexpr=False)
166 assert isinstance(value, cls)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in parse_formula(code, extra_operators)
146 tree = infix_parse(_tokenize_formula(code, operator_strings),
147 operators,
--> 148 _atomic_token_types)
149 if not isinstance(tree, ParseNode) or tree.type != "~":
150 tree = ParseNode("~", None, [tree], tree.origin)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/infix_parser.py in infix_parse(tokens, operators, atomic_types, trace)
208
209 want_noun = True
--> 210 for token in token_source:
211 if c.trace:
212 print("Reading next token (want_noun=%r)" % (want_noun,))
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _tokenize_formula(code, operator_strings)
92 else:
93 it.push_back((pytype, token_string, origin))
---> 94 yield _read_python_expr(it, end_tokens)
95
96 def test__tokenize_formula():
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _read_python_expr(it, end_tokens)
42 origins = []
43 bracket_level = 0
---> 44 for pytype, token_string, origin in it:
45 assert bracket_level >= 0
46 if bracket_level == 0 and token_string in end_tokens:
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/util.py in next(self)
330 else:
331 # May raise StopIteration
--> 332 return six.advance_iterator(self._it)
333 __next__ = next
334
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/tokens.py in python_tokenize(code)
33 break
34 origin = Origin(code, start, end)
---> 35 assert pytype not in (tokenize.NL, tokenize.NEWLINE)
36 if pytype == tokenize.ERRORTOKEN:
37 raise PatsyError("error tokenizing input "
AssertionError:
Upgrading patsy to 0.5.1 fixed the issue. I found the tip here:
https://github.com/statsmodels/statsmodels/issues/5343
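After upgrading (for example with pip install --upgrade patsy), you can confirm the installed version from Python with a minimal check:
import patsy
print(patsy.__version__)  # should print 0.5.1 or later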

Applying function to pandas dataframe

I have a pandas dataframe called 'tourdata' consisting of 676k rows of data. Two of the columns are latitude and longitude.
Using the reverse_geocode package, I want to convert these coordinates to country data.
When I call:
import reverse_geocode as rg
tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
I get the error:
ValueErrorTraceback (most recent call last)
in ()
1 coordinates = (tourdata['latitude'],tourdata['longitude']),
----> 2 tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in search(coordinates)
114 """
115 gd = GeocodeData()
--> 116 return gd.query(coordinates)
117
118
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in query(self, coordinates)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
---> 48 raise e
49 else:
50 results = [self.locations[index] for index in indices]
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py
in query(self, coordinates)
43 """
44 try:
---> 45 distances, indices = self.tree.query(coordinates, k=1)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.query()
ValueError: x must consist of vectors of length 2 but has shape (2, 676701)
To test that the package is working:
coordinates = (tourdata['latitude'][0],tourdata['longitude'][0]),
results = (rg.search(coordinates))
print(results)
Outputs :
[{'country_code': 'AT', 'city': 'Wartmannstetten', 'country': 'Austria'}]
Any help with this is appreciated. Ideally I'd like to access the resulting dictionary and apply only the country code to the Country column.
The search method expects a list of coordinates. To obtain a single data point you can use the "get" method.
Try:
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
It works fine for me:
import pandas as pd
import reverse_geocode as rg
tourdata = pd.DataFrame({'latitude':[0.3, 2, 0.6], 'longitude':[12, 5, 0.8]})
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
tourdata['country']
Output :
0 {'country': 'Gabon', 'city': 'Booué', 'country...
1 {'country': 'Sao Tome and Principe', 'city': '...
2 {'country': 'Ghana', 'city': 'Mumford', 'count...
Name: country, dtype: object
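Since the goal is to store only the country code, another option is to query all coordinates in a single call and extract country_code from each result dictionary. This is a sketch; it assumes rg.search accepts the full list of (latitude, longitude) pairs, as the single-pair test in the question suggests:
import reverse_geocode as rg

coords = list(zip(tourdata['latitude'], tourdata['longitude']))  # list of (lat, lon) pairs
results = rg.search(coords)                                      # one dict per coordinate pair
tourdata['Country'] = [r['country_code'] for r in results]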

Preprocessing string data in pandas dataframe

I have a user review dataset. I have loaded this dataset and now I want to preprocess the user reviews (i.e. removing stopwords and punctuation, converting to lower case, removing salutations, etc.) before fitting them to a classifier, but I am getting an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean

dataset['reviewText'] = dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I am getting this error:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
Please suggest corrections in this function for my data or suggest a new function for data cleaning.
Here is my data:
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...
To convert to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation (the regex keeps word characters, including digits):
import re
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : " ".join(re.findall(r'[\w]+', x)))
To remove stopwords, you can either install the stop_words package or create your own stopword list and use it with a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')
def remove_stopWords(s):
    '''For removing stop words
    '''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s

df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))
