Preprocessing string data in pandas dataframe - python-3.x

I have a user review dataset. I have loaded it and now I want to preprocess the user reviews (i.e. remove stopwords and punctuation, convert to lowercase, remove salutations, etc.) before fitting a classifier, but I am getting an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean
dataset['reviewText']=dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I am getting these errors:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
and
TypeError Traceback (most recent call last)
<ipython-input-70-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-69-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
Please suggest corrections in this function for my data or suggest a new function for data cleaning.
Here is my data:
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...

print(df)
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...
To convert to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation (note that \w also matches digits and underscores, so numbers are kept; use r'[a-z]+' on the lowercased text to drop them too, at the cost of also dropping non-ASCII letters):
import re
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : " ".join(re.findall(r'[\w]+',x)))
To remove stopwords, you can either install a stopword package (stop_words is used here) or create your own stopword list and use it in a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')
def remove_stopWords(s):
    '''For removing stop words
    '''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))
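As for the original TypeError: the line text = (text.lower() for text in text) builds a generator over the characters of the review rather than a lowercased string, so the later steps may no longer receive the plain string they expect. A minimal sketch of a corrected cleaning function, using only the standard library, could look like the following. The small STOPWORDS set and the crude tag-stripping regex are illustrative assumptions standing in for the NLTK stopword list and BeautifulSoup; swap those libraries back in if they are available.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, for illustration only)
STOPWORDS = {"the", "a", "an", "and", "is", "it", "to", "of", "but", "as"}

def clean_text(text):
    """Lowercase a review, strip tags and punctuation, and drop stopwords."""
    text = str(text).lower()                # lowercase the whole string at once
    text = re.sub(r"<[^>]+>", " ", text)    # crude HTML tag removal (assumption)
    tokens = re.findall(r"[a-z]+", text)    # keep runs of ASCII letters only
    return [tok for tok in tokens if tok not in STOPWORDS]

sample = "The product does exactly as it should, and is <b>quite</b> affordable!"
print(clean_text(sample))
```

With pandas, the same function could be applied per row with dataset['reviewText'].apply(clean_text).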

Related

How to use multiprocessing pool for Pandas apply function

I want to use pool for Pandas data frames.
I tried as follows, but the following error occurs.
Can't I use pool for Series?
from multiprocessing import pool
split = np.array_split(split,4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()
The error message is as follows.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str
Try:
import pandas as pd
import numpy as np
import multiprocessing as mp
def test(x):
    return x * 2
if __name__ == '__main__':
    # Demo dataframe
    df = pd.DataFrame({'Test': range(100)})
    # Extract the Series and split it into chunks
    split = np.array_split(df['Test'], 4)
    # Parallel processing
    with mp.Pool(4) as pool:
        data = pool.map(test, split)
    # Concatenate results
    out = pd.concat(data)
Output:
>>> df
Test
0 0
1 1
2 2
3 3
4 4
.. ...
95 95
96 96
97 97
98 98
99 99
[100 rows x 1 columns]
>>> out
0 0
1 2
2 4
3 6
4 8
...
95 190
96 192
97 194
98 196
99 198
Name: Test, Length: 100, dtype: int64

How to remove from the Pandas series the characters contained in the list (or series)

Good afternoon. Checking the text in the column, I came across characters that I didn't need:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
Text encoding - UTF8.
How do I correctly remove all these characters from a specific column (series) of a Pandas data frame?
I tried:
template = bad_symbols[0].str.cat(sep='|')
print(template)
template = re.compile(template, re.UNICODE)
test = label_data['text'].str.replace(template, '', regex=True)
And I get the following error:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-105-36817f343a8a> in <module>
5 print(template)
6
----> 7 template = re.compile(template, re.UNICODE)
8
9 test = label_data['text'].str.replace(template, '', regex=True)
5 frames
/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
643 if not item or item[0][0] is AT:
644 raise source.error("nothing to repeat",
--> 645 source.tell() - here + len(this))
646 if item[0][0] in _REPEATCODES:
647 raise source.error("multiple repeat",
error: nothing to repeat at position 36 (line 2, column 30)
You need to escape your characters, because several of them (such as +, ?, * and the parentheses) are regex metacharacters; use re.escape:
import re
template = '|'.join(map(re.escape, bad_symbols[0]))
Then there is no need to compile the pattern; pandas will handle it for you:
test = label_data['text'].str.replace(template, '', regex=True, flags=re.UNICODE)
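The effect of escaping can be checked without pandas. The sketch below uses a small, hypothetical subset of the unwanted characters (the bad_symbols list and the sample string are assumptions) and shows that the unescaped pattern fails to compile while the escaped one removes the characters literally.

```python
import re

# Hypothetical subset of the unwanted characters, including regex metacharacters
bad_symbols = ["+", "?", "(", ")", "|", "«", "2"]

# Joining without escaping produces an invalid pattern
try:
    re.compile("|".join(bad_symbols))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping each character first turns every alternative into a literal
pattern = "|".join(map(re.escape, bad_symbols))
cleaned = re.sub(pattern, "", "a+b?(c)|«d»2")

# For single characters, an escaped character class is an equivalent alternative
char_class = "[" + re.escape("".join(bad_symbols)) + "]"
cleaned_class = re.sub(char_class, "", "a+b?(c)|«d»2")

print(unescaped_ok, cleaned, cleaned_class)
```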

How can I make a list of three sentences to a string?

I have a target word and the left and right context that I have to join together. I am using pandas and I try to join the sentences, and the target word, together into a list, which I can then turn into a string so that it would work with my vectorizer. Basically I am just trying to turn a list of three sentences to a string.
This is the error that I get:
AttributeError Traceback (most recent call last)
<ipython-input-195-ae09731d3572> in <module>()
3
4 vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
----> 5 feature_matrix=vectorizer.fit_transform(trainTexts)
6 print("shape=",feature_matrix.shape)
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'list' object has no attribute 'lower'
I have tried using .join and .split, but they are not working for me, so I must be doing something wrong.
import sys
import csv
import random
csv.field_size_limit(sys.maxsize)
trainLabels = []
trainTexts = []
with open("myTsvFile.tsv") as train:
    trainData = [row for row in csv.reader(train, delimiter='\t')]
    random.shuffle(trainData)
    for example in trainData:
        trainLabels.append(example[1])
        trainTexts.append(example[3:6])
In the slice example[3:6], index 3 is the left context, index 4 is the target word and index 5 is the right context.
print('Text:', trainTexts[3])
print('Label:', trainLabels[1])
A few of the printed rows (edited):
['Visa electron käy aika monessa paikassa luottokortista . Mukaanlukien ', 'Paypal', ' , mikä avaa taas lisää ovia .']
['Nyt pistän pääni pölkyllä : ', 'WinForms', ' on ihan ok .']
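The traceback shows CountVectorizer calling doc.lower() on each document, so every element of trainTexts has to be a single string rather than a list of three parts. A minimal sketch of the joining step, using the two rows printed above, might look like this; the variable names train_texts and joined_texts are assumptions.

```python
# Rows in the shape printed above: [left context, target word, right context]
train_texts = [
    ["Visa electron käy aika monessa paikassa luottokortista . Mukaanlukien ",
     "Paypal",
     " , mikä avaa taas lisää ovia ."],
    ["Nyt pistän pääni pölkyllä : ", "WinForms", " on ihan ok ."],
]

# The parts already carry their own spacing, so join with an empty separator
joined_texts = ["".join(parts) for parts in train_texts]

print(joined_texts[1])
```

In the reading loop this would amount to trainTexts.append("".join(example[3:6])).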

How to fix this ValueError?

I am trying to run Python code, mostly based on the NLTK book, for n-gram POS tagging of a Gujarati text from my GujaratiTextCorpus, and I encountered a ValueError.
I am working with Python 3.7.3 on Windows 10 and use Jupyter Notebook through Anaconda. I am a beginner in Python. I studied the answers available on stackoverflow.com to fix my ValueError, but could not solve it.
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\cts260.txt', encoding = 'utf8')
raw = f.read()
train2_sents = nltk.sent_tokenize(raw)
text2 = nltk.Text(train2_sents)
train2_sents
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\txt42_sents.txt', encoding = 'utf8')
raw = f.read()
bs_sents = nltk.sent_tokenize(raw)
text3 = nltk.Text(bs_sents)
bs_sents
unigram_tagger = nltk.UnigramTagger(train2_sents)
unigram_tagger.tag(bs_sents)
I expected that the words of the two Gujarati sentences would be POS Tagged. I found the following error messages:
ValueError
Traceback (most recent call last)
<ipython-input-3-5fae0b92393e> in <module>
11 text3 = nltk.Text(bs_sents)
12 bs_sents
---> 13 unigram_tagger = nltk.UnigramTagger(train2_sents)
14 unigram_tagger.tag(bs_sents)
15
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, train, model, backoff, cutoff, verbose)
344
345 def __init__(self, train=None, model=None, backoff=None, cutoff=0, verbose=False):
--> 346 NgramTagger.__init__(self, 1, train, model, backoff, cutoff, verbose)
347
348 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, n, train, model, backoff, cutoff, verbose)
293
294 if train:
--> 295 self._train(train, cutoff, verbose)
296
297 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, cutoff, verbose)
181 fd = ConditionalFreqDist()
182 for sentence in tagged_corpus:
--> 183 tokens, tags = zip(*sentence)
184 for index, (token, tag) in enumerate(sentence):
185 # Record the event.
ValueError: not enough values to unpack (expected 2, got 1)
The error says that the value being unpacked yields one item where two are expected.
Example:
for a, b in [("a", "b")]:
    print("a:", a, "b:", b)
This will work.
for a, b in [("a")]:
    print("a:", a, "b:", b)
This will not work, because ("a") is just the string "a", which unpacks into a single value.
Edit:
Look at your UnigramTagger.
As its first argument it takes a list of tagged sentences, of type
list(list(tuple(str, str)))
so each sentence must be a list of (word, tag) pairs.
You are passing train2_sents, which comes from nltk.sent_tokenize and is therefore a list of untagged sentence strings, of type
list(str)
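The traceback points at tokens, tags = zip(*sentence) inside the tagger's training loop. The sketch below runs that same unpacking on a tagged sentence and on a raw string; the example words and the DET/NOUN tags are placeholders, not taken from the Gujarati corpus.

```python
# One tagged sentence: a list of (word, tag) pairs. The tags are placeholders.
tagged_sentence = [("the", "DET"), ("dog", "NOUN")]
tokens, tags = zip(*tagged_sentence)

# A raw string is unpacked character by character, so zip yields a single tuple
raw_sentence = "the dog"
try:
    tokens_bad, tags_bad = zip(*raw_sentence)
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

print(tokens, tags)
print(unpack_error)
```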

fuzzy lookup between 2 series/columns of nonidentical lengths

I am trying to do a fuzzy lookup between two series/columns of df1 and df2, where df1 is the dictionary file (used as the base) and df2 is the target file (to be looked up):
import pandas as pd
df1 = pd.DataFrame(data ={'Brand_var':['Altmeister Bitter','Altos Las Hormigas Argentinian Wine','Amadeus Contri Sparkling Wine','Amadeus Cream Liqueur','Amadeus Sparkling Sparkling Wine']})
df2 = pd.DataFrame(data = {'Product':['1960 Altmeister 330ML CAN METAL','Hormi 12 Yr Bottle','test']})
I looked for solutions on SO, but unfortunately could not find one.
Used:
df3 = df2['ProductLongDesc'].apply(lambda x: difflib.get_close_matches(x, df1['Brand_var'])[0])
also :
df3 = df2['Product'].apply(lambda x: difflib.get_close_matches(x, df1['Brand_var']))
The first one gives me an index error and the second one gives me just the indexes.
My desired output is to print a mapping between df1 item and df2 items using a fuzzy lookup and printing both Brand_var and Product for their respective matches.
Desired Output:
Brand_var Product
Altmeister Bitter 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine Hormi 12 Yr Bottle
Non-matching items, e.g. test in df2, can be ignored.
Note: The matching string could also be non-identical, e.g. it may have 1 or 2 letters missing. :(
Thank you in advance for taking your time out for this issue. :)
Even if you install fuzzywuzzy, you are still left with the problem of choosing a proper heuristic to select the right product and to cut the products that are matched incorrectly (explanation below).
Install fuzzywuzzy:
pip install fuzzywuzzy
fuzzywuzzy has several methods for calculating a ratio (examples on GitHub). You face the problem of choosing the best one. I tried them on your data, but all of them failed.
Code:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
# df1 = ...
# df2 = ...
def get_top_by_ratio(x, df2):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Apply this function to your data:
partial_ratio = (df1.Brand_var.apply(lambda x: get_top_by_ratio(x, df2))
                 .apply(pd.Series)  # convert returned Series of tuples into pd.DataFrame
                 .rename(columns={0: 'ratio', 1: 'Product'}))  # just rename columns
print(partial_ratio)
print(partial_ratio)
Out:
0 65 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 50 test # Altos Las Hormigas Argentinian Wine
2 33 test
3 50 test
4 50 test
That's not good. Other ratio methods such as fuzz.ratio, fuzz.token_sort_ratio, etc. failed too.
So I guess extending the heuristic to compare words, not only characters, might help. Define a function that creates a vocabulary from your data, encode all the sentences, and use a more sophisticated heuristic that looks at words too:
def create_vocab(df1, df2):
    # Leave 0 index free for unknown words
    all_words = set((df1.Brand_var.str.cat(sep=' ') + df2.Product.str.cat(sep=' ')).split())
    vocab = dict([(i + 1, w) for i, w in enumerate(all_words)])
    return vocab
def encode(string, vocab):
    """This function encodes a string with the vocabulary"""
    return [vocab[w] if w in vocab else 0 for w in string.split()]
Define new heuristic:
def get_top_with_heuristic(x, df2, vocab):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio_per_char = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    # compare two strings by words
    ratio_per_word = np.array([fuzz.partial_ratio(x, encode(val, vocab)) for val in product_values])
    ratio = ratio_per_char + ratio_per_word
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Create vocabulary, apply sophisticated heuristic to the data:
vocab = create_vocab(df1, df2)
heuristic_rating = (df1.Brand_var.apply(lambda x: get_top_with_heuristic(x, df2, vocab))
                    .apply(pd.Series)
                    .rename(columns={0: 'ratio', 1: 'Product'}))
print(heuristic_rating)
Out:
ratio Product
0 73 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 61 Hormi 12 Yr Bottle # Altos Las Hormigas Argentinian Wine
2 45 Hormi 12 Yr Bottle
3 50 test
4 50 test
It seems to be correct! Concatenate this dataframe to df1, change index:
result_heuristic = pd.concat((df1, heuristic_rating), axis=1).set_index('Brand_var')
print(result_heuristic)
Out:
ratio Product
Brand_var
Altmeister Bitter 73 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine 61 Hormi 12 Yr Bottle
Amadeus Contri Sparkling Wine 45 Hormi 12 Yr Bottle
Amadeus Cream Liqueur 50 test
Amadeus Sparkling Sparkling Wine 50 test
Now you should choose some rule of thumb to cut incorrect matches. For this example, discarding rows with ratio <= 50 works well, but you will probably need some research to define the best heuristic and the correct threshold. You will also get some errors anyway. Choose an acceptable error rate, e.g. 2% or 5%, and improve your algorithm until you reach it (this task is similar to validating machine learning classification algorithms).
Cut incorrect "predictions":
result = result_heuristic[result_heuristic.ratio > 50][['Product']]
print(result)
Out:
Product
Brand_var
Altmeister Bitter 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine Hormi 12 Yr Bottle
Hope it helps!
P.S. Of course, this algorithm is very slow. When you optimize it, you should, for example, cache the diffs.
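On the question's first attempt: difflib.get_close_matches returns an empty list when no candidate reaches its cutoff (0.6 by default), so indexing the result with [0] raises an IndexError for inputs without a match. A minimal standard-library sketch of a guarded lookup follows. Here best_match is a hypothetical helper, and the misspelled query "Altmeister Biter" is invented to illustrate the "1 or 2 letters missing" case.

```python
import difflib

brands = [
    "Altmeister Bitter",
    "Altos Las Hormigas Argentinian Wine",
    "Amadeus Contri Sparkling Wine",
    "Amadeus Cream Liqueur",
    "Amadeus Sparkling Sparkling Wine",
]

def best_match(query, candidates, cutoff=0.6):
    """Return the closest candidate, or None when nothing reaches the cutoff."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("Altmeister Biter", brands))  # one letter missing
print(best_match("test", brands))              # no candidate is close enough
```

Returning None instead of indexing blindly lets unmatched rows be dropped afterwards, for example with dropna() once the function has been applied to a column.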

Resources