How can I turn a list of three sentences into a string?

I have a target word and the left and right context that I have to join together. I am using pandas, and I try to join the sentences and the target word together into a list, which I can then turn into a string so that it works with my vectorizer. Basically I am just trying to turn a list of three sentences into a string.
This is the error that I get:
AttributeError Traceback (most recent call last)
<ipython-input-195-ae09731d3572> in <module>()
3
4 vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
----> 5 feature_matrix=vectorizer.fit_transform(trainTexts)
6 print("shape=",feature_matrix.shape)
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'list' object has no attribute 'lower'
I have tried using .join and .split, but they are not working for me, so I am doing something wrong.
import sys
import csv
import random

csv.field_size_limit(sys.maxsize)

trainLabels = []
trainTexts = []

with open("myTsvFile.tsv") as train:
    trainData = [row for row in csv.reader(train, delimiter='\t')]

random.shuffle(trainData)

for example in trainData:
    trainLabels.append(example[1])
    trainTexts.append(example[3:6])
In the slice example[3:6], index 3 is the left context, 4 is the target word, and 5 is the right context.
print('Text:', trainTexts[3])
print('Label:', trainLabels[1])
Edit: here are a few printed lines from the code:
['Visa electron käy aika monessa paikassa luottokortista . Mukaanlukien ', 'Paypal', ' , mikä avaa taas lisää ovia .']
['Nyt pistän pääni pölkyllä : ', 'WinForms', ' on ihan ok .']
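One possible fix, sketched under the assumption that the three pieces only need to be glued together with spaces: join each slice into a single string before appending, so every element of trainTexts is a str and fit_transform no longer calls .lower() on a list.
for example in trainData:
    trainLabels.append(example[1])
    # one string per example, e.g. "... Mukaanlukien Paypal , mikä avaa ..."
    trainTexts.append(" ".join(example[3:6]))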

Related

How to remove characters contained in a list (or series) from a Pandas series

Good afternoon. Checking the text in the column, I came across characters that I didn't need:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
Text encoding - UTF8.
How do I correctly remove all these characters from a specific column (series) of a Pandas data frame?
I tried:
template = bad_symbols[0].str.cat(sep='|')
print(template)
template = re.compile(template, re.UNICODE)
test = label_data['text'].str.replace(template, '', regex=True)
And I get the following error:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-105-36817f343a8a> in <module>
5 print(template)
6
----> 7 template = re.compile(template, re.UNICODE)
8
9 test = label_data['text'].str.replace(template, '', regex=True)
5 frames
/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
643 if not item or item[0][0] is AT:
644 raise source.error("nothing to repeat",
--> 645 source.tell() - here + len(this))
646 if item[0][0] in _REPEATCODES:
647 raise source.error("multiple repeat",
error: nothing to repeat at position 36 (line 2, column 30)
You need to escape your characters, use re.escape:
import re
template = '|'.join(map(re.escape, bad_symbols[0]))
Then there is no need to compile; pandas will handle it for you:
test = label_data['text'].str.replace(template, '', regex=True, flags=re.UNICODE)
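A minimal end-to-end sketch of those two answer lines, with a small made-up symbol list and frame standing in for the question's data:
import re
import pandas as pd

bad_symbols = pd.DataFrame({0: ['"', '|', '.', '+', '(', ')']})  # hypothetical sample
label_data = pd.DataFrame({'text': ['a+b|c', '"quoted" (text).']})

template = '|'.join(map(re.escape, bad_symbols[0]))  # escape every character
test = label_data['text'].str.replace(template, '', regex=True, flags=re.UNICODE)
print(test.tolist())  # ['abc', 'quoted text']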

sklearn DictVectorizer() throwing error with a dictionary as input

I'm fairly new to sklearn's DictVectorizer, and am trying to create a function where DictVectorizer will output feature names from a list of bigrams that I have used to form a feature dictionary. The input to my function is a string, and the function should return a list consisting of the bigrams formed into dictionaries (something like this).
def features(str) -> List[Dict[Text, Union[Text, int]]]:
    # The feature dict needs "bigram" as its key, and the values are the
    # bigrams themselves, each of the form "w[i]-w[i+1]".
    # This is my bigram list (as structured above):
    bigrams: List[Dict[Text, Union[Text, int]]] = []
    # here is my code:
    bigrams = {'bigram': i for j in sentence
               for i in zip(j.split(" ")[:-1], j.split(" ")[1:])}
    return bigrams
vect = DictVectorizer(sparse=False)
text = str()
feature_catalog = features(text)
vect.fit(feature_catalog)
print(sorted(vectorizer.get_feature_names_out()))
Everything works fine until the code advances to the DictVectorizer blocks (hidden in the class itself). This is what I get:
AttributeError Traceback (most recent call last)
/var/folders/pl/k80fpf9s4f9_3rp8hnpw5x0m0000gq/T/ipykernel_3804/266218402.py in <module>
22 features = get_feature(text)
23
---> 24 vectorizer.fit(features)
25
26 print(sorted(vectorizer.get_feature_names()))
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/_dict_vectorizer.py in fit(self, X, y)
159
160 for x in X:
--> 161 for f, v in x.items():
162 if isinstance(v, str):
163 feature_name = "%s%s%s" % (f, self.separator, v)
AttributeError: 'str' object has no attribute 'items'
Any ideas? This is ultimately going to be used as part of a larger processing effort on a corpus.
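No answer is given above, but the traceback points at the shape of the input: DictVectorizer.fit expects an iterable of dicts, one per sample, while the dict comprehension builds a single dict (repeatedly overwriting the 'bigram' key), so iterating it yields plain strings. A hedged sketch of one possible rewrite; the sample sentence list and the f-string formatting are my assumptions, not from the question:
from typing import Dict, List, Text, Union
from sklearn.feature_extraction import DictVectorizer

def features(sentences: List[Text]) -> List[Dict[Text, Union[Text, int]]]:
    # One dict per bigram, each value formatted as "w[i]-w[i+1]"
    return [{'bigram': f"{a}-{b}"}
            for j in sentences
            for a, b in zip(j.split(" ")[:-1], j.split(" ")[1:])]

vect = DictVectorizer(sparse=False)
feature_catalog = features(["this is a test", "another short sentence"])  # made-up input
vect.fit(feature_catalog)
print(sorted(vect.get_feature_names_out()))
# ['bigram=a-test', 'bigram=another-short', 'bigram=is-a', 'bigram=short-sentence', 'bigram=this-is']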

TypeError: sequence item 0: expected str instance, NoneType found

Issue: in the code below, I have used two print statements to do the same thing. While the first one does its job, the second one throws an exception when executed. I have brainstormed a lot but cannot find exactly where the NoneType object inside join is coming from:
import numpy as np
from sklearn import preprocessing

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

print("\nLabel Mapping:")
for i, item in enumerate(encoder.classes_):
    print(item, '--->', i)

print("\nLabel Mapping:", ''.join(print(item, '--->', i) for i, item in
                                  enumerate(encoder.classes_)))
Here is the output:
Label Mapping:
black ---> 0
green ---> 1
red ---> 2
white ---> 3
yellow ---> 4
Traceback (most recent call last):
File "C:\Users\satyaranjan.rout\workspace\archival script\bokehtest.py", line 12, in <module>
Label Mapping:
black ---> 0
green ---> 1
red ---> 2
white ---> 3
yellow ---> 4
print("\nLabel Mapping:"),''.join(print(item, '--->',i) for i,item in enumerate(encoder.classes_))
TypeError: sequence item 0: expected str instance, NoneType found
Question: both the explicit loop and the one-liner are meant to do the same thing. Why does the one-liner end up feeding NoneType objects to join, and what replacement fixes it?
Change the line
print("\nLabel Mapping:",''.join(print(item, '--->',i) for i,item in enumerate(encoder.classes_)))
into:
print("\nLabel Mapping:",''.join('%s--->%s' % (item, i) for i,item in enumerate(encoder.classes_)))
print returns None, so the code ends up trying to join None elements; that is why it raises the error. Once you format the items as strings instead, the problem is solved.
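A hedged aside on the separator: joining with '' runs the pairs together, so something like ', ' (my choice, not from the answer) reads better:
print("\nLabel Mapping:", ', '.join('%s--->%s' % (item, i) for i, item in enumerate(encoder.classes_)))
# Label Mapping: black--->0, green--->1, red--->2, white--->3, yellow--->4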

Preprocessing string data in pandas dataframe

I have a user review dataset. I have loaded this dataset and now I want to preprocess the user reviews (i.e. removing stopwords and punctuation, converting to lower case, removing salutations, etc.) before fitting them to a classifier, but I am getting errors. Here is my code:
import pandas as pd
import numpy as np

df = pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json", lines=True)
dataset = df.filter(['overall', 'reviewText'], axis=1)

def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)

    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()

    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)

    from nltk.corpus import stopwords
    clean = [word for word in text if word not in stopwords.words('english')]
    return clean

dataset['reviewText'] = dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I am getting these errors:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
and a second, identical TypeError traceback from re-running the same apply (only the IPython cell numbers differ).
Please suggest corrections in this function for my data or suggest a new function for data cleaning.
Here is my data:
overall reviewText
0 5 Not much to write about here, but it does exac...
1 5 The product does exactly as it should and is q...
2 5 The primary job of this device is to block the...
3 5 Nice windscreen protects my MXL mic and preven...
4 5 This pop filter is great. It looks and perform...
5 5 So good that I bought another one. Love the h...
6 5 I have used monster cables for years, and with...
7 3 I now use this cable to run from the output of...
8 5 Perfect for my Epiphone Sheraton II. Monster ...
9 5 Monster makes the best cables and a lifetime w...
10 5 Monster makes a wide array of cables, includin...
11 4 I got it to have it if I needed it. I have fou...
12 3 If you are not use to using a large sustaining...
13 5 I love it, I used this for my Yamaha ypt-230 a...
14 5 I bought this to use in my home studio to cont...
15 2 I bought this to use with my keyboard. I wasn'...
print(df)
(the same sixteen rows of overall and reviewText shown in the question)
To convert to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation (note that \w+ keeps letters, digits, and underscores, so numbers actually survive this step):
import re
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: " ".join(re.findall(r'\w+', x)))
To remove stopwords, you can either install the stop_words package or create your own stopword list and use it with a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')

def remove_stopWords(s):
    '''For removing stop words'''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s

df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))

'set' object cannot be interpreted as an integer

I have the following python code:
text = "this’s a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it’s your turn."
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(text)
import numpy as np
lenDoc=len(sent_tokenize_list)
features={'position','rate'}
score = np.empty((lenDoc, 2), dtype=object)
score=[[0 for x in range(sent_tokenize_list)] for y in range(features)]
for i,sentence in enumerate(sent_tokenize_list):
score[i,features].append((lenDoc-i)/lenDoc)
But it results in the following error:
TypeError Traceback (most recent call last) <ipython-input-27-c53da2b2ab02> in <module>()
13
14
---> 15 score=[[0 for x in range(sent_tokenize_list)] for y in range(features)]
16 for i,sentence in enumerate(sent_tokenize_list):
17 score[i,features].append((lenDoc-i)/lenDoc)
TypeError: 'set' object cannot be interpreted as an integer
range() takes int values. features is a set, so it throws an error. You made the same mistake with range(sent_tokenize_list): sent_tokenize_list is a list, not an int.
If you want x and y to be indexes into sent_tokenize_list and features, then you have to use: score=[[0 for x in range(len(sent_tokenize_list))] for y in range(len(features))]
But if you want x and y to be the values of sent_tokenize_list and features, then you have to remove range() from that line.
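A minimal runnable sketch of the len()-based fix, keeping the question's names; I also swap the set for a list so the feature order is stable, and index the nested list with plain integers, since score[i, features] would fail on a list too (both changes are my assumptions):
from nltk.tokenize import sent_tokenize  # needs nltk's 'punkt' data downloaded

text = "this is sent one. this is sent two. is this sent three?"
sent_tokenize_list = sent_tokenize(text)
lenDoc = len(sent_tokenize_list)
features = ['position', 'rate']  # a list keeps a stable order, unlike a set

# rows = features, columns = sentences, matching the corrected comprehension
score = [[0 for x in range(lenDoc)] for y in range(len(features))]
for i, sentence in enumerate(sent_tokenize_list):
    score[0][i] = (lenDoc - i) / lenDoc  # 'position' score for sentence i
print(score)  # [[1.0, 0.667, 0.333], [0, 0, 0]] (values rounded)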
