I want to convert text into a form suitable for natural language processing.
There are 3000+ books in the "TEXT" column, one book per row, so every row holds a very large text.
When I apply the code below, I get the error shown further down.
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(len(dt)):
    review = re.sub('[^a-zA-Z0-9]', ' ', dt['TEXT'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
I am getting the following error:
TypeError Traceback (most recent call last)
<ipython-input-16-47569f8727fa> in <module>
6 corpus = []
7 for i in range(1000,2000):
----> 8 review = re.sub('[^a-zA-Z0-9]', ' ', dt['TEXT'][i])
9 review = review.lower()
10 review = review.split()
~\anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
This means that the 'TEXT' column of your DataFrame contains values that are not strings (missing values, for example, are stored as float NaN).
You can skip those rows instead:
for i in range(len(dt)):
    try:
        review = re.sub('[^a-zA-Z0-9]', ' ', dt['TEXT'][i])
        # the rest of your code ...
    except TypeError:
        pass
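Silently skipping rows hides how many books were dropped. A minimal sketch of an alternative, assuming the non-string values are missing entries (NaN) or numbers: list the offending rows first, then coerce them before cleaning. The small `dt` frame here is made-up sample data standing in for the real one.

```python
import re
import pandas as pd

# Hypothetical sample standing in for the real frame; row 1 is NaN, row 3 is a number.
dt = pd.DataFrame({"TEXT": ["First book, chapter 1.", float("nan"), "Second BOOK!", 42]})

# Boolean mask of rows whose value is not a str (these make re.sub raise TypeError).
not_string = ~dt["TEXT"].map(lambda value: isinstance(value, str))
bad_rows = dt.index[not_string].tolist()
print("Non-string rows:", bad_rows)

# Replace missing values with an empty string and cast everything else to str.
texts = dt["TEXT"].fillna("").astype(str)

# Same substitution as in the question, now safe for every row.
cleaned = [re.sub("[^a-zA-Z0-9]", " ", text).lower().split() for text in texts]
```

Knowing which rows are affected makes it possible to decide whether to drop them, fix the source data, or keep them as empty documents.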
I don't understand this error. I've already converted df to lowercase before turning it into a list. DataFrame:
0 Masuk ke Liang Lahat, Rian D’Masiv Makin Sadar... Infotainment Untuk pertama kalinya, Rian masuk ke liang lah...
1 Alasan PPKM, Kuasa Hukum Vicky Prasetyo Sebut ... Infotainment Andai saja persidangan tetap berjalan seperti ...
...
1573 Jessica Iskandar Syok Tahu Kabar Nia Ramadhani... Infotainment “Banyak wartawan juga nanyain. Itu aku baru ba...
1574 Show 10 Menit BTS dalam Koleksi LV Music & Movie BTS melaksanakan ’’tugas’’ perdananya sebagai ...
Code:
import pandas as pd
import numpy as np
import re
import string
import nltk
def load_data():
    dataset = pd.read_csv("jawapos_entertainment.csv")
    return dataset
news_df = load_data()
news_df.head()
df = pd.DataFrame(news_df[['judul_name','judul_kategori','judul_Headline']])
df
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
factory = StopWordRemoverFactory()
stopwords = factory.create_stop_word_remover()
kalimat = df [['judul_name','judul_Headline']]
kalimat = kalimat.lower()
stop = stopwords.remove(kalimat)
print(stop)
But I have an error in this line:
AttributeError Traceback (most recent call last)
<ipython-input-17-ce52d5ec4fb2> in <module>
4
5 kalimat = df [['judul_name','judul_Headline']]
----> 6 kalimat = kalimat.lower()
7
8 stop = stopwords.remove(kalimat)
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
5467 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'lower'
But why is the program returning a lowercase error if I've already passed the lowercase dataframe before?
You can't call lower() directly on a DataFrame object. You first have to go through the vectorized string accessor that pandas provides for Series and Index objects, pd.Series.str.
Converting the whole DataFrame to lowercase should look like this:
for columns in kalimat.columns:
    kalimat[columns] = kalimat[columns].str.lower()
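The stop word remover in the question has the same limitation: it works on one string at a time, so it also has to be applied cell by cell. A rough sketch, where `remove_stopwords` and its tiny stop word set are hypothetical stand-ins for Sastrawi's `stopwords.remove`, and the two-row frame is made-up data:

```python
import pandas as pd

# Made-up rows standing in for the real news data.
df = pd.DataFrame({
    "judul_name": ["Show 10 Menit BTS dalam Koleksi LV", "Alasan PPKM"],
    "judul_Headline": ["BTS melaksanakan tugas perdananya", None],
})

# Hypothetical stand-in for Sastrawi's remover: takes one string, returns one string.
STOP_WORDS = {"dalam", "yang", "dan"}

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

# .copy() avoids modifying a slice of df; fillna guards against missing cells.
kalimat = df[["judul_name", "judul_Headline"]].copy()
for column in kalimat.columns:
    kalimat[column] = kalimat[column].fillna("").str.lower().apply(remove_stopwords)

print(kalimat)
```

With Sastrawi installed, passing `stopwords.remove` to `apply` in place of `remove_stopwords` should behave the same way, since the question already calls it with a single text argument.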
I'm having trouble with language detection.
The code below raises an exception.
from langdetect import detect
for row in df['Comments']:
    text = str(row)
    language_code = detect(text)
    sentence = [all_languages_codes.get(language_code)]
    df['Language']=sentence[0]
Error Message:
148 ngrams = self._extract_ngrams()
149 if not ngrams:
--> 150 raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
151
152 self.langprob = [0.0] * len(self.langlist)
LangDetectException: No features in text.
How can I print the row that causes the LangDetectException?
It looks like one of your Comments strings is empty:
detect("")
LangDetectException: No features in text.
To know for sure, wrap the loop body in a try/except block and launch a debugger or interactive shell when an exception is raised:
from langdetect import detect
for row in df['Comments']:
    try:
        text = str(row)
        language_code = detect(text)
        sentence = [all_languages_codes.get(language_code)]
        df['Language']=sentence[0]
    except Exception:
        import ipdb; ipdb.set_trace()
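To answer the question without a debugger, the loop can also record which rows fail instead of stopping. A minimal sketch: `detect_all` is a hypothetical helper, and `fake_detect` only imitates `langdetect.detect` so that the example runs without the library.

```python
def detect_all(texts, detect_fn, error_type=Exception):
    """Run detect_fn on every text; return (languages, failed_positions)."""
    languages, failed = [], []
    for position, text in enumerate(texts):
        try:
            languages.append(detect_fn(str(text)))
        except error_type as exc:
            # Report the offending row and keep going.
            print(f"row {position}: {text!r} -> {exc}")
            languages.append(None)
            failed.append(position)
    return languages, failed


# Stand-in for langdetect.detect: raises when the text contains no letters.
def fake_detect(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"


comments = ["Great product", "", "12345", "Would buy again"]
languages, failed = detect_all(comments, fake_detect, ValueError)
```

With the real library the call would presumably be `detect_all(df['Comments'], detect, LangDetectException)`, with `LangDetectException` imported from `langdetect.lang_detect_exception`. The returned list can then be assigned to `df['Language']` in one step, which also avoids overwriting the whole column on every iteration.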
I'm pretty new to Python and I'm currently working on an assignment to implement a movie recommendation system. I have a .csv file that contains various attributes describing each movie. I ask the user for a movie title and then the system returns similar movies.
The dataset is named movie_dataset.csv from this folder on GitHub: https://github.com/codeheroku/Introduction-to-Machine-Learning/tree/master/Building%20a%20Movie%20Recommendation%20Engine
The problem I am encountering is that when I ask the user to enter a movie title, the program only works for certain titles.
The code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#helper functions#
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]
df = pd.read_csv("movie_dataset.csv")
#print (df.columns)
features = ['keywords','cast','genres','director']
for feature in features:
    df[feature] = df[feature].fillna('')
def combine_features(row):
    return row['keywords'] +" "+ row['cast'] +" "+ row['genres'] +" "+ row['director']
df["combine_features"] = df.apply(combine_features, axis=1)
#print (df["combine_features"].head())
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combine_features"])
#MTitle = input("Type in a movie title: ")
cosine_sim = cosine_similarity(count_matrix)
movie_user_likes = 'Avatar'#MTitle
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key= lambda x:x[1], reverse=True)
i = 0
for movie in sorted_similar_movies:
    print (get_title_from_index(movie[0]))
    i=i+1
    if i>10:
        break
When I enter "Batman" the program runs fine. But when I run "Harry Potter" I get:
IndexError Traceback (most recent call last)
<ipython-input-51-687ddb420709> in <module>
30 movie_user_likes = MTitle
31
---> 32 movie_index = get_index_from_title(movie_user_likes)
33
34 similar_movies = list(enumerate(cosine_sim[movie_index]))
<ipython-input-51-687ddb420709> in get_index_from_title(title)
8
9 def get_index_from_title(title):
---> 10 return df[df.title == title]["index"].values[0]
11
12 df = pd.read_csv("movie_dataset.csv")
IndexError: index 0 is out of bounds for axis 0 with size 0
There's simply no entry in the database for the movie "Harry Potter".
You should add some handling for these cases, such as:
def get_index_from_title(title):
    try:
        return df[df.title == title]["index"].values[0]
    except IndexError:
        return None
Then, of course, the calling code has to test whether it got None back from the function and act accordingly.
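One way the calling code could handle the None case, sketched on a made-up three-row frame: report the miss and offer titles that contain the search text. The `suggest_titles` helper is hypothetical and not part of the original program.

```python
import pandas as pd

# Made-up stand-in for movie_dataset.csv with the two columns the helpers use.
df = pd.DataFrame({
    "index": [0, 1, 2],
    "title": ["Avatar", "Batman", "Harry Potter and the Half-Blood Prince"],
})

def get_index_from_title(title):
    matches = df[df.title == title]["index"].values
    # An empty result means the exact title is not in the data.
    return matches[0] if len(matches) else None

def suggest_titles(fragment):
    # Case-insensitive literal substring match on the title column.
    mask = df.title.str.contains(fragment, case=False, regex=False)
    return df.loc[mask, "title"].tolist()

movie_user_likes = "Harry Potter"
movie_index = get_index_from_title(movie_user_likes)
if movie_index is None:
    suggestions = suggest_titles(movie_user_likes)
    print(f"No exact match for {movie_user_likes!r}. Did you mean: {suggestions}?")
else:
    print(f"Found {movie_user_likes!r} at index {movie_index}")
```

The exact-match lookup stays strict, while the substring search gives the user something to pick from when the title they typed is only part of a longer one.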
I get an error when I try to call the calculate_similarity2 function, which is in the DocSim.py file, from my notebook.
The error message is: 'DocSim' object has no attribute 'calculate_similarity2'
Here is the content of my DocSim.py file:
import numpy as np
class DocSim(object):
    def __init__(self, w2v_model , stopwords=[]):
        self.w2v_model = w2v_model
        self.stopwords = stopwords
    def vectorize(self, doc):
        """Identify the vector values for each word in the given document"""
        doc = doc.lower()
        words = [w for w in doc.split(" ") if w not in self.stopwords]
        word_vecs = []
        for word in words:
            try:
                vec = self.w2v_model[word]
                word_vecs.append(vec)
            except KeyError:
                # Ignore, if the word doesn't exist in the vocabulary
                pass
        # Assuming that document vector is the mean of all the word vectors
        # PS: There are other & better ways to do it.
        vector = np.mean(word_vecs, axis=0)
        return vector
    def _cosine_sim(self, vecA, vecB):
        """Find the cosine similarity distance between two vectors."""
        csim = np.dot(vecA, vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
        if np.isnan(np.sum(csim)):
            return 0
        return csim
    def calculate_similarity(self, source_doc, target_docs=[], threshold=0):
        """Calculates & returns similarity scores between given source document & all
        the target documents."""
        if isinstance(target_docs, str):
            target_docs = [target_docs]
        source_vec = self.vectorize(source_doc)
        results = []
        for doc in target_docs:
            target_vec = self.vectorize(doc)
            sim_score = self._cosine_sim(source_vec, target_vec)
            if sim_score > threshold:
                results.append({
                    'score' : sim_score,
                    'sentence' : doc
                })
            # Sort results by score in desc order
            results.sort(key=lambda k : k['score'] , reverse=True)
        return results
        def calculate_similarity2(self, source_doc=[], target_docs=[], threshold=0):
            """Calculates & returns similarity scores between given source document & all the target documents."""
            if isinstance(source_doc, str):
                target_docs = [source_doc]
            if isinstance(target_docs, str):
                target_docs = [target_docs]
            #source_vec = self.vectorize(source_doc)
            results = []
            for doc in source_doc:
                source_vec = self.vectorize(doc)
                for doc1 in target_docs:
                    target_vec = self.vectorize(doc)
                    sim_score = self._cosine_sim(source_vec, target_vec)
                    if sim_score > threshold:
                        results.append({
                            'score' : sim_score,
                            'source sentence' : doc,
                            'target sentence' : doc1
                        })
                    # Sort results by score in desc order
                    results.sort(key=lambda k : k['score'] , reverse=True)
            return results
Here is the notebook code where I create the DocSim object and try to call the function:
ds = DocSim(word2vec_model,stopwords=stopwords)
sim_scores = ds.calculate_similarity2(source_doc, target_docs)
The error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-bb0bd1e0e0ad> in <module>()
----> 1 sim_scores = ds.calculate_similarity2(source_doc, target_docs)
AttributeError: 'DocSim' object has no attribute 'calculate_similarity2'
I don't understand how to resolve this problem.
I can access all the functions except calculate_similarity2.
Can you help me, please?
Thanks
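You have defined the calculate_similarity2 function inside the scope of another method instead of directly in the class body, so DocSim never gets it as an attribute. Move it out so that its def lines up with the other methods.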
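The effect is easy to reproduce in isolation. In this sketch (the class names `Nested` and `Fixed` are made up for the demonstration), a `def` indented inside another method is only a local function of that method, so instances never see it:

```python
class Nested:
    def calculate_similarity(self):
        return "ok"

        # Indented one level too deep: this def belongs to the body of
        # calculate_similarity (after its return), so it is never a class attribute.
        def calculate_similarity2(self):
            return "ok2"


class Fixed:
    def calculate_similarity(self):
        return "ok"

    # Same indentation as the other methods: a regular method of the class.
    def calculate_similarity2(self):
        return "ok2"


print(hasattr(Nested(), "calculate_similarity2"))  # False
print(hasattr(Fixed(), "calculate_similarity2"))   # True
```

If the method still appears to be missing after the indentation is fixed, the notebook may be holding the old version of the module; restarting the kernel or reloading the module with `importlib.reload` picks up the change.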
I have the following code in which I am hitting an error, and I am unable to identify the actual issue. The code loads a .json file that holds words and their meanings, and finds the exact or nearest matches for a word entered by the user, along with their meanings. The code was running fine until I modified it slightly: I wanted to also match words whose first letter is capitalized, and after I changed the following line it started throwing an exception.
Changed line:
if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
Code:
import json
import difflib
def searchWord(word):
    if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
        return data[word]
    else:
        closematch = difflib.get_close_matches(word,data.keys())[0]
        confirmation = (input(f"\nDid you mean: {closematch} (y/n): ")).lower()
        if confirmation == 'y':
            return data[closematch]
        else:
            return 'Word Not Found in Dictionary'
print('Loading Data...\n')
data = json.load(open('data.json'))
print('Data Loaded!\n')
word = (input('Enter word to lookup in dictionary: ')).lower()
meanings = searchWord(word)
if meanings == list:
    for meaning in meanings:
        print("\n"+meaning)
else:
    print(meanings[0])
Error:
Loading Data...
Data Loaded!
Enter word to lookup in dictionary: delhi
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
E:\Learning\Python\AdvancedPython\PythonMegaCourse\DictionaryApp\dictionary.py in <module>()
20 word = (input('Enter word to lookup in dictionary: ')).lower()
21
---> 22 meanings = searchWord(word)
23 if meanings == list:
24 for meaning in meanings:
E:\Learning\Python\AdvancedPython\PythonMegaCourse\DictionaryApp\dictionary.py in searchWord(word)
4 def searchWord(word):
5 if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
----> 6 return data[word]
7 else:
8 closematch = difflib.get_close_matches(word,data.keys())[0]
KeyError: 'delhi'
The .json file has a key named Delhi; however, capitalize() doesn't seem to work.
When you access the word in the dictionary, you are not capitalizing it, so data[word] still looks up the lowercase key.
This is not the cleanest way to handle it, but it gives you the idea:
if (word != "") and (word in data.keys()):
    return data[word]
if (word != "") and (word.capitalize() in data.keys()):
    return data[word.capitalize()]
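A somewhat tidier variant of the same idea is to find the first spelling that exists and return the entry stored under that key. This is only a sketch: `find_key` and `search_word` are hypothetical helpers, and the two-entry `data` dict stands in for data.json.

```python
import difflib

# Tiny stand-in for the contents of data.json.
data = {
    "Delhi": ["The capital of India."],
    "rain": ["Water falling from clouds in drops."],
}

def find_key(word, data):
    """Return the key under which `word` is stored, or None if there is none."""
    if not word:
        return None
    # Try the word as typed, then capitalized, then fully upper-cased.
    for candidate in (word, word.capitalize(), word.upper()):
        if candidate in data:
            return candidate
    return None

def search_word(word, data):
    key = find_key(word, data)
    if key is not None:
        return data[key]
    # get_close_matches may return an empty list, so do not index it blindly.
    close = difflib.get_close_matches(word, data.keys(), n=1)
    if close:
        return f"Did you mean: {close[0]}?"
    return "Word Not Found in Dictionary"

print(search_word("delhi", data))
print(search_word("rainn", data))
```

Separately, the calling code in the question tests `meanings == list`, which compares the result with the `list` type itself and is therefore always False for an actual list of meanings; `isinstance(meanings, list)` is the usual check.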