Why did NLTK NaiveBayes classifier misclassify one record?

Why did NLTK NaiveBayes classifier misclassify one record? - nlp

This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple of a model, but it is just a first step for me and I will try tokenized sentences next time.
The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.
INPUT:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')
def word_feat(word):
return dict([(word,True)])
#NOTE THAT THE FUNCTION 'word_feat(word)' I WROTE HERE IS DIFFERENT FROM THE 'word_feat(words)' FUNCTION I DEFINED EARLIER. THIS FUNCTION IS USED TO ITERATE OVER EACH OF THE THREE ELEMENTS IN THE LIST ['awesome movie', ' i like it', ' it is so bad'].
for word in words:
classResult = classifier.classify(word_feat(word))
if classResult == 'neg':
neg = neg + 1
if classResult == 'pos':
pos = pos + 1
print(str(word) + ' is ' + str(classResult))
print()
OUTPUT:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentences instead of each word or letter, I did some diagnostic codes to see what is each element in 'word_feat(word)':
for word in words:
print(word_feat(word))
And it printed out:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So it seems like the function 'word_feat(word)' is correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.

This particular failure is because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.
You're probably in this predicament because you pay no attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contain what their name claims. To understand and improve your program, start by naming things properly.
Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.

Here is the modified code for you
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.') # these are actually list of sentences
for sent in sentences:
if sent != "":
words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
classResult = classifier.classify(word_feats(words))
if classResult == 'neg':
neg = neg + 1
if classResult == 'pos':
pos = pos + 1
print(str(sent) + ' --> ' + str(classResult))
print
I modified where you are considering 'list of words' as an input to your classifier. But Actually you need to pass sentence one by one, which means you need to pass 'list of sentences'
Also, for each sentence, you need to pass 'words as features', which means you need to split the sentence on white-space character.
Also, if you want your classifier to work properly for sentiment analysis, you need to give less preference to "stop-words" like "it, they, is etc". As these words are not sufficient to decide if the sentence is positive, negative or neutral.
The above code gives below output
awesome movie --> pos
i like it --> pos
it is so bad --> neg
So for any classifier, the input format for training classifier and predicting classifier should be same. While training you are providing list of words, try to use the same method to convert your test set as well.

Let me show a rewriting of your code. All I changed near the top was adding import re, as it is easier to tokenize with regexes. Everything else up to defining classifier is the same as your code.
I added one more test case (something really, really negative), but more importantly I used proper variable names - then it is much harder to get confused about what is going on:
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')
So sentences now contains 4 strings, each a single sentence.
I left your word_feat() function unchanged.
For using the classifier I did quite a big rewrite:
for sentence in sentences:
if(len(sentence) == 0):continue
neg = 0
pos = 0
for word in re.findall(r"[\w']+", sentence):
classResult = classifier.classify(word_feat(word))
print(word, classResult)
if classResult == 'neg':
neg = neg + 1
if classResult == 'pos':
pos = pos + 1
print("\n%s: %d vs -%d\n"%(sentence,pos,neg))
The outer loop is again descriptive, so that sentence contains one sentence.
I then have an inner loop where we classify each word in the sentence; I am using a regex to split the sentence up on whitespace and punctuation marks:
for word in re.findall(r"[\w']+", sentence):
classResult = classifier.classify(word_feat(word))
The rest is just basic adding up and reporting. I get this output:
awesome pos
movie neu
awesome movie: 1 vs -0
i pos
like pos
it pos
i like it: 3 vs -0
it pos
is neu
so pos
bad neg
it is so bad: 2 vs -1
i pos
hate neg
this pos
terrible neg
useless neg
movie neu
i hate this terrible useless movie: 2 vs -3
I still get the same as you - "it is so bad" is considered positive. And with the extra debug lines we can see it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it is positive.
I suspect this is because it hadn't seen those words in its training data.
...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
As next things to try, I'd suggest:
Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
Look into removing stop words.

You can try this code
from nltk.classify import NaiveBayesClassifier
def word_feats(words):
return dict([(word, True) for word in words])
positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
classResult = classifier.classify( word_feats(word))
if classResult == 'neg':
neg = neg + 1
if classResult == 'pos':
pos = pos + 1
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
results are:
Positive: 0.7142857142857143
Negative: 0.14285714285714285

Related

How to efficient batch-process in huggingface?

I am using Huggingface library and transformers to find whether a sentence is well-formed or not. I am using a masked language model called XLMR. I first tokenize my sentence, and then mask each word of the sentence one by one, and then process the masked sentences and find the probability that the predicted masked word is right.
def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
k = 0
dic = {}
ls = tokenizer.batch_encode_plus(sent)
input_list = ls.input_ids
h=0
with torch.no_grad():
for i in tqdm(range(len(input_list))):
item = input_list[i]
real_input = item
attmask = [1]*len(item)
seg = [0]*len(item)
seglist = [seg]
masked_list = [real_input]
attlist = [attmask]
for j in range(1, len(item)-1):
input = copy.deepcopy(real_input)
input[j] = 50264
masked_list.append(input)
attlist.append(attmask)
seglist.append(seg)
inid = torch.tensor(masked_list)
segtensor = torch.tensor(seglist)
atttensor = torch.tensor(attlist)
inid=inid.to(device)
segtensor=segtensor.to(device)
output = model(inid, segtensor)
predictions_logits = output.logits
predictions = torch.softmax(predictions_logits, dim=2)
ppscore = 0
for j in range(1, len(item)-1):
ppscore = ppscore+math.log(predictions[j, j, item[j]], 2)
try:
score = math.pow(2, (-1/(len(item)-2))*ppscore)
dic[sent[i]] = score
except:
print(sent[i])
dic[sent[i]] = 10000000
# dic[sent[i]]=10000000
return dic
I will explain my code quickly. The function calculate_scores has sent as an input which is a list of sentences. I first batch encode this list of sentences. And then for each encoded sentence that I get, I generate masked sentences where only one word is masked and the rest are un-masked. Then I input these generated sentences to output and get the probability. Then I compute perplexity.
But the way I'm using this is not a very good way of utilizing GPU. I want to process multiple sentences at once but at the same time, I also need to find the perplexity scores for each sentence. How would I go about doing this?

Current Tensorflow Program creates a Dataset for NLP using Cornell's Movie Dialog Data. How can I reprogram it to support a different Dataset?

This program essentially does many things. It downloads the Cornell Dialog Data Zip file and extracts it. It then reaches in and begins to load questions and answers from the datasets to a list of strings. It loads it into my training program and begins learning from the data set.
import os
import re
import tensorflow as tf
import tensorflow_datasets as tfds
# Maximum number of samples to preprocess
MAX_SAMPLES = 50000
# Maximum sentence length
MAX_LENGTH = 40
def create_datasets():
# 1) download the datasets and unzip it.
path_to_zip = tf.keras.utils.get_file('cornell_movie_dialogs.zip',
origin='http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip',
extract=True)
# get the unzipped files' path.
path_to_dataset = os.path.join(os.path.dirname(path_to_zip), "cornell movie-dialogs corpus")
path_to_movie_lines = os.path.join(path_to_dataset, 'movie_lines.txt')
path_to_movie_conversations = os.path.join(path_to_dataset, 'movie_conversations.txt')
# 2) loads questions and answers from the datasets into a list of strings. (I think this is where it really matters)
questions, answers = load_conversations(path_to_movie_lines, path_to_movie_conversations)
# 3) Build tokenizer using tfds for both questions and answers
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(questions + answers, target_vocab_size=2 ** 13)
# Define start and end token to indicate the start and end of a sentence
start_token, end_token = [tokenizer.vocab_size], [tokenizer.vocab_size + 1]
# Vocabulary size plus start and end token
VOCAB_SIZE = tokenizer.vocab_size + 2
# tokenize the datasets.
questions, answers = tokenize_and_filter(questions, answers, start_token, end_token, tokenizer)
print('Vocab size: {}'.format(VOCAB_SIZE))
print('Number of samples: {}'.format(len(questions)))
# 4) convert the datasets to tf.data.Dataset format
# decoder inputs use the previous target as input
# remove START_TOKEN from targets
dataset = tf.data.Dataset.from_tensor_slices((
{
'inputs': questions,
'dec_inputs': answers[:, :-1]
},
{
'outputs': answers[:, 1:]
},
))
return dataset, tokenizer
def preprocess_sentence(sentence):
sentence = sentence.lower().strip()
# creating a space between a word and the punctuation following it
# eg: "he is a boy." => "he is a boy ."
sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
sentence = re.sub(r'[" "]+', " ", sentence)
# replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
sentence = sentence.strip()
# adding a start and an end token to the sentence
return sentence
def load_conversations(path_to_movie_lines, path_to_movie_conversations):
# dictionary of line id to text
id2line = {}
with open(path_to_movie_lines, errors='ignore') as file:
lines = file.readlines()
for line in lines:
parts = line.replace('\n', '').split(' +++$+++ ')
id2line[parts[0]] = parts[4]
inputs, outputs = [], []
with open(path_to_movie_conversations, 'r') as file:
lines = file.readlines()
for line in lines:
parts = line.replace('\n', '').split(' +++$+++ ')
# get conversation in a list of line ID
conversation = [line[1:-1] for line in parts[3][1:-1].split(', ')]
for i in range(len(conversation) - 1):
inputs.append(preprocess_sentence(id2line[conversation[i]]))
outputs.append(preprocess_sentence(id2line[conversation[i + 1]]))
if len(inputs) >= MAX_SAMPLES:
return inputs, outputs
return inputs, outputs
# Tokenize, filter and pad sentences
def tokenize_and_filter(inputs, outputs, start_token, end_token, tokenizer):
tokenized_inputs, tokenized_outputs = [], []
for (sentence1, sentence2) in zip(inputs, outputs):
# tokenize sentence
sentence1 = start_token + tokenizer.encode(sentence1) + end_token
sentence2 = start_token + tokenizer.encode(sentence2) + end_token
# check tokenized sentence max length
if len(sentence1) <= MAX_LENGTH and len(sentence2) <= MAX_LENGTH:
tokenized_inputs.append(sentence1)
tokenized_outputs.append(sentence2)
# pad tokenized sentences
tokenized_inputs = tf.keras.preprocessing.sequence.pad_sequences(
tokenized_inputs, maxlen=MAX_LENGTH, padding='post')
tokenized_outputs = tf.keras.preprocessing.sequence.pad_sequences(
tokenized_outputs, maxlen=MAX_LENGTH, padding='post')
return tokenized_inputs, tokenized_outputs
if __name__ == "__main__":
tf.executing_eagerly()
create_datasets()
The problem is that I want to use a different data set but it doesn't seem to like it when I overwrite the txt files inside with other data as it isn't in a similar structure as the Cornell Data files. Here is an example of how these are structured...
L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
Other data sets are structured with different annotations and this program doesn't like it when I bring those data txt files instead of the Cornell structured ones. Any advice on what I could do to change the program to support new kinds of data?

How to simplify text comparison for big data-set where text meaning is same but not exact - deduplicate text data

I have text data set (different menu items like chocolate, cake, coke etc) of around 1.8 million records which belongs to 6 different categories (category A, B, C, D, E, F). one of the category has around 700k records. Most of the menu items are mixed up in multiple categories to which they doesn't belong to, for example: cake belongs to category 'A' but it is found in category 'B' & 'C' as well.
I want to identify those misclassified items and report to a personnel but the challenge is the item name is not always correct because it is totally human typed text. For example: Chocolate might be updated as hot chclt, sweet choklate, chocolat etc. There can also be items like chocolate cake ;)
so to handle this, I tried a simple method using cosine similarity to compare category-wise and identify those anomalies but it takes alot of time since I am comparing each items to 1.8 million records (Sample code is as shown below). Can anyone suggest a better way to deal with this problem?
#Function
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def cos_similarity(a,b):
X =a
Y =b
# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)
# sw contains the list of stopwords
sw = stopwords.words('english')
l1 =[];l2 =[]
# remove stop words from the string
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
if w in X_set: l1.append(1) # create a vector
else: l1.append(0)
if w in Y_set: l2.append(1)
else: l2.append(0)
c = 0
# cosine formula
for i in range(len(rvector)):
c+= l1[i]*l2[i]
if float((sum(l1)*sum(l2))**0.5)>0:
cosine = c / float((sum(l1)*sum(l2))**0.5)
else:
cosine = 0
return cosine
#Base code
cos_sim_list = []
for i in category_B.index:
ln_cosdegree = 0
ln_degsem = []
for j in category_A.index:
ln_j = str(category_A['item_name'][j])
ln_i = str(category_B['item_name'][i])
degreeOfSimilarity = cos_similarity(ln_j,ln_i)
if degreeOfSimilarity>0.5:
cos_sim_list.append([ln_j,ln_i,degreeOfSimilarity])
Consider text is already cleaned

I used KNeighbor and cosine similarity to solve this case. Though I am running the code multiple times to compare category by category; still it is effective because of lesser number of categories. Please suggest me if any better solution is available
cat_A_clean = category_A['item_name'].unique()
print('Vecorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(cat_A_clean)
print('Vecorizing completed...')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_B = set(category_B['item_name'].values)
def getNearestN(query):
queryTFIDF_ = vectorizer.transform(query)
distances, indices = nbrs.kneighbors(queryTFIDF_)
return distances, indices
import time
t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_B)
t = time.time()-t1
print("COMPLETED IN:", t)
unique_B = list(unique_B)
print('finding matches...')
matches = []
for i,j in enumerate(indices):
temp = [round(distances[i][0],2), cat_A_clean['item_name'].values[j],unique_B[i]]
matches.append(temp)
print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)','ITEM_A','ITEM_B'])
print('Done')
def clean_string(text):
text = str(text)
text = text.lower()
return(text)
def cosine_sim_vectors(vec1,vec2):
vec1 = vec1.reshape(1,-1)
vec2 = vec2.reshape(1,-1)
return cosine_similarity(vec1,vec2)[0][0]
def cos_similarity(sentences):
cleaned = list(map(clean_string,sentences))
print(cleaned)
vectorizer = CountVectorizer().fit_transform(cleaned)
vectors = vectorizer.toarray()
print(vectors)
return(cosine_sim_vectors(vectors[0],vectors[1]))
cos_sim_list =[]
for ind in matches.index:
a = matches['Match confidence (lower is better)'][ind]
b = matches['ITEM_A'][ind]
c = matches['ITEM_B'][ind]
degreeOfSimilarity = cos_similarity([b,c])
cos_sim_list.append([a,b,c,degreeOfSimilarity])

How to remove key error from the program?

In the image you can see that i have ID still getting key error I am trying to do a recommendation algorithm so i got this error
#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
KeyError: <built-in function id>
I am sharing link of article https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") #you can plug in your own list of products or movies or books here as csv file#
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
#ngram explanation begins#
#ngram (1,3) can be explained as follows#
#ngram(1,3) encompasses uni gram, bi gram and tri gram
#consider the sentence "The ball fell"
#ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
#ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID :
(Score,item_id))#
for idx, row in ds.iterrows(): #iterates through all the rows
# the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and 1 means perfect similarity#
similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
#stores 5 most similar books, you can change it as per your needs
similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
results[row['ID']] = similar_items[1:]
#below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe, then we convert it to a list#
def item(id):
return ds.loc[ds['ID'] == id]['Book Title'].tolist()[0]
def recommend(id, num):
if (num == 0):
print("Unable to recommend any book as you have not chosen the number of book to be
recommended")
elif (num==1):
print("Recommending " + str(num) + " book similar to " + item(id))
else :
print("Recommending " + str(num) + " books similar to " + item(id))
print("----------------------------------------------------------")
recs = results[id][:num]
for rec in recs:
print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
recommend(5,2)
i have try and run successfully till results variable then getting error.

because python default id keyword is called when you call "def item(id):"
instead of id you have to declare another identifier....then i think this is the only reason for keyerror..

As the error suggests id is an build-in function in python-3. So if you change the name of the parameters id in def item(id) and def recommend(id, num) and all their references then the code should work.
After changing the id and correcting the indentation, an example could look like this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") # you can plug in your own list of products or movies or books here as csv file
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# ngram explanation begins#
# ngram (1,3) can be explained as follows#
# ngram(1,3) encompasses uni gram, bi gram and tri gram
# consider the sentence "The ball fell"
# ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
# ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID : (Score,item_id))
for idx, row in ds.iterrows(): # iterates through all the rows
# the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
# order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and
# 1 means perfect similarity
similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
# stores 5 most similar books, you can change it as per your needs
similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
results[row['ID']] = similar_items[1:]
# below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe,
# then we convert it to a list#
def item(ID):
return ds.loc[ds['ID'] == ID]['Book Title'].tolist()[0]
def recommend(ID, num):
if num == 0:
print("Unable to recommend any book as you have not chosen the number of book to be recommended")
elif num == 1:
print("Recommending " + str(num) + " book similar to " + item(ID))
else:
print("Recommending " + str(num) + " books similar to " + item(ID))
print("----------------------------------------------------------")
recs = results[ID][:num]
for rec in recs:
print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
# the first argument in the below function to be passed is the id of the book, second argument is the number of books
# you want to be recommended
recommend(5, 2)

Cosine similarity in python3

I have two lists
I have used word to vector and cosine similarity to find the similar words based on cosine value between two vectors.
I have already defined word to vector function and cosine similarities so I did not mention that here.
tar1 = ['apple','fruit', 'vegetable','school']
tar2 = ['fruit', 'apple', 'school','vegetable']
i=0
j=0
for i in range (len(tar1)):
vect1 = text_to_vector(tar1[i].strip().lower())
for j in range(len(keyword)):
vect2 = text_to_vector(tar2[j].strip().lower())
cosine = get_cosine(vect1, vect2)
j = j+1
i = i+1
In the nested loop I want to pick out the string which has maximum cosine similarity value after the inner loops runs.
Eg:
First item in tar1 is 'apple'
which has high cosine similarity of 'apple' in tar2. so based on high cosine similarity. it has to pick out the word
I am looking for the output like below.
o/p = ['apple','fruit', 'vegetable','school']

Possible implementation to get what you want (with comments):
def text_to_vector(text):
return text
def get_cosine(x, y):
return 1 if x == y else 0
tar1 = ['apple', 'fruit', 'vegetable', 'school']
tar2 = ['fruit', 'apple', 'school', 'vegetable']
result = list()
# iterate over words in tar1
for dummy_idx_1, vector_1 in enumerate(text_to_vector(word) for word in tar1):
# keep track of the maximum cosine and most similar word
max_cosine, best_word = -1, None
# iterate over words in tar2 for every word in tar1
for idx_2, vector_2 in enumerate(text_to_vector(word) for word in tar2):
# compute cosine
cosine = get_cosine(vector_1, vector_2)
# check if current word from tar2 is the most similar to the word from tar1
if cosine > max_cosine:
max_cosine, best_word = cosine, tar2[idx_2]
# remember result for every word from tar1
result.append(best_word)
print(result)
Output is:
['apple', 'fruit', 'vegetable', 'school']

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Why did NLTK NaiveBayes classifier misclassify one record? - nlp

Related

How to efficient batch-process in huggingface?

Current Tensorflow Program creates a Dataset for NLP using Cornell's Movie Dialog Data. How can I reprogram it to support a different Dataset?

How to simplify text comparison for big data-set where text meaning is same but not exact - deduplicate text data

How to remove key error from the program?

Cosine similarity in python3

Categories

Resources