Is this text training with skip-gram correct? - python-3.x

I am still a beginner with neural networks and NLP.
In this code I'm training cleaned text (some tweets) with skip-gram.
But I do not know if I do it correctly.
Can anyone inform me about the correctness of this skip-gram text training?
Any help is appreciated.
This my code :
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser
sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]
from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5,
size = 300,
sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences #to reduce memory usage
def get_mat(model, corpus, size):
vecs = np.zeros((len(corpus), size))
n = 0
for i in corpus.index:
vecs[i] = np.zeros(size).reshape((1, size))
for word in str(corpus.iloc[i,0]).split():
try:
vecs[i] += model[word]
#n += 1
except KeyError:
continue
return vecs
X_sg = get_vectors(w2v_model, X, 300)
del X
X_sg=pd.DataFrame(X_sg)
X_sg.head()
from sklearn import preprocessing
scale = preprocessing.normalize
X_sg=scale(X_sg)
for i in range(len(X_sg)):
X_sg[i]+=1 #I did this because some weights where negative! So could not
#apply LSTM on them later

You haven't mentioned if you've received any errors, or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases-step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory-usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice, a majority of all words/bigrams each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000 – but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. By using just 50, it's not really going to learn anything useful about just whatever 50 words/bigrams survive the many prunings.
Regarding your get_mat() function & later DataFrame code, i have no idea what you're trying to do with it, so can't offer any opinion on it.

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment sentence that goes along an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")
def find_sentiment(doc):
# find roots of all entities in the text
ner_heads = {ent.root.idx: ent for ent in doc.ents}
rule3_pairs = []
for token in doc:
children = token.children
A = "999999"
M = "999999"
add_neg_pfx = False
for child in children:
if(child.dep_ in ["nsubj"] and not child.is_stop): # nsubj is nominal subject
if child.idx in ner_heads:
A = ner_heads[child.idx].text
else:
A = child.text
if(child.dep_ in ["acomp", "advcl"] and not child.is_stop): # acomp is adjectival complement
M = child.text
# example - 'this could have been better' -> (this, not better)
if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
neg_prefix = "not"
add_neg_pfx = True
if(child.dep_ == "neg"): # neg is negation
neg_prefix = child.text
add_neg_pfx = True
# print(child, child.dep_)
if (add_neg_pfx and M != "999999"):
M = neg_prefix + " " + M
if(A != "999999" and M != "999999"):
rule3_pairs.append((A, M))
return rule3_pairs
print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this is too little of the content of the text
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recomend trying?
I triedthe code given above. Also a another library of stanza which only got similar results.
Unfortunately, if your task is to extract all expressive words from the text (all the words that contain sentimental significance), then it is not possible with the current state of affairs. Language is highly variable, and the same word could change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is not as obvious, not even speaking about edge cases when seemingly positive "incredible" may reverse its sentiment if used as empowerment: "incredibly stupid" should be classified as very negative, but machines can normally only output two opposite labels for those words.
This is why for purposes of sentimental analysis, the only reliable way is building machine learning model that will classify texts entirely, which means you should adapt your software to accept the final verdict and process it in some way or another.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier algorithm (that, among other things, not only classifies sentiment) that is implemented in NLTK:
from nltk import NaiveBayesClassifier, classify
#The training data is a two-dimensional list of words to classify.
train_data = dataset[:7000]
test_data = dataset[7000:]
#Train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)
#To get accuracy, use classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a list of words. It's preferable to remove any words that do not play sentimental significance such as the stop words and punctuation so that it wouldn't disturb our model:
from nltk.corpus import stopwords
from nltk.tokenise import word_tokenise
def clearLexemes(words):
return [word if word not in stopwords.word("english")
or "!?<>:;.&*%^" in word for word in words]
text = "What a terrible day!"
tokens = clearLexemes(word_tokenise(text))
print("Text sentiment is " + str(classifier.classify(dict([token, True] for token in tokens)))))
The output will be the sentiment of the text.
The important notes:
requires a minimum parameters to train and trains relatively fast;
is highly efficient for working with natural languages (is also used for gender identification and named entity recognition);
is unlikely to properly classify edge cases when words shift their sentiment in creatively-styled or rare utterances. For example, "Sweetheart, I wish ll of your fears would come true and you will be happy to live in such world!" This sentence is negative and uses irony to mask negative attribute through positive expressions, and the model may not be able to detect this.
Linear Regression
Another related method is to use linear regression algorithms from your favourite machine learning framework. In this notebook I used the Amazon food review dataset
to measure how fast model accuracy increases as you feed it with more and more data. The data you need to feed the model is the raw text and its score label (that in your case could be sentiment).
import numpy as np #For converting strings to text
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
#Preparing the data
ys: pd.DataFrame = reviews.head(170536) #30% of the dataframe is test data
xs: pd.DataFrame = reviews[170537:] #70% of the dataframe is training data
#Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])
#Measuring accuracy:
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(predictions, ys["Score"],
target_names = labels, output_dict=True)
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
Conclusion
Investigating sentimental analysis is a worthwhile area of academic and industrial research that completely relies on machine learning and is bound to its limitations. It is a powerful topic that should be covered in the classical NLP suite. Unfortunately, currently understanding meaning close enough to be able to extract situational meaning is a feat close to inventing Artificial General Intelligence, however technology rapidly grows in that direction.

Why do genetic algorithms converge to end up with a population that is identical?

I was implementing a genetic algorithm with tf keras, where i manualy modify the weight, make the gene cross over, all that. Ive found that after a few docen generations, the predictions of all the network are essentialy identical, and after a few more generations the predictions are exactly the same. trying to google the problem i found this page
that mentions the problem in a conceptual level but i cant understand how this would happen if im manualy creating genetic diverity every generation.
def model_mutate(weights,var):
for i in range(len(weights)):
for j in range(len(weights[i])):
if( random.uniform(0,1) < 0.2): #learing rate of 15%
change = np.random.uniform(-var,var,weights[i][j].shape)
weights[i][j] += change
return weights
def crossover_brains(parent1, parent2):
global brains
weight1 = parent1.get_weights()
weight2 = parent2.get_weights()
new_weight1 = weight1
new_weight2 = weight2
gene = random.randint(0,len(new_weight1)-1) #we change a random weight
#or set of weights
new_weight1[gene] = weight2[gene]
new_weight2[gene] = weight1[gene]
q=np.asarray([new_weight1,new_weight2],dtype=object)
return q
def evolve(best_fit1,best_fit2):
global generation
global best_brain
global best_brain2
mutations=[]
for i in range(total_brains//2):
cross_weights=model_crossover(best_fit1,best_fit2)
mutation1=model_mutate(cross_weights[0],0.5)
mutation2=model_mutate(cross_weights[1],0.5)
mutations.append(mutation1)
mutations.append(mutation2)
for i in range(total_brains):
brains[i].set_weights(mutations[i])
generation+=1
def find_best_fit():
fitness=np.loadtxt("fitness.txt")
print(f"fitness average {np.mean(fitness)} in generation {generation}")
print(f"fitness max is {np.max(fitness)} in generation {generation} ")
fitness_t.append(np.mean(fitness))
maxfit1=np.max(fitness)
best_fit1=np.where(fitness==maxfit1)[0]
fitness[best_fit1]=0
maxfit2=np.max(fitness)
best_fit2=np.where(fitness==maxfit2)[0]
if len(best_fit1)>1: #this is a band_aid for when several indiviuals are the same
# this would lead to best_fit(1,2) being an array of indeces
best_fit1=best_fit1[0]
if len(best_fit2)>1:
best_fit2=best_fit2[0]
return int(best_fit1),int(best_fit2)
bf1,bf2=find_best_fit()
evolve(bf1,bf2)
This is the code im using to set the modified weights to the existing keras models (mostly not mine, i dont understand it enough to have created this myself)
if keras is working how i think its working, then i dont see how this would converge to anything that does not maximize fitness, further more, it seems to be decreasing over time.

How to Fasten Knn Algorithm for face recognition in real time

I am doing my work on face detection and recognition, where I want to detect the faces in real time,
but when coming to the point of training it is taking very long time to train the
data is it possible to reduce the timing of training the data can any one help
me out with this problem
'''
def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):
X = []
y = []
# Loop through each person in the training set
for class_dir in tqdm(os.listdir(train_dir)):
if not os.path.isdir(os.path.join(train_dir, class_dir)):
continue
# Loop through each training image for the current person
for img_path in image_files_in_folder(os.path.join(train_dir, class_dir)):
image = face_recognition.load_image_file(img_path)
face_bounding_boxes = face_recognition.face_locations(image)
if len(face_bounding_boxes) != 1:
# If there are no people (or too many people) in a training image, skip the image.
if verbose:
print("Image {} not suitable for training: {}".format(img_path, "Didn't find a face" if len(face_bounding_boxes) < 1 else "Found more than one face"))
else:
# Add face encoding for current image to the training set
X.append(face_recognition.face_encodings(image, known_face_locations=face_bounding_boxes)[0])
y.append(class_dir.split('_')[0])
# Determine how many neighbors to use for weighting in the KNN classifier
if n_neighbors is None:
n_neighbors = int(round(math.sqrt(len(X))))
if verbose:
print("Chose n_neighbors automatically:", n_neighbors)
# Create and train the KNN classifier
knn_clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, algorithm=knn_algo, weights='distance')
print(knn_clf)
knn_clf.fit(X, y)
# Save the trained KNN classifier
if model_save_path is not None:
with open(model_save_path, 'wb') as f:
pickle.dump(knn_clf, f)
return knn_clf
'''
this the final call
'''
def trainer():
# STEP 1: Train the KNN classifier and save it to disk
# Once the model is trained and saved, you can skip this step next time.
print("Training KNN classifier...")
classifier = train("app/facerec/dataset", model_save_path="app/facerec/models/trained_model.clf", n_neighbors=3)
print("Training complete!")
'''
also wants to know is there any possibility instead of rewriting the 'trained_model.clf' file can we update the file instead.
Training kNN model shouldn't impose high runtime overhead. After all, the straightforward ("exact search") model is lazy. It stores the vectors and performs brute-force search at query (or classification) time.
I speculate the embedding computations dominate your training time.
As mentioned by #johncasey, you might want to use approximated-kNN models (or similarity search engines). There are many open-source similarity search libraries. Yet, if you need a production-ready, robust, real-time, efficient solution, then you should check out pinecone.io. (Disclaimer, I work for Pinecone.)
k-nn algorithm has a O(n) time complexity. I recommend you to use approximate nearest neighbor (a-nn) algorithm. Its time complexity is too low. For example, Google image search is based on this algorithm.
Spotify annoy, Facebook faiss, nmslib are a-nn libraries.

Word vocabulary generated by Word2vec and Glove models are different for the same corpus

I'm using CONLL2003 dataset to generate word embeddings using Word2vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is different(much lesser) than glove.dictionary.
Here is the code:
Word2Vec:
word2vecmodel = Word2Vec(result ,size= 100, window =5, sg = 1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example:
result[1:5] =
['Peter', 'Blackburn'],
['BRUSSELS', '1996-08-22'],
['The',
'European',
'Commission',
'said',
'Thursday',
'disagreed',
'German',
'advice',
'consumers',
'shun',
'British',
'lamb',
'scientists',
'determine',
'whether',
'mad',
'cow',
'disease',
'transmitted',
'sheep',
'.'],
['Germany',
"'s",
'representative',
'European',
'Union',
"'s",
'veterinary',
'committee',
'Werner',
'Zwingmann',
'said',
'Wednesday',
'consumers',
'buy',
'sheepmeat',
'countries',
'Britain',
'scientific',
'advice',
'clearer',
'.']]
There are totally 13517 sentences in the result list.
Can someone please explain why the list of words for which the embeddings are created are drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage is different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncractic compared to what a larger sampling would show, and can't be subtly balanced against many other word representations in the manner that's best.
But further, given that in typical word-distributions, there are many such low-frequency words, altogether they also tend to make the word-vectors for other more-frequent qords worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other more-improtant words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors-for those words, obtaining more data so that those words are no longer rare is the best approach.
You can also use a lower min_count, including as low as min_count=1 to retain all words. But often discarding such rare words is better for whatever end-purpose for which the word-vectors will be used.

How to find the score for sentence Similarity using Word2Vec

I am new to NLP, how to find the similarity between 2 sentences and also how to print scores of each word. And also how to implement the gensim word2Vec model.
Try this code:
here my two sentences :
sentence1="I am going to India"
sentence2=" I am going to Bharat"
from gensim.models import word2vec
import numpy as np
words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
#The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
count += 1
sentence1_meaning /= count
sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
count += 1
sentence2_meaning /= count
#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))
You can train the model and use the similarity function to get the cosine similarity between two words.
Here's a simple demo:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
model = Word2Vec(common_texts,
size = 500,
window = 5,
min_count = 1,
workers = 4)
word_vectors = model.wv
word_vectors.similarity('computer', 'computer')
The output will be 1.0, of course, which indicates 100% similarity.
After your from gensim.models import word2vec, word2vec is a Python module – not a function that you can call as word2vec(words1[0]) or word2vec(w).
So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.
As #david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:
https://radimrehurek.com/gensim/models/word2vec.html
The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
...however, it's really best to run as a local notebook, so you can step through the execution cell-by-cell, and try different variants yourself, perhaps even adapting it to use your data instead.
When you reach that level, note that:
these models require far more than just a few sentences as training - so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts; (b) a model trained from a compatible corpus, which you then apply to your out-of-corpus sentences.
using the average of all the word-vectors in a sentence is just one relatively-simple way to make a vector for a longer text; there are many other more-sophisticated ways. One alternative very similar to Word2Vec is the 'Paragraph Vector' algorithm also available in gensim as the class Doc2Vec.

Resources