determine most similar phrase using word2vec - python-3.x

I am trying to create a model that determines the most similar sentence to another sentence using word2vec.
The idea is: to find the most similar sentence, I create an average vector of the words that compose the source sentence.
Then I should be able to predict the most similar sentence using the word embeddings.
My question is: how can I determine the most similar target sentence after creating the average vector of the source sentence?
Here is the code:
import gensim
from gensim import utils
import numpy as np
import sys
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk import download
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

download('punkt')      # tokenizer, run once
download('stopwords')  # stopwords dictionary, run once
stop_words = stopwords.words('english')

def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()]  # restrict string to alphabetic characters only
    return doc
############ doc content -> num label -> string label
# note to self: texts[XXXX] -> y[XXXX] = ZZZ -> ng20.target_names[ZZZ]

# Fetch the 20 newsgroups dataset
ng20 = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))

# text and ground truth labels
texts, y = ng20.data, ng20.target
corpus = [preprocess(text) for text in texts]

def filter_docs(corpus, texts, labels, condition_on_doc):
    """
    Filter corpus, texts and labels given the function condition_on_doc
    which takes a doc. The document doc is kept if condition_on_doc(doc) is true.
    """
    number_of_docs = len(corpus)
    print(number_of_docs)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]
    labels = [i for (i, doc) in zip(labels, corpus) if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)]
    print("{} docs removed".format(number_of_docs - len(corpus)))
    return (corpus, texts, labels)

corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: (len(doc) != 0))

def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    # print("doc:")
    # print(doc)
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)

def has_vector_representation(word2vec_model, doc):
    """Check if at least one word of the document is in the
    word2vec dictionary."""
    return not all(word not in word2vec_model.vocab for word in doc)

corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: has_vector_representation(model, doc))

x = []
for doc in corpus:  # look up each doc in model
    x.append(document_vector(model, doc))

X = np.array(x)  # list to array

model.most_similar(positive=X, topn=1)

Just use the cosine distance; it's implemented in scipy (scipy.spatial.distance.cosine).
For better efficiency, you can implement it yourself and precompute the norms of the vectors in X:

X_norm = np.expand_dims(np.linalg.norm(X, axis=1), 0)

Calling np.expand_dims ensures that the dimensions broadcast correctly. Then, for a batch of query vectors Y, you can get the most similar vectors:

def get_most_similar_in_X(Y):
    Y_norm = np.expand_dims(np.linalg.norm(Y, axis=1), 1)
    similarities = np.dot(Y, X.T) / Y_norm / X_norm
    return np.argmax(similarities, axis=1)

This gives you the indices of the vectors in X that are most similar to the vectors in Y.
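For example, a minimal usage sketch, assuming the functions defined in the question and an arbitrary query sentence (the string below is made up for illustration):

query = "computer graphics card drivers"                   # hypothetical query sentence
query_vec = document_vector(model, preprocess(query))      # averaged vector, shape (300,)
idx = get_most_similar_in_X(query_vec.reshape(1, -1))[0]   # index of the nearest document vector in X
print(texts[idx])                                           # the most similar original text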

Related

Text Classification on a custom dataset with spacy v3

I am really struggling to make things work with the new spaCy v3 version. The documentation is extensive, but I am still trying to get a training loop running in a script.
(I am also not able to perform text classification training with the CLI approach.)
The data are publicly available here.
import pandas as pd
import spacy
from spacy.training import Example
import random

TRAIN_DATA = pd.read_json('data.jsonl', lines=True)

nlp = spacy.load('en_core_web_sm')
config = {
    "threshold": 0.5,
}
textcat = nlp.add_pipe("textcat", config=config, last=True)

label = TRAIN_DATA['label'].unique()
for label in label:
    textcat.add_label(str(label))

nlp = spacy.blank("en")
nlp.begin_training()

# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    losses = {}
    TRAIN_DATA = TRAIN_DATA.sample(frac=1)
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        # uses an example object rather than text/annotation tuple
        print(texts)
        print(annotations)
        examples = [Example.from_dict(a)]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
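For reference, a minimal sketch of how a v3 textcat training loop is usually wired up, assuming made-up binary labels and in-memory example data; Example.from_dict takes both a Doc and its annotation dict, and the same pipeline that owns the textcat component is the one you initialize and update:

import spacy
import random
from spacy.training import Example

nlp = spacy.blank("en")                        # keep a single pipeline throughout
textcat = nlp.add_pipe("textcat", last=True)
textcat.add_label("POSITIVE")                  # hypothetical labels
textcat.add_label("NEGATIVE")

# Hypothetical training data: (text, {"cats": {...}}) tuples
train_data = [
    ("I loved it", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hated it", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

optimizer = nlp.initialize()                   # v3 replacement for begin_training()
for itn in range(20):
    random.shuffle(train_data)
    losses = {}
    for batch in spacy.util.minibatch(train_data, size=4):
        examples = [Example.from_dict(nlp.make_doc(text), annots)
                    for text, annots in batch]
        nlp.update(examples, sgd=optimizer, losses=losses)
    print(itn, losses)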

Torchtext 0.7 shows Field is being deprecated. What is the alternative?

Looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I don't seem to be able to find an example of the new paradigm for custom datasets (as in, not the ones included in torchtext.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?
Reference for deprecation:
https://github.com/pytorch/text/releases
It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:

from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)

or like so for custom built datasets:

from torch.utils.data import DataLoader

def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)

I've copied the examples from the source.
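In that custom-dataset snippet, train can be any map-style dataset you already have; as a quick illustration (the toy data below is made up), a plain list of (label, text) tuples is enough for DataLoader plus collate_fn:

# Hypothetical toy data standing in for `train`: any sequence of (label, text) pairs works
train = [(1, "great movie"), (0, "terrible movie"), (1, "loved it"), (0, "boring")]

dataloader = DataLoader(train, batch_size=2, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)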
Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.
If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:
# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset
Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:
import torchtext.legacy as torchtext
There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses the TextClassificationDataset along with the collator and other preprocessing. It reads a txt file and builds a dataset, followed by a model. Inside the post, there is a link to a complete working notebook. The post is at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. But you need the 'dev' (or nightly build) of PyTorch for it to work.
From the link above:
After tokenization and building vocabulary, you can build the dataset as follows
# Note: sequential_transforms, vocab_func, totensor and TextClassificationDataset
# come from torchtext's experimental API (see the linked post for the exact imports).
def data_to_dataset(data, tokenizer, vocab):
    data = [(text, label) for (text, label) in data]
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long))
    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset
The collator is as follows:
class Collator:
    # The post shows these methods as part of a collator class; the class name here is generic.
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def collate(self, batch):
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels
Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
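For example, a minimal sketch of that last step, assuming the dataset returned by data_to_dataset above, the collator class from the previous snippet, and that your vocab has a '<pad>' token (adjust the pad index to whatever your vocab actually uses):

from torch.utils.data import DataLoader

train_dataset = data_to_dataset(train_data, tokenizer, vocab)   # train_data/tokenizer/vocab are your own objects
pad_idx = vocab['<pad>']                                         # assumption: '<pad>' maps to the padding index
collator = Collator(pad_idx)
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, collate_fn=collator.collate)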
Well, it seems like the pipeline could look like this:

import torchtext as TT
import torch
from collections import Counter
from torch.utils.data import DataLoader
from torchtext.vocab import Vocab

# read the data
with open('text_data.txt', 'r') as f:
    data = f.readlines()
with open('labels.txt', 'r') as f:
    labels = f.readlines()

tokenizer = TT.data.utils.get_tokenizer('spacy', 'en')  # can remove 'spacy' and use a simple built-in tokenizer

train_iter = zip(labels, data)
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = TT.vocab.Vocab(counter, min_freq=1)

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt it for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0

class TextData(torch.utils.data.Dataset):
    '''
    Very basic dataset for processing text data.
    '''
    def __init__(self, labels, text):
        super(TextData, self).__init__()
        self.labels = labels
        self.text = text

    def __getitem__(self, index):
        return self.labels[index], self.text[index]

    def __len__(self):
        return len(self.labels)

def tokenize_batch(batch, max_len=200):
    '''
    Collate function to use in DataLoader.
    Takes a batch of the text dataset and produces a tensor batch,
    converting text and labels through the tokenizer and labeler:
    the tokenizer is the global function text_pipeline,
    the labeler is the global function label_pipeline.
    max_len is a fixed length: if a text is shorter than max_len it is
    padded with ones (the pad index); if it is longer, it is truncated,
    keeping the end of the string.
    '''
    labels_list, text_list = [], []
    for _label, _text in batch:
        labels_list.append(label_pipeline(_label))
        text_holder = torch.ones(max_len, dtype=torch.int32)  # fixed-size tensor of length max_len
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text))
        text_holder[-pos:] = processed_text[-pos:]
        text_list.append(text_holder.unsqueeze(dim=0))
    return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)

train_dataset = TextData(labels, data)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
lbl, txt = next(iter(train_loader))

Retraining pre-trained word embeddings in Python using Gensim

I want to retrain pre-trained word embeddings in Python using Gensim. The pre-trained embeddings I want to use is Google's Word2Vec in the file GoogleNews-vectors-negative300.bin.
According to Gensim's word2vec tutorial, "it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there."
Therefore I can't use the KeyedVectors, and for training a model the tutorial suggests using:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
(https://rare-technologies.com/word2vec-tutorial/)
However, when I try this:
from gensim.models import Word2Vec
model = Word2Vec.load('data/GoogleNews-vectors-negative300.bin')
I get an error message:
1330 # Because of loading from S3 load can't be used (missing readline in smart_open)
1331 if sys.version_info > (3, 0):
-> 1332 return _pickle.load(f, encoding='latin1')
1333 else:
1334 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '3'.
I didn't find a way to convert the binary Google News file into a text file properly, and even if I did, I'm not sure whether that would solve my problem.
Does anyone have a solution to this problem or knows about a different way to retrain pre-trained word embeddings?
The Word2Vec.load() method can only load full models in gensim's native format (based on Python object-pickling) – not any other binary/text formats.
And, as per the documentation's note that "it’s not possible to resume training with models generated by the C tool", there's simply not enough information in the GoogleNews raw-vectors files to reconstruct the full working model that was used to train them. (That would require both some internal model-weights, not saved in that file, and word-frequency-information for controlling sampling, also not saved in that file.)
The best you could do is create a new Word2Vec model, then patch some/all of the GoogleNews vectors into it before doing your own training. This is an error-prone process with no real best-practices and many caveats about the interpretation of final results. (For example, if you bring in all the vectors, but then only re-train a subset using only your own corpus & word-frequencies, the more training you do – making the word-vectors better fit your corpus – the less such re-trained words will have any useful comparability to retained untrained words.)
Essentially, if you can look at the gensim Word2Vec source and work out how to patch together such a frankenstein-model, it may be appropriate. But there's no built-in support or handy off-the-shelf recipe that makes it easy, because it's an inherently murky process.
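For what it's worth, a rough sketch of that "patch vectors into a new model" idea, assuming gensim 3.x, where Word2Vec exposes intersect_word2vec_format() (lockf=1.0 leaves the imported vectors free to keep training); gensim 4 reorganized this API, and all the caveats above about comparability still apply:

import gensim

# your_corpus: an iterable of tokenized sentences; the two sentences below are placeholders
your_corpus = [["this", "is", "a", "sentence"], ["another", "sentence"]]

model = gensim.models.Word2Vec(size=300, min_count=1)
model.build_vocab(your_corpus)                       # vocab and word frequencies come from YOUR corpus only
model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                binary=True, lockf=1.0)  # seed shared words with the GoogleNews vectors
model.train(your_corpus, total_examples=model.corpus_count, epochs=model.epochs)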
I have already answered it here.
Save the Google News model as a text file in word2vec format using gensim.
Refer to this answer to save it as a text file.
Then try this code.
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator

os.mkdir("model_dir")

# class EpochSaver(CallbackAny2Vec):
#     '''Callback to save model after each epoch.'''
#     def __init__(self, path_prefix):
#         self.path_prefix = path_prefix
#         self.epoch = 0
#
#     def on_epoch_end(self, model):
#         list_of_existing_files = os.listdir(".")
#         output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
#         try:
#             model.save(output_path)
#         except:
#             model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
#         print("number of epochs completed = {}".format(self.epoch))
#         self.epoch += 1
#         list_of_total_files = os.listdir(".")

# saver = EpochSaver("my_finetuned")

# Function to load vectors from an existing model.
# I am loading glove vectors from a text file; the benefit of doing this is that I get the complete glove vocab as well.
# If you are using a previous word2vec model, I would recommend saving it in txt format.
# In case you decide not to do that, you can tweak the function to get vectors for words in your vocab only.
def load_vectors(token2id, path, limit=None):
    embed_shape = (len(token2id), 300)
    freqs = np.zeros((len(token2id)), dtype='f')
    vectors = np.zeros(embed_shape, dtype='f')
    i = 0
    with open(path, encoding="utf8", errors='ignore') as f:
        for o in f:
            token, *vector = o.split(' ')
            token = str.lower(token)
            if len(o) <= 100:
                continue
            if limit is not None and i > limit:
                break
            vectors[token2id[token]] = np.array(vector, 'f')
            i += 1
    return vectors

# Path of the text file of your word vectors.
embedding_name = "word2vec.txt"
data = "<training data (newline-separated text file)>"

# Dictionary to store a unique id for each token in the vocab (in my case the vocab contains both my vocab and the glove vocab).
token2id = {}

# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}

# Populating vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
file = open("{}".format(data), 'r', encoding="utf-8")
for line in file.readlines():
    words = line.strip().split(" ")
    training_examples.append(words)
    for word in words:
        if word not in vocab_freq_dict:
            vocab_freq_dict.update({word: 0})
        vocab_freq_dict[word] += 1
        if word not in token2id:
            token2id.update({word: id_})
            id_ += 1

# Populating vocab_freq_dict and token2id from the glove vocab.
max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_id]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
    for o in f:
        token, *vector = o.split(' ')
        token = str.lower(token)
        if len(o) <= 100:
            continue
        if token not in token2id:
            max_token_id += 1
            token2id.update({token: max_token_id})
            vocab_freq_dict.update({token: 1})

with open("vocab_freq_dict", "wb") as vocab_file:
    pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
    pickle.dump(token2id, token2id_file)

# Converting vectors to KeyedVectors format for gensim.
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)

# Setting vectors (numpy array) to None to release memory.
vectors = None

params = dict(min_count=1, workers=14, iter=6, size=300)
model = Word2Vec(**params)

# Using build_vocab_from_freq to build the vocab.
model.build_vocab_from_freq(vocab_freq_dict)

# Using token2id to create idxmap.
idxmap = np.array([token2id[w] for w in model.wv.index2entity])

# Setting the input-to-hidden weights (syn0) = your vectors arranged according to ids.
model.wv.vectors[:] = vec.vectors[idxmap]

# Setting the hidden-to-output weights (syn1neg) = your vectors arranged according to ids.
model.trainables.syn1neg[:] = vec.vectors[idxmap]

model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)

output_path = 'model_dir/final_model.model'
model.save(output_path)
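As a quick sanity check after training, you can query the fine-tuned vectors directly (the probe word below is arbitrary; pick one from your own vocabulary):

print(model.wv.most_similar('bank', topn=5))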

scikit-learn - Using a single string with RandomForestClassifier.predict()?

I'm an sklearn dummy... I'm trying to predict the label for a given string from a RandomForestClassifier() fitted with texts and labels.
It's obvious I don't know how to use predict() with a single string. The reason I'm using reshape() is that some time ago I got the error "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
How can I predict the label of a single text string?
The script:
#!/usr/bin/env python
''' Read a txt file consisting of '<label>: <long string of text>'
to use as a model for predicting the label for a string
'''
from argparse import ArgumentParser
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def main(args):
    '''
    args: Arguments obtained by _Get_Args()
    '''
    print('Loading data...')

    # Load data from args.txtfile and split the lines into
    # two lists (labels, texts).
    data = open(args.txtfile).readlines()
    labels, texts = ([], [])
    for line in data:
        label, text = line.split(': ', 1)
        labels.append(label)
        texts.append(text)

    # Print a list of unique labels
    print(json.dumps(list(set(labels)), indent=4))

    # Instantiate a CountVectorizer class and fit the texts
    # and labels into it.
    cv = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
    )
    matrix = cv.fit_transform(texts)

    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)

    rf = RandomForestClassifier()
    rf.fit(matrix, labels)

    # Try to predict the label for args.string.
    prediction = Predict_Label(args.string, cv, rf)
    print(prediction)

def Predict_Label(string, cv, rf):
    '''
    string: str() - A string of text
    cv: The CountVectorizer class
    rf: The RandomForestClassifier class
    '''
    matrix = cv.fit_transform([string])
    matrix = matrix.reshape(1, -1)
    try:
        prediction = rf.predict(matrix)
    except Exception as E:
        print(str(E))
    else:
        return prediction

def _Get_Args():
    parser = ArgumentParser(description='Learn labels from text')
    parser.add_argument('-t', '--txtfile', required=True)
    parser.add_argument('-s', '--string', required=True)
    return parser.parse_args()

if __name__ == '__main__':
    args = _Get_Args()
    main(args)
The actual training-data text file is 43,663 lines long, but a sample is in small_list.txt; each line has the format <label>: <long text string>.
The error is noted in the Exception output:
$ ./learn.py -t small_list.txt -s 'This is a string that might have something to do with phishing or fraud'
Loading data...
[
"Vulnerabilities__Unknown",
"Vulnerabilities__MSSQL Browsing Service",
"Fraud__Phishing",
"Fraud__Copyright/Trademark Infringement",
"Attacks and Reconnaissance__Web Attacks",
"Vulnerabilities__Vulnerable SMB",
"Internal Report__SBL Notify",
"Objectionable Content__Russian Federation Objectionable Material",
"Malicious Code/Traffic__Malicious URL",
"Spam__Marketing Spam",
"Attacks and Reconnaissance__Scanning",
"Malicious Code/Traffic__Unknown",
"Attacks and Reconnaissance__SSH Brute Force",
"Spam__URL in Spam",
"Vulnerabilities__Vulnerable Open Memcached",
"Malicious Code/Traffic__Sinkhole",
"Attacks and Reconnaissance__SMTP Brute Force",
"Illegal content__Child Pornography"
]
Number of features of the model must match the input. Model n_features is 2070 and input n_features is 3
None
You need to get the vocabulary of the first CountVectorizer (cv) and use it to transform the new single text before calling predict.
...
    cv = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
    )
    matrix = cv.fit_transform(texts)

    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)

    rf = RandomForestClassifier()
    rf.fit(matrix, labels)

    # Try to predict the label for args.string.
    cv_new = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
        vocabulary=cv.vocabulary_
    )

    prediction = Predict_Label(args.string, cv_new, rf)
    print(prediction)
...
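A simpler alternative worth noting: since cv is already fitted on the training texts, you can skip the second vectorizer entirely and call transform (never fit_transform) inside Predict_Label, which keeps the same feature space the forest was trained on. A sketch:

def Predict_Label(string, cv, rf):
    '''
    string: a single text string
    cv: the CountVectorizer already fitted on the training texts
    rf: the fitted RandomForestClassifier
    '''
    matrix = cv.transform([string])  # transform only: reuses the vocabulary learned during fit_transform
    return rf.predict(matrix)

If you also want the original string label back, encoder.inverse_transform(prediction) undoes the LabelEncoder.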

Classify text using NaiveBayesClassifier

I have a text file with a sentence on each line,
e.g. "Have you registered your email ID with your Bank Account?"
I want to classify each sentence as interrogative or not. FYI these are sentences from bank websites.
I've seen this answer
with this nltk code block:
import nltk

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
So I did some preprocessing on my text file, i.e. stemming words, removing stop words, etc., to turn each sentence into a bag of words. From the code above, I have a trained classifier. How do I apply it to my text file of sentences (either raw or preprocessed)?
Update: here is an example of my text file.
Assuming that you have preprocessed the document data as we discussed, you can do the following:
import nltk

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(nltk.classify.accuracy(classifier, test_set))
0.668
For your data, you can iterate over the lines of your file and classify each one:

classifier = nltk.NaiveBayesClassifier.train(featuresets)
for line in open('my_sentences.txt'):   # placeholder path: your file with one sentence per line
    print(classifier.classify(dialogue_act_features(line)))

Doing this for all lines in the text file works.
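Since the goal is a binary interrogative/not decision rather than the full dialogue-act label, you can collapse the predicted class; this assumes the NPS Chat class names 'whQuestion' and 'ynQuestion' cover the question types you care about:

def is_interrogative(sentence):
    # Collapse the dialogue-act label into a question / non-question flag
    label = classifier.classify(dialogue_act_features(sentence))
    return label in ('whQuestion', 'ynQuestion')

print(is_interrogative("Have you registered your email ID with your Bank Account?"))  # expected: True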
