Randomly select vector in gensim word2vec - python-3.x

I trained a word2vec model using gensim and I want to randomly select vectors from it, and find the corresponding word.
What is the best way to do so?

If your Word2Vec model instance is in the variable model, then there's a list of all words known to the model in model.wv.index2word. (The properties are slightly different in older versions of gensim.)
So, you can pick one item using the choice() function from Python's random module:
import random
print(random.choice(model.wv.index2word))
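If you also want the vector for the word you picked, you can look it up directly. A minimal sketch, assuming the same gensim 3.x property names used above:
import random
word = random.choice(model.wv.index2word)  # pick a random word known to the model
vector = model.wv[word]                    # look up its embedding
print(word, vector[:5])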

If you want to get n random words (keys) from a word2vec model with Gensim 4.0.0, just use random.sample:
import random
import gensim
# Here we use Gensim 4.0.0
w2v = gensim.models.KeyedVectors.load_word2vec_format("model.300d")
# Get 10 random words (keys) from word2vec model
random_words = random.sample(w2v.index_to_key, 10)
print("Random words: "+ str(random_words))
Piece of cake :)

Related

Looking for an effective NLP Phrase Embedding model

The goal I want to achieve is to find a good word_and_phrase embedding model that can do:
(1) It has embeddings for the words and phrases I am interested in.
(2) I can use the embeddings to compare the similarity between two items (either words or phrases).
So far I have tried two paths:
1: Some Gensim-loaded pre-trained models, for instance:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
# download the model and return as object ready for use
model_glove_twitter = api.load("fasttext-wiki-news-subwords-300")
model_glove_twitter.similarity('computer-science', 'machine-learning')
The problem with this path is that I do not know if a phrase has embedding. For this example, I got this error:
KeyError: "word 'computer-science' not in vocabulary"
I will have to try different pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, glove-twitter-200, etc. The results are similar: there are always phrases of interest that have no embeddings.
Then I tried to use some BERT-based sentence embedding method: https://github.com/UKPLab/sentence-transformers.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
from scipy.spatial.distance import cosine
def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))
phrase_1 = 'baby girl'
phrase_2 = 'annual report'
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1[0], embedding_2[0])
Using this method I was able to get embeddings for my phrases, but the similarity score of 0.93 did not seem reasonable.
So what else can I try to achieve the two goals mentioned above?
The problem with the first path is that you are loading fastText embeddings as if they were word2vec embeddings, and word2vec can't cope with out-of-vocabulary (OOV) words.
The good thing is that fastText can handle OOV words.
You can use Facebook's original implementation (pip install fasttext) or Gensim's implementation.
For example, using Facebook implementation, you can do:
import fasttext
import fasttext.util
# download an english model
fasttext.util.download_model('en', if_exists='ignore') # English
model = fasttext.load_model('cc.en.300.bin')
# get word embeddings
# (if instead you want sentence embeddings, use get_sentence_vector method)
word_1 = 'computer-science'
word_2 = 'machine-learning'
embedding_1 = model.get_word_vector(word_1)
embedding_2 = model.get_word_vector(word_2)
# compare the embeddings (reusing the cosine_similarity helper defined above)
cosine_similarity(embedding_1, embedding_2)
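For the Gensim route, a rough equivalent sketch, assuming a recent gensim with fastText support and the same cc.en.300.bin file downloaded above:
from gensim.models.fasttext import load_facebook_vectors
# load the native fastText binary; subword information is kept, so
# out-of-vocabulary keys such as 'computer-science' still get a vector
wv = load_facebook_vectors('cc.en.300.bin')
print(wv.similarity('computer-science', 'machine-learning'))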

finding semantic similarity between 2 statements

I am currently working on a small application in Python which has search functionality (currently using difflib), but I want to build a semantic search that can return the top 5 or 10 results from my database based on user-inputted text, similar to how the Google search engine works. I found some solutions here.
But the problem is that the two statements below, from one of those solutions, are semantically different, and I don't care about this, because handling it makes things too hard, which I don't want. Also, the solution should be a pretrained neural network model or library that I can implement easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station
I also found some solutions that use gensim and GloVe embeddings, but they find the similarity between words, not sentences.
What I want
Suppose my db has the statement display classes, and user inputs like show, showed, displayed, displayed class, show types etc. should all be treated as the same. And if the above 2 statements are treated as the same, I don't care either. displayed and displayed class already match with difflib.
Points to be noted
Matching is against a fixed set of statements, but the user-inputted statements can differ
It must work for full statements, not just single words
I think it is not a gensim embedding, it is a word2vec embedding. Whatever it is.
You need tensorflow_hub
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is text classification or semantic similarity, because you want to find the nearest top 5 or 10 statements given a statement from the user.
It is easy to use, but the size of the model is ≈ 1 GB. It works with words, sentences, phrases or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load model. It will download first time.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# data[0] is the reference statement; the others are compared against it
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# score each statement against the first one using the inner product (higher = more similar)
dists = np.inner(vecs[0], vecs)
# print dists
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
The first value, 0.9999999, is the similarity between display classes and itself. The second, 0.5633253, is the similarity between display classes and show, and the last, 0.61701006, is the similarity between display classes and show types.
Using this, you can score the given input against the statements in your db, then rank them by similarity, highest first.
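As an illustration of that ranking step, here is a small sketch that reuses model, data and vecs from the snippet above; the query string and the top-5 cutoff are arbitrary choices:
query_vec = model(["show me the classes"])[0]  # encode the user's input
sims = np.inner(query_vec, vecs)               # similarity to every statement in the db
top_idx = np.argsort(sims)[::-1][:5]           # indices of the 5 most similar statements
for i in top_idx:
    print(data[i], float(sims[i]))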
You can use WordNet to find synonyms and then use those synonyms to find similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except:
        print("Something Wrong Happened")
    syn_words = []
    for i in syn_list:
        syn_words.append(i.lemmas()[0].name())
    return syn_words
Now split each statement in your db into words, like this:
stat = ["display classes"]
syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now that you have the synonyms, just compare them with the inputted text. Use a lemmatizer before comparing words, so that displayed becomes display.
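For that lemmatization step, a minimal sketch with NLTK's WordNetLemmatizer (it relies on the wordnet corpus downloaded above; some newer NLTK versions may also require nltk.download('omw-1.4')):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# lemmatizing 'displayed' as a verb reduces it to 'display'
print(lemmatizer.lemmatize("displayed", pos="v"))  # display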
You can also use spaCy.
This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run following command, which will download model.
!python -m spacy download en_core_web_lg

How to sentence embed from gensim Word2Vec embedding vectors?

I have a pandas dataframe containing descriptions. I would like to cluster the descriptions based on meaning using CBOW. My challenge for now is to embed each row as a document vector of equal dimensions. First, I am training the word vectors using gensim like so:
from gensim.models import Word2Vec
vocab = pd.concat((df['description'], df['more_description']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
I am, however, a bit confused now about how to replace the full sentences from my df with document vectors of equal dimensions.
For now, my workaround is replacing each word in each row with a vector and then applying PCA dimensionality reduction to bring each vector to similar dimensions. Is there a better way of doing this through gensim, so that I could say something like this:
df['description'].apply(model.vectorize)
I think you are looking for sentence embeddings. There are a lot of ways to generate sentence embeddings from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings
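One common baseline from that thread is simply averaging the word vectors of each description. A rough sketch, assuming the Word2Vec model above was trained on tokenised text and that each row of df['description'] is a list of tokens (the helper name vectorize is just for illustration):
import numpy as np
def vectorize(tokens, model):
    # keep only tokens the model knows, then average their vectors
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
df['doc_vector'] = df['description'].apply(lambda tokens: vectorize(tokens, model))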

Use gensim Random Projection in sklearn SVM

Is it possible to use a gensim Random Projection to train a SVM in sklearn?
I need to use gensim's tfidf implementation because it's better at dealing with large inputs and then want to put that into a random projection on which I will train my SVM. I'd also be happy to just pass the tfidf model generated by gensim to sklearn and use their random projection, if that makes things easier.
But so far I haven't found a way to get either model out of gensim into sklearn.
I have tried using gensim.matutils.corpus2csc, but of course that doesn't work: neither TfidfModel nor RpModel are corpora, so now I'm clueless about what to try next.
This is now very easy thanks to an awesome gensim contribution from Chinmaya Pancholi (see post here).
Simply import the sklearn wrapper from gensim:
from gensim.sklearn_api import RpTransformer
Then, you can use the model to do analysis as you would any other sklearn classifier:
from sklearn import svm
from sklearn.pipeline import Pipeline

model = RpTransformer(num_topics=2)
clf = svm.SVC()
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(X_train, y_train)
One thing to be aware of, when using the gensim models, is that you still need to perform the dictionary and corpus steps. So instead of fitting your model on X_train, you'll have to do something along the following lines:
from gensim.corpora import Dictionary

dictionary = Dictionary(X_train)
corpus_train = [dictionary.doc2bow(text) for text in X_train]
corpus_test = [dictionary.doc2bow(text) for text in X_test]
Then fit/predict your model on corpus_train or corpus_test.
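Putting the pieces together, here is a minimal end-to-end sketch, assuming gensim 3.x (where gensim.sklearn_api is available) and a toy pre-tokenised dataset in place of your real X_train/y_train:
from gensim.corpora import Dictionary
from gensim.sklearn_api import RpTransformer
from sklearn import svm
from sklearn.pipeline import Pipeline
# toy tokenised documents and labels, purely for illustration
X_train = [["human", "computer", "interface"],
           ["graph", "trees", "minors"],
           ["human", "system", "interface"],
           ["graph", "minors", "survey"]]
y_train = [0, 1, 0, 1]
dictionary = Dictionary(X_train)
corpus_train = [dictionary.doc2bow(text) for text in X_train]
pipe = Pipeline([('features', RpTransformer(num_topics=2)),
                 ('classifier', svm.SVC())])
pipe.fit(corpus_train, y_train)
print(pipe.predict(corpus_train))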

Bigram vector representations using word2vec

I want to construct word embeddings for documents using the word2vec tool. I know how to find a vector embedding corresponding to a single word (unigram). Now, I want to find a vector for a bigram. Is it possible to construct a bigram word embedding using word2vec? If yes, how?
The following snippet will get you the vector representation of a bigram. Note that the bigram you want to convert to a vector needs to have an underscore instead of a space between the words, e.g. bigram2vec(unigrams, "this report") is wrong, it should be bigram2vec(unigrams, "this_report"). For more details on generating the unigrams, please see the gensim.models.word2vec.Word2Vec class here.
from gensim.models import word2vec
from gensim.models.phrases import Phrases

def bigram2vec(unigrams, bigram_to_search):
    # learn common bigrams from the unigram sentences, then train word2vec
    # on the phrase-transformed corpus
    bigrams = Phrases(unigrams)
    model = word2vec.Word2Vec(bigrams[unigrams])
    # (for gensim >= 4.0 use model.wv.key_to_index instead of model.wv.vocab)
    if bigram_to_search in model.wv.vocab:
        return model.wv[bigram_to_search]
    else:
        return None
