Gensim Compute centroid from list of words - nlp

How can I compute the centroid of five given words from a word embedding and then find the words most similar to that centroid (in gensim)?

You should check out the gensim Word2Vec tutorial. Following it, first train a model on the bundled Lee corpus:
from gensim.test.utils import datapath
from gensim import utils
import gensim.models
import numpy as np
class MyCorpus:
    """An iterator that yields sentences (lists of str)."""
    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)
word_vectors = model.wv
Then average the vectors of the five words and look up the nearest neighbours of that average:
centroid = np.average([word_vectors[w] for w in ['king', 'man', 'walk', 'tennis', 'victorian']], axis=0)
word_vectors.similar_by_vector(centroid)
which will give you in this case
[('man', 0.9996674060821533),
('by', 0.9995684623718262),
('over', 0.9995648264884949),
('from', 0.9995632171630859),
('were', 0.9995599389076233),
('who', 0.99954754114151),
('today', 0.9995439648628235),
('which', 0.999538004398346),
('on', 0.9995279312133789),
('being', 0.9995211958885193)]
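To make the centroid lookup less of a black box, here is a minimal sketch of the same idea in plain NumPy: average the vectors, then rank every word by cosine similarity to that average. The function name `centroid_neighbours` and the toy 2-d vectors are invented for illustration; with gensim you would use the trained `word_vectors` instead.

```python
import numpy as np

def centroid_neighbours(vectors, words, topn=3):
    """Rank vocabulary words by cosine similarity to the centroid of `words`."""
    vocab = list(vectors)
    matrix = np.array([vectors[w] for w in vocab], dtype=float)
    centroid = np.mean([vectors[w] for w in words], axis=0)
    # Cosine similarity is the dot product of L2-normalised vectors.
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    centroid_norm = centroid / np.linalg.norm(centroid)
    scores = matrix_norm @ centroid_norm
    order = np.argsort(-scores)[:topn]
    return [(vocab[i], float(scores[i])) for i in order]

# Toy 2-d embedding, invented purely for illustration.
toy_vectors = {
    'king':  np.array([1.0, 0.1]),
    'queen': np.array([0.9, 0.2]),
    'apple': np.array([0.0, 1.0]),
    'pear':  np.array([0.1, 0.9]),
}
neighbours = centroid_neighbours(toy_vectors, ['king', 'queen'], topn=2)
print(neighbours)
```

As in the gensim output above, the query words themselves are not excluded from the result, so they tend to show up among the nearest neighbours.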


Rouge-L score very low

I use the Hugging Face evaluate library to calculate the ROUGE scores of summarization results. The ROUGE-1 and ROUGE-2 scores are fine, but my ROUGE-L score is very low compared to the results in papers.
For example, on the eLife dataset, the baseline lead-k model's reported ROUGE-1/2/L scores are 34.12 / 6.73 / 32.06, while mine are 37.18 / 7.97 / 15.05. Apparently, something goes wrong with my calculation.
Here is my code:
import evaluate
import transformers
import os
import torch
from datasets import list_datasets, load_dataset
import nltk
import numpy as np
rouge = evaluate.load('rouge')
elife = load_dataset('tomasg25/scientific_lay_summarisation', 'elife')
lexsum = load_dataset('allenai/multi_lexsum')
refs = []
predicts_lead3 = []
predicts_leadk = []
for text in elife['test']['summary']:
    refs.append(text)
for text in elife['test']['article']:
    # lead-3: first three sentences; lead-k: first 383 whitespace-separated tokens
    predicts_lead3.append(' '.join(nltk.sent_tokenize(text)[:3]))
    predicts_leadk.append(' '.join(text.split(' ')[:383]))
result_3 = rouge.compute(predictions=predicts_lead3, references=refs)
print("lead 3 results:")
print(result_3)
result_k = rouge.compute(predictions=predicts_leadk, references=refs)
print("lead k results:")
print(result_k)
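For reference, ROUGE-L is built on the longest common subsequence (LCS) between prediction and reference. The sketch below is a simplified illustration of that definition, not the implementation used by evaluate: it assumes lowercased whitespace tokenization, no stemming, and treats the whole text as one sequence. The function names are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(score)
```

Because a single LCS over a long multi-sentence summary can be much shorter than the sum of per-sentence matches, it may be worth checking the rougeLsum value that the evaluate ROUGE metric also returns. That variant treats newlines as sentence boundaries, so predictions and references would need one sentence per line. Whether the papers report that variant is an assumption to verify, not a confirmed diagnosis.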

Why is sklearn RandomForestClassifier root node different from the most important feature?

How is feature importance calculated in RandomForestClassifier in scikit-learn?
Here's a reproducible example. I run the classifier once with criterion set to gini and once with entropy. For each of them, I print the feature importances and plot the tree.
In neither case is the feature at the root node the same as the most important feature. Why is that?
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image, display
from subprocess import call
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.datasets import load_iris
wines = load_wine()
iris = load_iris()
seed = 0  # any fixed integer; the original value is not shown in the post
def create_and_fit(clf, model_name):
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=5, random_state=seed)
    # X, y = wines.data, wines.target
    # X, y = iris.data, iris.target
    # fit the model
    clf.fit(X, y)
    # get importance, sorted from most to least important
    importance = clf.feature_importances_
    indices = np.argsort(importance)[::-1]
    for f in range(X.shape[1]):
        print("feature {}: ({})".format(indices[f], importance[indices[f]]))
    filename = model_name + clf.criterion
    if model_name == 'forest_':
        # a forest contains many trees; only the first one is exported
        export_graphviz(clf.estimators_[0], out_file=filename+'.dot')
    else:
        export_graphviz(clf, out_file=filename+'.dot')
    call(['dot', '-Tpng', filename+'.dot', '-o', filename+'.png', '-Gdpi=600'])
models = [
    RandomForestClassifier(criterion='gini',max_depth=5, random_state=seed),
    RandomForestClassifier(criterion='entropy',max_depth=5, random_state=seed),
]
names =['forest_', 'forest_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'forest_gini'+'.png')
and for the entropy
Image(filename = 'forest_entropy'+'.png')
This behaviour seems to only happen with ensembles not trees (I'm generalizing as I only tried on Random forest and Decision Tree).
Here's the snippet for decision trees
models = [
    DecisionTreeClassifier(criterion='gini',max_depth=5, random_state=seed),
    DecisionTreeClassifier(criterion='entropy',max_depth=5, random_state=seed)
]
names =['tree_', 'tree_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'tree_gini'+'.png')
and for the entropy
Image(filename = 'tree_entropy'+'.png')
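I think I found the answer, which is related to the max_features parameter of RandomForestClassifier. With the default setting, each split only considers a random subset of the features, so the feature chosen at the root of any one tree is not necessarily the most important one overall. The forest's feature_importances_ are also averaged over all of its trees, whereas the plot shows only the first tree. Here's the scikit-learn documentation: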
max_features : {"sqrt", "log2", None}, int or float
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
- If "auto", then max_features=sqrt(n_features).
- If "sqrt", then max_features=sqrt(n_features).
- If "log2", then max_features=log2(n_features).
- If None, then max_features=n_features.
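One way to see the effect of max_features is to count which feature each tree in the forest uses at its root and compare that with the forest-level importances. The sketch below reuses the synthetic dataset from the question; random_state=0 is an arbitrary choice, and the exact counts will vary with the seed and the scikit-learn version.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic data as in the question; random_state=0 is an arbitrary choice.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=5, random_state=0)

forest = RandomForestClassifier(criterion='gini', max_depth=5, random_state=0)
forest.fit(X, y)

# tree_.feature[0] holds the index of the feature tested at each tree's root.
root_features = Counter(int(est.tree_.feature[0]) for est in forest.estimators_)
top_feature = int(np.argmax(forest.feature_importances_))

print("root feature counts:", dict(root_features))
print("most important feature overall:", top_feature)
```

If the roots are spread over several features, the first tree's root is just one draw from that distribution. Passing max_features=None (and bootstrap=False) should bring each tree closer to what a single DecisionTreeClassifier does, since every split then sees all features and all samples.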

Deal with Out of vocabulary word with Gensim pretrained GloVe

I am working on an NLP assignment and loaded the GloVe vectors provided by Gensim:
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')
I am trying to get the word embedding for each word in a sentence, but some of them are not in the vocabulary.
What is the best way to deal with it working with the Gensim API?
Load the model:
import gensim.downloader as api
model = api.load("glove-twitter-25") # load glove vectors
# model.most_similar("cat") # show words that similar to word 'cat'
There is a very simple way to find out whether a word exists in the model's vocabulary:
# api.load returns KeyedVectors, which support the `in` operator directly
print('Word exists' if word in model else 'Word does not exist')
Apart from that, I had used the following logic to create sentence embedding (25 dim) with N tokens:
from __future__ import print_function, division
import os
import re
import sys
import regex
import numpy as np
from functools import partial
from fuzzywuzzy import process
from Levenshtein import ratio as lev_ratio
import gensim
import tempfile
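# NOTE: model[w] and model.most_similar(w) raise KeyError for unknown words in
# plain KeyedVectors, so this logic assumes a model that can embed unseen
# words (for example FastText).
def vocab_check(model, word):
    # weight an unseen word by its similarity to the closest-spelled neighbour
    similar_words = model.most_similar(word)
    match_ratio = 0.
    match_word = ''
    for sim_word, sim_score in similar_words:
        ratio = lev_ratio(word, sim_word)
        if ratio > match_ratio:
            match_ratio = ratio
            match_word = sim_word
    if match_word == '':
        return similar_words[0][1]
    return model.similarity(word, match_word)
def sentence2vector(model, sent, dim=25):
    words = sent.split(' ')
    emb = [model[w.strip()] for w in words]
    weights = [1. if w in model else vocab_check(model, w) for w in words]
    if len(emb) == 0:
        sent_vec = np.zeros(dim, dtype=np.float16)
    else:
        # weighted sum of the word vectors
        sent_vec = np.dot(weights, emb)
    sent_vec = sent_vec.astype("float16")
    return sent_vec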
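A simpler fallback, if dropping unknown words is acceptable, is to average only the tokens that are in the vocabulary. The sketch below is written against any mapping that supports `in` and indexing, which gensim KeyedVectors do; the toy dictionary stands in for the GloVe model, and lowercasing the input is an assumption about how the pretrained vocabulary was built.

```python
import numpy as np

def mean_sentence_vector(vectors, sentence, dim=25):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    tokens = sentence.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)

# Toy stand-in for the GloVe vectors (values invented for illustration).
toy_vectors = {
    'hello': np.array([1.0, 0.0]),
    'world': np.array([0.0, 1.0]),
}
vec = mean_sentence_vector(toy_vectors, "Hello brave world", dim=2)
print(vec)
```

Here "brave" is out of vocabulary and is skipped, so the result is the mean of the two known vectors. With the real model the call would look like mean_sentence_vector(model, sentence).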

word not in vocabulary

This is my first time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each title, then apply word2vec to see whether there are similar words (to indicate similar titles).
So far I have parsed the XML file using ElementTree to retrieve each title, then applied sent_tokenize followed by TweetTokenizer to get a list of sentences in which each word has been tokenized (I'm not sure this was the best method). I then put the tokenized sentences into my word2vec model and tested it with one word to see whether it returned a vector. This seems to work only for a word in the first sentence. I'm not sure it is recognising all the sentences?
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text
    sentence = Patent_Title
    stopWords = set(stopwords.words('english'))
    tokens = nltk.sent_tokenize(sentence)
    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
    model = gensim.models.Word2Vec(tokens_sentences, min_count=1,size=32)
    words = list(model.wv.vocab)
    print(model['Solar'])
I would expect it to identify the word 'Solar' in a sentence and print out the vector, so that I could then look for similar words. Instead I am receiving the error:
"word 'Solar' not in vocabulary"
Just handle the errors as exceptions on the first loop occurrence.
try:
    print(model['Solar'])
except Exception as e:
    print(e)
Working code :
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text
    sentence = Patent_Title
    stopWords = set(stopwords.words('english'))
    tokens = nltk.sent_tokenize(sentence)
    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
    model = gensim.models.Word2Vec(tokens_sentences, min_count=1,size=32)
    words = list(model.wv.vocab)
    try:
        print(model['Solar'])
    except Exception as e:
        # titles that do not contain the word end up here
        print(e)
It is simply because Solar is not in your corpus.
Word2Vec tries to generate word vectors for each word in your tokens_sentences. If the training corpus didn't include the word/token that you try to look up, word2vec would not have the word vector for that word and that is why you got an error.
Advice: try to make your text data case-insensitive. That is, convert all the text to lower case (upper case works too, but it is not the convention).
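The lowercasing advice can be combined with collecting every title before training, so that one model sees the whole corpus. As the snippets above appear to be laid out, the model is rebuilt for each title, which would leave only that title's words in the vocabulary; that reading is an assumption. The sketch below shows only the preprocessing step in plain Python. The helper name, the regex tokenizer, and the sample titles are hypothetical stand-ins for the parsed XML and the NLTK tokenizers.

```python
import re

def titles_to_sentences(titles):
    """Lowercase each title and split it into word tokens."""
    sentences = []
    for title in titles:
        if not title:  # skip empty or missing <Title> text
            continue
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

# Hypothetical titles standing in for the parsed XML.
titles = ["Solar panel mounting system", None, "Hybrid SOLAR water heater"]
sentences = titles_to_sentences(titles)
vocabulary = {token for sentence in sentences for token in sentence}
print(sentences)
print('solar' in vocabulary, 'Solar' in vocabulary)

# The whole list would then be passed to a single model, for example:
# model = gensim.models.Word2Vec(sentences, min_count=1)
```

After this step the lookup has to use the lowercased form as well, i.e. 'solar' rather than 'Solar'.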

sklearn partial fit of CountVectorizer

Does CountVectorizer support partial fit?
I would like to train the CountVectorizer using different batches of data.
No, it does not support partial fit.
But you can write a simple method to accomplish your goal:
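def partial_fit(self , data):
    if(hasattr(vectorizer , 'vocabulary_')):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit on the new batch, then merge with the vocabulary seen so far
    self.fit(data)
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_ )))
    self.vocabulary_ = {vocab[i] : i for i in range(len(vocab))}
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit
vectorizer = CountVectorizer(stop_words=l)
# `df` is a placeholder: the name of the object being indexed is lost in the original post
vectorizer.fit(df[15].values[0:100])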
The implementation by sajiad is correct and I'm grateful to them for sharing their solution. It could be made more flexible by amending the call to hasattr() to reference self instead of vectorizer.
I've implemented this with a short reproducible example below illustrating the role of partial_fit() compared to fit():
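def partial_fit(self , data):
    if(hasattr(self , 'vocabulary_')):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit on the new batch, then merge with the vocabulary seen so far
    self.fit(data)
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_ )))
    self.vocabulary_ = {vocab[i] : i for i in range(len(vocab))}
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit
vectorizer = CountVectorizer()
corpus = ['The quick brown fox',
          'jumps over the lazy dog']
# Without partial fit: each call to fit() discards the previous vocabulary
for i in corpus:
    vectorizer.fit([i])
print(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
['dog', 'jumps', 'lazy', 'over', 'the']
# With partial fit: the vocabulary accumulates across batches
for i in corpus:
    vectorizer.partial_fit([i])
print(vectorizer.get_feature_names())
['over', 'fox', 'lazy', 'quick', 'the', 'jumps', 'dog', 'brown']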
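The same idea can be packaged as a subclass instead of patching CountVectorizer globally. This is a sketch under the same assumptions as the answers above; the class name IncrementalCountVectorizer is hypothetical, and sorting the merged vocabulary is a design choice added here so that column indices are reproducible between runs.

```python
from sklearn.feature_extraction.text import CountVectorizer

class IncrementalCountVectorizer(CountVectorizer):
    """Hypothetical subclass that accumulates vocabulary across batches."""

    def partial_fit(self, raw_documents):
        old_terms = set(getattr(self, 'vocabulary_', {}))
        self.fit(raw_documents)  # learns the vocabulary of this batch only
        merged = sorted(old_terms | set(self.vocabulary_))
        # Sorted order keeps column indices reproducible between runs.
        self.vocabulary_ = {term: i for i, term in enumerate(merged)}
        return self

vectorizer = IncrementalCountVectorizer()
for batch in (['The quick brown fox'], ['jumps over the lazy dog']):
    vectorizer.partial_fit(batch)

terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
counts = vectorizer.transform(['the quick dog'])
print(counts.toarray())
```

Note that column indices shift whenever the vocabulary grows, so matrices produced by transform() before a later partial_fit() are not aligned with ones produced afterwards. If a fixed feature space is needed from the start, scikit-learn's HashingVectorizer is stateless and may be a better fit.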
