How to run tsne on word2vec created from gensim? - scikit-learn

I want to visualize a word2vec model created with the gensim library. I tried sklearn, but it seems I need to install a developer version to get it. I tried installing the developer version, but that did not work on my machine. Is it possible to modify this code to visualize a word2vec model?
tsne_python

You don't need a developer version of scikit-learn - just install scikit-learn the usual way via pip or conda.
To access the word vectors created by word2vec, simply use the vocabulary as an index into the model:
X = model.wv[model.wv.vocab]
Following is a simple but complete code example which loads some newsgroup data, applies very basic data preparation (cleaning and breaking up sentences), trains a word2vec model, reduces the dimensions with t-SNE, and visualizes the output.
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_20newsgroups
import re
import matplotlib.pyplot as plt

# download example data (may take a while)
train = fetch_20newsgroups()

def clean(text):
    """Remove posting header, split by sentences and words, keep only letters"""
    lines = re.split(r'[?!.:]\s', re.sub(r'^.*Lines: \d+', '', re.sub('\n', ' ', text)))
    return [re.sub('[^a-zA-Z]', ' ', line).lower().split() for line in lines]

sentences = [line for text in train.data for line in clean(text)]
model = Word2Vec(sentences, workers=4, size=100, min_count=50, window=10, sample=1e-3)
print(model.wv.most_similar('memory'))

X = model.wv[model.wv.vocab]
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
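Note that the snippet above targets the older gensim 3.x API. If you are on gensim 4.x, model.wv.vocab and the size parameter were removed; a minimal sketch of the equivalent calls (same pipeline, just the renamed API):

# gensim 4.x sketch: vector_size and index_to_key replace the 3.x names
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = Word2Vec(sentences, workers=4, vector_size=100, min_count=50, window=10, sample=1e-3)
words = model.wv.index_to_key   # list of vocabulary words
X = model.wv[words]             # matrix of shape (len(words), 100)
X_tsne = TSNE(n_components=2).fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()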

Use the code below; instead of X, stack all your word embedding vectors vertically into a matrix X using numpy.vstack, and then fit_transform it.
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
The output of fit_transform has shape vocab_size x 2, so you can visualise it.
vocab = sorted(word2vec_model.get_vocab())  # not sure of the exact API
emb_tuple = tuple([word2vec_model[v] for v in vocab])
X = np.vstack(emb_tuple)
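Putting the two pieces together, a minimal end-to-end sketch, assuming a trained gensim 3.x model named w2v_model (the variable name is hypothetical) where model.wv.vocab is a dict keyed by word:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vocab = sorted(w2v_model.wv.vocab)               # stable word order
X = np.vstack([w2v_model.wv[v] for v in vocab])  # (vocab_size, dim) matrix
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.show()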

Related

Getting similarity score with spacy and a transformer model

I've been using spacy's en_core_web_lg and wanted to try out en_core_web_trf (transformer model), but I'm having some trouble wrapping my head around the difference in the model/pipeline usage.
My use case looks like the following:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")
s1.similarity(s2)
Output:
The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements.
(0.0, Space aliens lurk in the night time.)
Looking at this post, the transformer model does not have word vectors in the same way en_core_web_lg does, but you can get the embeddings via s1._.trf_data.tensors, which looks like:
s1._.trf_data.tensors[0].shape
(1, 9, 768)
s1._.trf_data.tensors[1].shape
(1, 768)
So I tried to manually take the cosine similarity (using this post as a reference):
from scipy.spatial.distance import cosine

def similarity(obj1, obj2):
    (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
    try:
        return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
    except:
        return 0.0
But this does not work.
As @polm23 mentioned, using sentence-transformers is a better approach for getting sentence similarity.
First install the package: pip install sentence-transformers
Then use this code:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Running for president is probably hard.","Space aliens lurk in the night time."]
embedded_list = model.encode(sentences)
similarity = cos_sim(embedded_list[0],embedded_list[1])
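A small usage note, not part of the original answer: cos_sim returns a 1x1 tensor rather than a plain float, so you may want to unwrap it:

print(similarity)         # 1x1 tensor
print(similarity.item())  # plain Python float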
But if you are determined to use spacy for sentence similarity, be aware that the reason your code does not work is that v1 and v2 don't have the same shape, as you can see:
s1._.trf_data.tensors[0].shape --> (1, 9, 768)
s2._.trf_data.tensors[0].shape --> (1, 11, 768)
So it's not possible to compute cosine similarity between these two arrays.
s1._.trf_data.tensors is a tuple consisting of two arrays:
s1._.trf_data.tensors[0] is an array of shape (1, 9, 768), holding one 768-dimensional vector for each of the 9 tokens.
s1._.trf_data.tensors[1] is an array of shape (1, 768) for the whole sentence.
So you can get the similarity from the whole-sentence tensors. Keep in mind that scipy's cosine is a distance and expects 1-D inputs, so take row 0 and subtract from 1:
similarity = 1 - cosine(s1._.trf_data.tensors[1][0], s2._.trf_data.tensors[1][0])
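A self-contained sketch of that approach (assuming en_core_web_trf is installed in your environment):

import spacy
from scipy.spatial.distance import cosine

nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")

# tensors[1] has shape (1, 768); take row 0 to get the 1-D sentence vector
v1 = s1._.trf_data.tensors[1][0]
v2 = s2._.trf_data.tensors[1][0]
similarity = 1 - cosine(v1, v2)  # cosine() returns a distance
print(similarity)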

Inverse Feature Scaling not working while predicting results

# Importing required libraries
import numpy as np
import pandas as pd
# Importing dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1: -1].values
y = dataset.iloc[:, -1].values
y = y.reshape(len(y), 1)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scy = StandardScaler()
scX = StandardScaler()
X = scX.fit_transform(X)
y = scy.fit_transform(y)
# Training SVR model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting results from SVR model
# this line is generating error
scy.inverse_transform(regressor.predict(scX.transform([[6.5]])))
I am trying to execute this code to predict values from the model but after running it I am getting errors like this:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.27861589].
Reshape your data either using an array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Even my instructor is using the same code, but his works and mine doesn't. I am new to machine learning; can anybody tell me what I am doing wrong here?
Thanks for your help.
It is because of the shape of your predictions: regressor.predict returns a 1D array, while scy.inverse_transform expects a 2D array of shape (n_samples, 1). Change your last line to this:
scy.inverse_transform([regressor.predict(scX.transform([[6.5]]))])
You can also use this line to predict:
pred = regressor.predict(scX.transform([[6.5]]))
pred = pred.reshape(-1, 1)
scy.inverse_transform(pred)
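As a side note, recent scikit-learn versions also warn that SVR expects a 1-D target, so fitting with the column-vector y above triggers a DataConversionWarning. A small sketch of the quieter variant (same model, same prediction):

# fit with a flattened target to silence the column-vector warning
regressor.fit(X, y.ravel())

# predict, reshape to 2-D, and undo the target scaling
pred = regressor.predict(scX.transform([[6.5]]))
print(scy.inverse_transform(pred.reshape(-1, 1)))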

How To Plot n Furthest Points From Each Centroid KMeans

I am trying to train a kmeans model on the iris dataset in Python.
Is there a way to plot n furthest points from each centroid using kmeans in Python?
Here is my code so far:
from sklearn import datasets
from sklearn.cluster import KMeans
import numpy as np
# import iris dataset
iris = datasets.load_iris()
X = iris.data[:, 2:5] # use two variables
# plot the two variables to check number of clusters
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1])
# kmeans
km = KMeans(n_clusters = 2, random_state = 0) # Chose two clusters
y_pred = km.fit_predict(X)
X_dist = km.transform(X) # get distances to each centroid
## Stuck at this point: How to make a function that extracts three points that are furthest from the two centroids
max3IdxArr = []
for label in np.unique(km.labels_):
    X_label_indices = np.where(y_pred == label)[0]
    max3Idx = X_label_indices[np.argsort(X_dist[:3])] # This part is wrong
    max3IdxArr.append(max3Idx)
max3IdxArr
# plot
plt.scatter(X[:, 0].iloc[max3IdxArr], X[:, 1].iloc[max3IdxArr])
What you did is np.argsort(X_dist[:3]), which sorts only the first three values of the unsorted X_dist. Instead, sort the whole array first with x = np.argsort(x_dist), and then slice the sorted indices: x[:3] gives the three nearest points, and x[-3:] gives the three furthest.
Feel free to ask if this isn't working.
Cheers
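A minimal sketch of the full idea, per cluster, assuming the km, X, X_dist, and y_pred variables from the question:

import numpy as np
import matplotlib.pyplot as plt

n = 3
furthest_idx = []
for label in np.unique(km.labels_):
    members = np.where(y_pred == label)[0]  # indices of points in this cluster
    dists = X_dist[members, label]          # distance of each member to its own centroid
    furthest_idx.extend(members[np.argsort(dists)[-n:]])  # n largest distances

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(X[furthest_idx, 0], X[furthest_idx, 1], c='red', marker='x')
plt.show()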

How to plot the output of k-means clustering of word embedding using python?

I have used gensim's word embeddings to find vectors for each word. Then I used K-means to find clusters of words. There are close to 10,000 tokens/words and I want to plot them.
I want to plot the result in the following way:
Annotate points with name of words
Different color for clusters
Here is what I have done.
tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)  # , random_state=13)

def tsne_plot(data):
    "Creates a TSNE model and plots it"
    data = data.sample(n=500).reset_index()
    word = data["word"]
    cluster = data["clusters"]
    data = data.drop(["clusters", "word"], axis=1)
    X = tsne.fit_transform(data)
    plt.figure(figsize=(48, 48))
    for i in range(len(X)):
        plt.scatter(X[:, 0][i], X[:, 1][i], c=cluster[i])
        plt.annotate(word[i],
                     xy=(X[:, 0][i], X[:, 1][i]),
                     xytext=(3, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(data)
It annotates the words, but it fails to color the different groups/clusters. Is there any other approach which annotates the points with the word names and colors the different clusters?
This is how it's typically done; with annotations and rainbow colors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.cluster import KMeans
import seaborn as sns

X = np.array([[5, 3],
              [10, 15],
              [15, 12],
              [24, 10],
              [30, 45],
              [85, 70],
              [71, 80],
              [60, 78],
              [55, 52],
              [80, 91]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_)
#plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')
data = X
labels = kmeans.labels_
#######################################################################
plt.subplots_adjust(bottom=0.1)
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='rainbow')
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='red', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
plt.show()
#######################################################################
See the link below for all details.
https://stackabuse.com/k-means-clustering-with-scikit-learn/
See the link below for some samples of how to do annotations with characters, rather than numbers.
https://nikkimarinsek.com/blog/7-ways-to-label-a-cluster-plot-python
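Applied to the original tsne_plot, the key change is to pass the whole cluster column to a single scatter call with a colormap, instead of calling scatter once per point with a scalar color. A hedged sketch, reusing the data, word, and clusters column names from the question:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(data):
    """t-SNE the vectors, then color by cluster and annotate with the word."""
    data = data.sample(n=500).reset_index()
    words = data["word"]
    clusters = data["clusters"]
    vectors = data.drop(["clusters", "word"], axis=1)
    X = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500).fit_transform(vectors)

    plt.figure(figsize=(48, 48))
    plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='rainbow')  # one call; the colormap handles colors
    for i, word in enumerate(words):
        plt.annotate(word, xy=(X[i, 0], X[i, 1]), xytext=(3, 2),
                     textcoords='offset points', ha='right', va='bottom')
    plt.show()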

Visualise word2vec generated from gensim using t-sne

I have trained a doc2vec and corresponding word2vec on my own corpus using gensim. I want to visualise the word2vec using t-sne with the words. As in, each dot in the figure has the "word" also with it.
I looked at a similar question here: t-sne on word2vec
Following it, I have this code :
import gensim
import gensim.models as g
from sklearn.manifold import TSNE
import re
import matplotlib.pyplot as plt

modelPath = "/Users/tarun/Desktop/PE/doc2vec/model3_100_newCorpus60_1min_6window_100trainEpoch.bin"
model = g.Doc2Vec.load(modelPath)
X = model[model.wv.vocab]
print(len(X))
print(X[0])
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[:1000, :])
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
This gives a figure with dots but no words. That is I don't know which dot is representative of which word. How can I display the word with the dot?
Two parts to the answer: how to get the word labels, and how to plot the labels on a scatterplot.
Word labels in gensim's word2vec
model.wv.key_to_index is a dict keyed by the vocabulary words (older gensim versions exposed this as model.wv.vocab). To load the data into X for t-SNE, I made one change.
vocab = list(model.wv.key_to_index)
X = model.wv[vocab]
This accomplishes two things: (1) it gets you a standalone vocab list for the final dataframe to plot, and (2) when you index model, you can be sure that you know the order of the words.
Proceed as before with
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
Now let's put X_tsne together with the vocab list. This is easy with pandas, so import pandas as pd if you don't have that yet.
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
The vocab words are the indices of the dataframe now.
I don't have your dataset, but in the other SO you mentioned, an example df that uses sklearn's newsgroups would look something like
x y
politics -1.524653e+20 -1.113538e+20
worry 2.065890e+19 1.403432e+20
mu -1.333273e+21 -5.648459e+20
format -4.780181e+19 2.397271e+19
recommended 8.694375e+20 1.358602e+21
arguing -4.903531e+19 4.734511e+20
or -3.658189e+19 -1.088200e+20
above 1.126082e+19 -4.933230e+19
Scatterplot
I like the object-oriented approach to matplotlib, so this starts out a little different.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
Lastly, the annotate method will label coordinates. The first two arguments are the text label and the 2-tuple of coordinates. Using iterrows(), this can be very succinct:
for word, pos in df.iterrows():
    ax.annotate(word, pos)
[Thanks to Ricardo in the comments for this suggestion.]
Then do plt.show() or fig.savefig(). Depending on your data, you'll probably have to mess with ax.set_xlim and ax.set_ylim to see into a dense cloud. (The original answer showed the untweaked newsgroup example plot here.)
You can modify dot size, color, etc., too. Happy fine-tuning!
With the following, you can convert your model to a TSV and then use this page for visualization.
with open(self.word_tensors_TSV, 'bw') as file_vector, open(self.word_meta_TSV, 'bw') as file_metadata:
for word in model.wv.vocab:
file_metadata.write((word + '\n').encode('utf-8', errors='replace'))
vector_row = '\t'.join(str(x) for x in model[word])
file_vector.write((vector_row + '\n').encode('utf-8', errors='replace'))
:)
