The goal is to find a good word-and-phrase embedding model that can do two things:
(1) For the words and phrases I am interested in, it has embeddings.
(2) I can use the embeddings to compare the similarity between any two items (each could be a word or a phrase).
So far I have tried two paths:
1: Some Gensim-loaded pre-trained models, for instance:
import gensim.downloader as api
# download the model and return it as an object ready for use
model = api.load("fasttext-wiki-news-subwords-300")
model.similarity('computer-science', 'machine-learning')
The problem with this path is that I cannot tell in advance whether a phrase has an embedding. For this example, I got this error:
KeyError: "word 'computer-science' not in vocabulary"
I would have to try different pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, glove-twitter-200, etc. The results are similar: there are always phrases of interest that have no embeddings.
Then I tried a BERT-based sentence-embedding method: https://github.com/UKPLab/sentence-transformers.
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))

phrase_1 = 'baby girl'
phrase_2 = 'annual report'
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1, embedding_2)
Using this method I was able to get embeddings for my phrases, but the similarity score was 0.93, which did not seem reasonable.
So what else can I try to achieve the two goals mentioned above?
The problem with the first path is that you are loading fastText embeddings as if they were word2vec embeddings, and word2vec cannot cope with out-of-vocabulary (OOV) words.
The good thing is that fastText can handle OOV words.
You can use Facebook's original implementation (pip install fasttext) or Gensim's implementation (see the sketch after the example below).
For example, using Facebook's implementation, you can do:
import fasttext
import fasttext.util
# download an english model
fasttext.util.download_model('en', if_exists='ignore') # English
model = fasttext.load_model('cc.en.300.bin')
# get word embeddings
# (if instead you want sentence embeddings, use get_sentence_vector method)
word_1='computer-science'
word_2='machine-learning'
embedding_1=model.get_word_vector(word_1)
embedding_2=model.get_word_vector(word_2)
# compare the embeddings
cosine_similarity(embedding_1, embedding_2)
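If you prefer the Gensim route instead, here is a minimal sketch (my own illustration, assuming the same cc.en.300.bin file downloaded above and gensim >= 3.8):
# Gensim route: load the same Facebook .bin file and query it through
# FastTextKeyedVectors, which also handles OOV words/phrases by building
# vectors from character n-grams.
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('cc.en.300.bin')   # path to the downloaded model

# works even if the exact strings are not in the vocabulary
print(wv.similarity('computer-science', 'machine-learning'))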
When I tried to get word embeddings for a sentence using Bio_ClinicalBERT, for a sentence of 8 words I got 11 token IDs (plus [CLS] and [SEP]) because "embeddings" is an out-of-vocabulary word/token that gets split into em, ##bed, ##ding, ##s.
I would like to know whether there are any aggregation strategies that make sense, apart from taking the mean of these vectors.
import torch
from transformers import AutoTokenizer, AutoModel

# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentences = ['This framework generates embeddings for each input sentence']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

print(encoded_input['input_ids'].shape)
Output:
torch.Size([1, 13])
for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))
Output:
[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
To my knowledge, mean aggregation is the most commonly used tool here, and there is even scientific literature empirically showing that it works well:
Generalizing Word Embeddings using Bag of Subwords by Zhao, Mudgal and Liang. Formula 1 describes exactly what you are proposing as well.
The one alternative you could theoretically employ is a mean aggregate over the entire input, essentially making a "context prediction" over all words (potentially except "embeddings"), thereby emulating something similar to the [MASK]ing used during training of transformer models. But this is just a suggestion from me, without any scientific evidence to back up whether it works (for better or worse).
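As a rough sketch of what mean aggregation looks like in code (my own illustration, assuming the default fast tokenizer so that word_ids() is available), you can pool the subword vectors back into one vector per word like this:
# Mean-pool subword token embeddings back into one vector per word.
# Assumes a fast tokenizer (the AutoTokenizer default), which provides word_ids().
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentence = 'This framework generates embeddings for each input sentence'
encoded = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, hidden_size)

# word_ids() maps each token position to its word index (None for [CLS]/[SEP])
word_ids = encoded.word_ids(batch_index=0)
num_words = max(i for i in word_ids if i is not None) + 1

word_vectors = []
for w in range(num_words):
    positions = [p for p, i in enumerate(word_ids) if i == w]
    word_vectors.append(token_embeddings[positions].mean(dim=0))

word_vectors = torch.stack(word_vectors)  # (8, 768) for the example sentence
print(word_vectors.shape)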
I have a pandas dataframe containing descriptions. I would like to cluster the descriptions by meaning using CBOW. My challenge for now is to embed each row (document) as a vector of equal dimensions. First I train the word vectors using gensim like so:
import pandas as pd
from gensim.models import Word2Vec

vocab = pd.concat((df['description'], df['more_description']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
I am, however, a bit confused now about how to replace the full sentences in my df with document vectors of equal dimensions.
For now, my workaround is replacing each word in each row with a vector and then applying PCA dimensionality reduction to bring each row to the same number of dimensions. Is there a better way of doing this through gensim, so that I could write something like this:
df['description'].apply(model.vectorize)
I think you are looking for sentence embeddings. There are a lot of ways to generate a sentence embedding from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings
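As one concrete option (a sketch of my own, not from that thread): average the word vectors of every token in a description to get one fixed-size vector per row, assuming the trained Word2Vec model from the question is in model (100-dimensional vectors):
# Average word vectors per row to get a fixed-size document vector.
import numpy as np
from gensim.utils import simple_preprocess

def document_vector(text, model, dim=100):
    # keep only tokens the model actually knows
    tokens = [t for t in simple_preprocess(str(text)) if t in model.wv]
    if not tokens:
        return np.zeros(dim)
    return np.mean([model.wv[t] for t in tokens], axis=0)

df['doc_vector'] = df['description'].apply(lambda text: document_vector(text, model))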
I trained a word2vec model using gensim and I want to randomly select vectors from it and find the corresponding words.
What is the best way to do so?
If your Word2Vec model instance is in the variable model, then there is a list of all words known to the model in model.wv.index2word. (The property names are slightly different in other versions of gensim.)
So you can pick one item using the choice() function from Python's random module:
import random

print(random.choice(model.wv.index2word))
If you want to get n random words (keys) from a word2vec model with Gensim 4.0.0, just use random.sample:
import random
import gensim

# Here we use Gensim 4.0.0
w2v = gensim.models.KeyedVectors.load_word2vec_format("model.300d")

# Get 10 random words (keys) from the word2vec model
random_words = random.sample(w2v.index_to_key, 10)
print("Random words: " + str(random_words))
Piece of cake :)
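If you literally want to start from a random vector and recover its word (my own addition, sketched for Gensim 4.x), picking a random index gives you both at once:
import random

# Pick a random index: index_to_key gives the word, vectors gives its embedding.
i = random.randrange(len(w2v.index_to_key))
word = w2v.index_to_key[i]
vector = w2v.vectors[i]
print(word, vector[:5])

# Going the other way, similar_by_vector recovers the nearest word for a vector.
print(w2v.similar_by_vector(vector, topn=1))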
I am a bit new to gensim and right now I am trying to solve a problem that involves using doc2vec embeddings in Keras. I wasn't able to find an existing implementation of doc2vec in Keras; in all the examples I have found so far, everyone just uses gensim to get the document embeddings.
Once I have trained my doc2vec model in gensim, I need to export the embedding weights from gensim into Keras somehow, and it is not really clear how to do that. I see that
model.syn0
supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for document embeddings. Any advice?
I know that in general I can just get the embeddings for each document directly from the gensim model, but I want to fine-tune the embedding layer in Keras later on, since the doc embeddings will be used as part of a larger task and hence might be fine-tuned a bit.
I figured this out.
Assuming you already trained the gensim model and used string tags as document ids:
#get vector of doc
model.docvecs['2017-06-24AEON']
#raw docvectors (all of them)
model.docvecs.doctag_syn0
#docvector names in model
model.docvecs.offset2doctag
You can export these doc vectors into a Keras embedding layer as shown below, assuming your DataFrame df contains all of the documents. Note that the embedding layer accepts only integer inputs; I use the row number in the dataframe as the id of the doc. Also note that the embedding layer requires index 0 to be left untouched, since it is reserved for masking, so when I pass the doc id as input to my network I need to make sure it is > 0.
import numpy as np
from keras.layers import Input, Embedding

# creating the embedding matrix
embedding_matrix = np.zeros((len(df) + 1, text_encode_dim))
for i, row in df.iterrows():
    embedding = modelDoc2Vec.docvecs[row['docCode']]
    embedding_matrix[i + 1] = embedding

# input with the id of the document
doc_input = Input(shape=(1,), dtype='int16', name='doc_input')

# embedding layer initialized with the matrix created earlier
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df) + 1, weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)
UPDATE
Since late 2017, with the introduction of the Keras 2.0 API, the very last line should be changed to the following (note the additional from keras.initializers import Constant import it needs):
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df) + 1, embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)
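As a quick sanity check (my own sketch, not part of the original answer), you can wrap the lookup in a tiny model and confirm that a doc id returns its Doc2Vec vector:
import numpy as np
from keras.models import Model

# Tiny model that just performs the embedding lookup defined above.
lookup = Model(inputs=doc_input, outputs=embedded_doc_input)
doc_ids = np.array([[1], [2], [3]], dtype='int16')   # dataframe rows 0, 1, 2, shifted by +1
vectors = lookup.predict(doc_ids)                    # shape: (3, 1, text_encode_dim)
print(np.allclose(vectors[0, 0], embedding_matrix[1]))  # expected: True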
I have created word vectors using a distributed word2vec algorithm. Now I have the words and their corresponding vectors. How do I build a gensim word2vec model from these words and vectors?
I am not sure whether you created the word2vec model using gensim or some other tool, but if I understand your question correctly, you just want to load the word2vec model with gensim. This is done in the following way:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(WORD2VEC_PATH, binary=True)  # binary=False if the vectors are stored in the plain-text word2vec format
If, however, what you want to do is to train a word2vec model from scratch (i.e. from raw text) using pure gensim, here is a tutorial on how to train a word2vec model with gensim.
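If instead the words and vectors only exist as in-memory arrays (which is how I read the question), one sketch for Gensim 4.x is to put them into a KeyedVectors object directly and save/load them in the standard word2vec format; the data below is made up purely for illustration:
# Build a KeyedVectors object from raw words + vectors (Gensim 4.x).
import numpy as np
from gensim.models import KeyedVectors

words = ['computer', 'science', 'machine', 'learning']        # example data
vectors = np.random.rand(len(words), 300).astype(np.float32)  # example data

kv = KeyedVectors(vector_size=vectors.shape[1])
kv.add_vectors(words, vectors)                      # Gensim 3.x uses kv.add(...) instead
kv.save_word2vec_format('my_vectors.bin', binary=True)

reloaded = KeyedVectors.load_word2vec_format('my_vectors.bin', binary=True)
print(reloaded.most_similar('computer', topn=2))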