Word2vec: add external word to every context - nlp

I'm looking for a simple "hack" to implement the following idea: I want to have a specific word appear artificially in the context of every word (the underlying goal is to try and use word2vec for supervised sentence classification).
An example is best:
Say I have the sentence: "The dog is in the garden", and a window of 1.
So we would get the following pairs of (target, context):
(dog, The), (dog, is), (is, dog), (is, in), etc.
But what I would like to feed to the word2vec algo is this:
(dog, The), (dog, is), **(dog, W)**, (is, dog), (is, in), **(is, W)**, etc.,
as if my word W was in the context of every word.
where W is a word of my choosing, not in the existing vocabulary.
Is there an easy way to do this in R or Python?

I imagine you have a list of sentences and a list of labels for each sentence:
sentences = [
    ["The", "dog", "is", "in", "the", "garden"],
    ["The", "dog", "is", "not", "in", "the", "garden"],
]
Then you created the word-context pairs:
word_context = [("dog", "The"), ("dog", "is"), ("is", "dog"), ("is", "in") ...]
Now if for each sentence you have a label, you can add labels to context of all words:
labels = [
    "W1",
    "W2",
]
word_labels = [
    (word, label)
    for sent, label in zip(sentences, labels)
    for word in sent
]
word_context += word_labels
That works unless you need to preserve the original order of the word-context pairs, in which case you would interleave the label pairs instead of appending them at the end.

Take a look at the 'Paragraph Vectors' algorithm, implemented as the class Doc2Vec in Python's gensim. In it, each text example gets an extra pseudo-word that essentially floats over the full example, contributing itself to every skip-gram-like (PV-DBOW in Paragraph Vectors terms) or CBOW-like (PV-DM) training context.
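For instance, a minimal Doc2Vec sketch along those lines (assuming a gensim 4.x API; older versions use size= and model.docvecs instead), where the per-sentence label plays the role of the extra pseudo-word W:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    ["The", "dog", "is", "in", "the", "garden"],
    ["The", "dog", "is", "not", "in", "the", "garden"],
]
labels = ["W1", "W2"]  # one pseudo-word per sentence

# each sentence becomes a TaggedDocument whose tag is the label pseudo-word
docs = [TaggedDocument(words=sent, tags=[label])
        for sent, label in zip(sentences, labels)]

# dm=0 selects PV-DBOW; dbow_words=1 also trains ordinary word vectors alongside
model = Doc2Vec(docs, vector_size=100, window=1, min_count=1,
                dm=0, dbow_words=1, epochs=50)

print(model.dv["W1"])                # vector learned for the label pseudo-word
print(model.wv.most_similar("dog"))  # ordinary word vectors are still available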
Also take a look at Facebook's 'FastText' paper and library. It's essentially an extension of word2vec in two different directions:
First, it has the option of learning vectors for subword fragments (character n-grams) so that future unknown words can get rough-guess vectors from their subwords.
Second, it has the option of trying to predict not just nearby-words during vector-training, but also known-classification-labels for the containing text example (sentence). As a result, the learned word-vectors may be better for subsequent classification of other future sentences.
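For the classification-label mode, here is a rough sketch using Facebook's fasttext Python package (the file name and parameter values are illustrative assumptions; the training file holds one example per line, prefixed with __label__):

import fasttext

# train.txt contains lines like: "__label__W1 the dog is in the garden"
model = fasttext.train_supervised(input="train.txt", dim=100, wordNgrams=2, epoch=25)

print(model.predict("the dog is in the garden"))  # predicted label(s) with probabilities
print(model.get_word_vector("dog"))               # word vector learned along the way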

Related

Transformers get named entity prediction for words instead of tokens

This is a very basic question, but I spent hours struggling to find the answer. I built an NER model using Hugging Face transformers.
Say I have input sentence
input = "Damien Hirst oil in canvas"
I tokenize it to get
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
tokenized = tokenizer.encode(input) #[101, 12587, 7632, 12096, 3514, 1999, 10683, 102]
Feed tokenized sentence to the model to get predicted tags for the tokens
['B-ARTIST' 'B-ARTIST' 'I-ARTIST' 'I-ARTIST' 'B-MEDIUM' 'I-MEDIUM'
'I-MEDIUM' 'B-ARTIST']
prediction comes as output from the model. It assigns tags to different tokens.
How can I recombine this data to obtain tags for words instead of tokens? So I would know that
"Damien Hirst" = ARTIST
"Oil in canvas" = MEDIUM
There are two questions here.
Annotating Token Classification
A common sequential tagging scheme, especially in Named Entity Recognition, labels the first token of an entity of type X as B-X and the remaining tokens of that entity as I-X.
The problem is that most annotated datasets are tokenized with space! For example:
[CLS] O
Damien B-ARTIST
Hirst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
[SEP] O
where O indicates that it is not a named-entity, B-ARTIST is the beginning of the sequence of tokens labelled as ARTIST and I-ARTIST is inside the sequence - similar pattern for MEDIUM.
At the time I posted this answer, there is an example of NER in the huggingface documentation here:
https://huggingface.co/transformers/usage.html#named-entity-recognition
The example doesn't exactly answer the question here, but it can add some clarification. The similar style of named entity labels in that example could be as follows:
label_list = [
    "O",         # not a named entity
    "B-ARTIST",  # beginning of an artist name
    "I-ARTIST",  # an artist name
    "B-MEDIUM",  # beginning of a medium name
    "I-MEDIUM",  # a medium name
]
Adapt Tokenizations
With all that said about the annotation schema, BERT and several other models use a different (subword) tokenization model. So we have to align these two tokenizations.
In this case with bert-base-uncased, the expected outcome is like this:
damien B-ARTIST
hi I-ARTIST
##rst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
To get this done, you can go through each token in the original annotation, tokenize it, and repeat its label for each resulting subtoken:
tokens_old = ['Damien', 'Hirst', 'oil', 'in', 'canvas']
labels_old = ["B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]
label2id = {label: idx for idx, label in enumerate(label_list)}
tokens, labels = zip(*[
    (token, label)
    for token_old, label in zip(tokens_old, labels_old)
    for token in tokenizer.tokenize(token_old)
])
When you add [CLS] and [SEP] to the tokens, their labels "O" must be added to labels as well.
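For example, a minimal sketch of that wrapping (my illustration, not from the original answer):

# wrap the aligned tokens/labels with BERT's special tokens and the "O" label
tokens = ["[CLS]"] + list(tokens) + ["[SEP]"]
labels = ["O"] + list(labels) + ["O"]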
With the tokenize-and-repeat code above, it is possible to get into a situation where a beginning tag like B-ARTIST gets repeated when the first word of an entity splits into pieces. According to the huggingface documentation, you can encode the labels of those extra pieces as -100 so they are ignored:
https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
Something like this should work:
tokens, labels = zip(*[
    (token, label2id[label] if (label[:2] != "B-" or i == 0) else -100)
    for token_old, label in zip(tokens_old, labels_old)
    for i, token in enumerate(tokenizer.tokenize(token_old))
])

Word vocabulary generated by Word2vec and Glove models are different for the same corpus

I'm using CONLL2003 dataset to generate word embeddings using Word2vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is different from (much smaller than) glove.dictionary.
Here is the code:
Word2Vec:
from gensim.models import Word2Vec

word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output: len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example:
result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German', 'advice',
  'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine', 'whether', 'mad',
  'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary', 'committee',
  'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy', 'sheepmeat', 'countries',
  'Britain', 'scientific', 'advice', 'clearer', '.']]
There are totally 13517 sentences in the result list.
Can someone please explain why the list of words for which the embeddings are created are drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncratic compared to what a larger sampling would show, and can't be subtly balanced against many other word representations in the manner that's best.
But further, given that in typical word-distributions there are many such low-frequency words, altogether they also tend to make the word-vectors for other more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other more-important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, obtaining more data so that those words are no longer rare is the best approach.
You can also use a lower min_count, including as low as min_count=1 to retain all words. But often discarding such rare words is better for whatever end-purpose for which the word-vectors will be used.
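As a hedged illustration of the effect, using the gensim 3.x API from the question (in gensim 4.x the argument is vector_size and the vocabulary lives in model.wv.key_to_index):

from gensim.models import Word2Vec

model_default = Word2Vec(result, size=100, window=5, sg=1)               # min_count defaults to 5
model_all     = Word2Vec(result, size=100, window=5, sg=1, min_count=1)  # keep every word

print(len(model_default.wv.vocab))  # roughly the 4653 reported above
print(len(model_all.wv.vocab))      # much closer to GloVe's 22833, since nothing is discarded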

find bigram using gensim

This is my code:
from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
I want it to detect "the_hill_have_eyes", but the output is
['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
Phrases is a purely-statistical method for combining some unigram-token-pairs to new bigram-tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.
Note especially that:
even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see
since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data
In particular with regard to that last point and your example, the way min_count enters the Phrases default scoring means even min_count=1 isn't low enough to create bigrams for which there is only a single example in the training corpus.
So, if you expand your training corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training on larger, realistic corpora, so anything you see in tiny contrived examples may not generalize to real uses.
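To make the scoring point concrete, here is a small sketch of (what I understand to be) the default Phrases scorer; treat the exact formula as an assumption to check against your gensim version:

# score = (bigram_count - min_count) / count(a) / count(b) * vocab_size
def default_phrases_score(count_a, count_b, count_ab, vocab_size, min_count):
    return (count_ab - min_count) / count_a / count_b * vocab_size

# a pair seen exactly once, with min_count=1, always scores 0.0,
# so it can never exceed any positive threshold
print(default_phrases_score(count_a=1, count_b=1, count_ab=1, vocab_size=12, min_count=1))  # 0.0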
What you want is not actually bigrams but "fourgrams".
This can be achieved by doing something like this (my old piece of code I wrote some months ago):
from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, Phraser

# read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)

sent = [u'trees', u'graph', u'minors']
# look for phrases in "sent"
print(bigram[sent])
# output: [u'trees_graph', u'minors']

# to create the bigrams from your own unigram sentences
bigram_model = Phrases(unigram_sentences)
# apply the trained model to every sentence
bigram_sentences = [bigram_model[unigram_sentence] for unigram_sentence in unigram_sentences]
# train a trigram model on top of the bigram output
trigram_model = Phrases(bigram_sentences)
So here you have a trigram model (detecting 3 words together) and you get the idea on how to implement fourgrams.
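For instance, continuing from the variables above (a rough sketch; whether "the_hill_have_eyes" actually emerges still depends on the corpus statistics and parameters):

# apply the trigram model to the bigram output, then train one more Phrases level
trigram_sentences = [trigram_model[sentence] for sentence in bigram_sentences]
fourgram_model = Phrases(trigram_sentences, min_count=1, threshold=1)
fourgram = Phraser(fourgram_model)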
Hope this helps. Good luck.

Some diverging issues of Word2Vec in Gensim using high alpha values

I am implementing word2vec in gensim on a corpus of nested lists (a collection of tokenized words, one list per sentence) with 408226 sentences (lists) and a total of 3150546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with size=200, window=15, min_count=5, iter=10 and alpha=0.5. All words are lemmatized, and they are all fed to the model, which ends up with a vocabulary of 32716.
The results obtained with the default alpha, size, and window are meaningless for my data when computing similarity values. However, a higher alpha of 0.5 gives me some meaningful similarity scores between two words. Yet when I calculate the top-n similar words, the results are again meaningless. Do I need to change all the parameters used in the initial training?
I am still unable to find the exact reason why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, yet gives meaningless top-n similar words (with scores) for an input word. Why is this the case?
Is the model diverging rather than converging to an optimal solution? How can I check this?
Any idea why is it the case is deeply appreciated.
Note: I'm using Python 3.7 on Windows machine with anaconda prompt and giving input to the model from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast
path = "F:/Folder/"
def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt', 'data_d2.txt']:
        with open(path + file, 'r', encoding='utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))
load_data()
def initialize_word_embedding():
    model = Word2Vec(Sentences, size=200, window=15, min_count=5, iter=10, workers=4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1='structure', w2='_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word, score)
initialize_word_embedding()
The example of Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
The files data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it to these files, and now give them as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most_similar words of a given input word. I am getting some meaningful scores from the model.wv.similarity() method, but when calculating the most_similar() words of a word (say, system, as shown above), I am not getting the desired results.
I am guessing the model is diverging from the global minimum because of the high alpha value.
I am confused about what the dimension size and window should be to induce meaningful results, as there are no firm rules for choosing them.
Any suggestion is appreciated. The total numbers of sentences and words are specified above in the question.
Results I am getting without setting alpha = 0.5:
Edit to Recent Comment:
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
which is meaningless to me, as it is far too low; but I get about 0.71 by setting alpha to 0.5, which seems meaningful to me since the word set is the same in both domains.
Explanation: The word set should be the same for both domains (I am comparing data from two domains that share the same word). Don't be confused by the word _set_: it is the same word as set, with a _ character injected at the start and end to distinguish the two domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value ~0.00 for the word set across the two different datasets?

Keras - Text preprocessing

The goal:
To generate text in an author's style.
Input: an author's work to train on, and a seed for a prediction
Output: generated text from that seed
Question about the embedding layer in keras:
I have raw text: a flat text file containing a few thousand lines of text. I want to feed this into an embedding layer in Keras to vectorize the data. Here is what I have as text:
--SNIP
The Wild West\n Ha ha, ride\n All you see is the sun reflectin\' off of the
--SNIP
and I call it input_text:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

num_words = 2000  # keep the 2000 most frequent words
tok = Tokenizer(num_words)  # tokenize the words
tok.fit_on_texts(input_text)  # takes in a list of texts to train on

# put all words from the text into a words array
# (this is essentially enumerating them)
words = []
for iter in range(num_words):
    words += [key for key, value in tok.word_index.items() if value == iter + 1]
# words[:10]

# Class for vectorizing texts, or/and turning texts into sequences
# (= list of word indexes, where the word of rank i in the dataset (starting at 1) has index i).
X_train = tok.texts_to_sequences(input_text)  # turns text into sequences of word indices
X_train = sequence.pad_sequences(X_train, maxlen=100)  # pad/truncate each sequence to length 100
y_train = words
The problem:
It seems that my code takes in the sequences, but when I apply padding it only keeps the first 100 elements of each sequence. How should I break it apart?
Should I take the entire sequence and go through the first 100 words (X), and give the next one (Y) and do some skips along the way?
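For what it's worth, here is a rough sketch of that windowing idea (the step size and the flattening of input_text into one index stream are my assumptions, not from the original code):

import numpy as np

# flatten the per-line sequences into one long stream of word indices
seq = [idx for line in tok.texts_to_sequences(input_text) for idx in line]

window, step = 100, 3  # illustrative values

X, y = [], []
for i in range(0, len(seq) - window, step):
    X.append(seq[i:i + window])  # 100 words of context
    y.append(seq[i + window])    # the word that follows them
X = np.array(X)
y = np.array(y)  # target for a softmax over the vocabulary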
I want the output to be the probability of the next word coming up. So I have a softmax layer at the end. Essentially I want to generate text from a seed. Is this the correct way of going about doing this? or is it just better
I think you will not find a better answer anywhere than on this page here. By the way, the code is also available on GitHub; dive in or ask more questions.
