Wrong length for Gensim Word2Vec's vocabulary - nlp

I am trying to train the Gensim Word2Vec model by:
X = train['text']
model_word2vec = models.Word2Vec(X.values, size=150)
model_word2vec.train(X.values, total_examples=len(X.values), epochs=10)
after the training, I get a small vocabulary (model_word2vec.wv.vocab) of length 74 containing only the alphabet's letters.
How could I get the right vocabulary?
Update
I tried this before:
tokenizer = Tokenizer(lower=True)
tokenized_text = tokenizer.fit_on_texts(X)
sequence = tokenizer.texts_to_sequences(X)
model_word2vec.train(sequence, total_examples=len(X.values), epochs=10
but I got the same wrong vocabulary size.

Supply the model with the kind of corpus it needs: a sequence of texts, where each text is a list-of-string-tokens. If you supply it with non-tokenized strings instead, it will think each single character is a token, giving the results you're seeing.

Related

Bag of words causing google colab to crash?

problem
I am using bag of words for extracting features from text but when i used a model to predict the result, it is causing runtime error
feature extraction code
bag_of_words = CountVectorizer()
#extracting features from bag of words
x_train_bg = bag_of_words.fit_transform(x_train)
x_test_bg = bag_of_words.transform(x_test)
model prediction
model = GaussianNB()
model.fit(x_train_bg.toarray(), y_train)
y_pred = model.predict(x_test_bg.toarray())
I know that problem is due to excessive memory usage so does that mean that i cannot use bag of words or tfidf for feature extraction , if it can be used then what's the change that needs to be implemented

Should BERT embeddings be made on tokens or sentences?

I am making a sentence classification model and using BERT word embeddings in it. Due to very large dataset, I combined all the sentences together in one string and made embeddings on the tokens generated from those.
s = " ".join(text_list)
len(s)
Here s is the string and text_list contains the sentences on which I want to make my word embeddings.
I then tokenize the string
stokens = tokenizer.tokenize(s)
My question is, will BERT perform better on whole sentence given at a time or making embeddings on tokens for whole string is also fine?
Here is the code for my embedding generator
pool = []
all = []
i=0
while i!=600000:
stokens = stokens[i:i+500]
stokens = ["[CLS]"] + stokens + ["[SEP]"]
input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)
a, b= embedd(input_ids, input_masks, input_segments)
pool.append(a)
all.append(b)
print(i)
i+=500
What essentially I am doing here is, I have the string length of 600000 and I take 500 tokens at a time and generate embdedings for it and append it in a list call pool.
For classification, you don't have to concatenate the sentences. By concatenating, you are merging the sentences of different classes.
If it is BERT fine-tuning, by default, for the classification task a logistic regression layer is learnt on top of [CLS] token. Since, its attention based transformer model, it assumes that each token has seen the other tokens and has captured the context. Thus [CLS] token is sufficient.
However, if you want to use the embeddings, you can learn a classifier on single vector,i.e, embeddings [CLS] token or averaged embeddings of all the tokens. Or, you can get the embeddings for each token and form a sequence to learn it using other classifiers such as CNN or RNN.

Using pretrained word2vector model

I am trying to use a pretrained word2vector model to create word embeddings but i am getting the following error when Im trying to create weight matrix from word2vec genism model:
Code:
import gensim
w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
EMBEDDING_DIM=300
# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
# total vocabulary size plus 0 for unknown words
vocab_size = len(vocab) + 1
# define weight matrix dimensions with all 0
weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
# step vocab, store vectors using the Tokenizer's integer mapping
for word, i in vocab.items():
weight_matrix[i] = model[word]
return weight_matrix
embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)
Im getting the following error:
Error
As a note: it's better to paste a full error is as formatted text than as an image of text. (See Why not upload images of code/errors when asking a question? for a full list of the reasons why.)
But regarding your question:
If you get a KeyError: word 'didnt' not in vocabulary error, you can trust that the word you've requested is not in the set-of-word-vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained & released back around 2012.)
You could check before looking it up – 'didnt' in w2v_model, which would return False, and then do something else. Or you could use a Python try: ... catch: ... formulation to let it happen, but then do something else when it happens.
But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for.
(Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization is stripping such internal-punctuation-marks from contractions, but Google chose not to when making those vectors. But your code should be ready for handling missing words in any case, unless you're sure through other steps that can never happen.)

Cannot reproduce pre-trained word vectors from its vector_ngrams

Just curiosity, but I was debugging gensim's FastText code for replicating the implementation of Out-of-Vocabulary (OOV) words, and I'm not being able to accomplish it.
So, the process i'm following is training a tiny model with a toy corpus, and then comparing the resulting vectors of a word in the vocabulary. That means if the whole process is OK, the output arrays should be the same.
Here is the code I've used for the test:
from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes
# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
['hey', 'hello', 'another', 'test']]
# Instatiate FastText gensim's class
ft = FastText(sg=1, size=5, min_count=1, \
window=2, hs=0, negative=20, \
seed=0, workers=1, bucket=100, \
min_n=3, max_n=4)
# Build vocab
ft.build_vocab(sentences)
# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)
# Save model
ft.save('./ft.model')
del ft
# Load model
ft = FastText.load('./ft.model')
# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash ngrams to its corresponding index, just as Gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
word_vec += ft.wv.vectors_ngrams[nh]
# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))
The output of this script is False for every dimension of the compared arrays.
It would be nice if someone could point me out if i'm missing something or doing something wrong. Thanks in advance!
The calculation of a full word's FastText word-vector is not just the sum of its character n-gram vectors, but also a raw full-word vector that's also trained for in-vocabulary words.
The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282
The raw full-word vectors are in a .vectors_vocab array on the model.wv object.
(If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the ft_ngram_hashes() method of the library – if not, your manual ngram-list-creation and subsequent hashing may be doing something different.)

Text classification: value error couldn't convert str to float

Input for random forest classifier trained model for text classification
I am not able to know what should be the input for the trained model after opening the model from the pickle file.
with open('text_classifier', 'rb') as training_model:
model = pickle.load(training_model)
for message in text:
message1 = [str(message)]
pred = model.predict(message1)
list.append(pred)
return list
Expected output: Non political
Actual output :
ValueError: could not convert string to float: 'RT #ScotNational The
witness admitted that not all damage inflicted on police cars was
caused
You need to encode the text as numbers. No machine algorithm can process text directly.
More precisely, you need to use a word embedding (the same used for training the model). Example of common word embeddings are Word2vec, TF-IDF.
I suggest you to play with sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfTransformer to familiarize yourself with the concept of embedding.
However, if you do not use the same embedding as the one used to train the model you load, there is no way you will obtain good results.

Resources