Text classification: ValueError could not convert string to float - scikit-learn

Input for a trained random forest classifier model for text classification
I cannot figure out what the input to the trained model should be after loading it from the pickle file.
with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)
for message in text:
    message1 = [str(message)]
    pred = model.predict(message1)
    list.append(pred)
return list
Expected output: Non political
Actual output:
ValueError: could not convert string to float: 'RT #ScotNational The witness admitted that not all damage inflicted on police cars was caused

You need to encode the text as numbers; no machine learning algorithm can process raw text directly.
More precisely, you need to use a text encoding (the same one used to train the model). Examples of common encodings are Word2vec and TF-IDF.
I suggest playing with sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfTransformer to familiarize yourself with the concept of embedding text as numeric vectors.
However, if you do not use the same encoding as the one used to train the model you load, there is no way you will obtain good results.
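A minimal sketch of what that means in practice: persist the fitted vectorizer together with the classifier at training time, and reuse it to transform new messages before calling predict. The vectorizer.pkl file name and the vectorizer object are assumptions for illustration, not from the original post.
import pickle

# Load the classifier and the vectorizer that was fitted during training
# (saving the vectorizer alongside the classifier is assumed to have happened).
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('text_classifier', 'rb') as f:
    model = pickle.load(f)

def classify(text):
    preds = []
    for message in text:
        features = vectorizer.transform([str(message)])  # same encoding as training
        preds.append(model.predict(features)[0])
    return preds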

Related

Bag of words causing Google Colab to crash?

problem
I am using bag of words to extract features from text, but when I use a model to predict the result, it causes a runtime error.
feature extraction code
bag_of_words = CountVectorizer()
#extracting features from bag of words
x_train_bg = bag_of_words.fit_transform(x_train)
x_test_bg = bag_of_words.transform(x_test)
model prediction
model = GaussianNB()
model.fit(x_train_bg.toarray(), y_train)
y_pred = model.predict(x_test_bg.toarray())
I know that the problem is due to excessive memory usage, so does that mean I cannot use bag of words or TF-IDF for feature extraction? If they can be used, what change needs to be implemented?
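One common workaround, not from the original thread, is to keep the CountVectorizer output sparse and use a classifier that accepts sparse input, such as MultinomialNB, instead of densifying with .toarray(). A minimal sketch, reusing x_train, x_test and y_train from the question:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

bag_of_words = CountVectorizer()
x_train_bg = bag_of_words.fit_transform(x_train)   # stays a sparse matrix
x_test_bg = bag_of_words.transform(x_test)

model = MultinomialNB()
model.fit(x_train_bg, y_train)        # no .toarray(), so no dense memory blow-up
y_pred = model.predict(x_test_bg)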

fasttext produces a different vector after training

Here's my training:
import fasttext
model = fasttext.train_unsupervised('data.txt', model='skipgram')
Now, let's observe the first vector (the full output is omitted for readability):
model.get_input_vector(0)
# array([-0.1988439 , 0.40966552, 0.47418243, 0.148709 , 0.5891477
On the other hand, let's input the first string into our model:
model[data.iloc[0]]
# array([ 0.10782535, 0.3055557 , 0.19097836, -0.15849613, 0.14204402
We get a different vector.
Why?
You should have explained more about the structure of data. In any case, model[data.iloc[0]] is equivalent to model.get_word_vector(data.iloc[0]), so you should be passing a single word to the model.
On the other hand, model.get_input_vector(0) may correspond to a whole sentence. Therefore, if data.iloc[0] is a sentence, compare the result of model.get_input_vector(0) with model.get_sentence_vector(data.iloc[0]). Otherwise, take the first word of the data, pass it to the model, and then compare the vectors.
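A minimal sketch of that comparison, assuming data is a pandas Series holding the raw lines of data.txt (that assumption comes from the question, not from the fasttext API):
import fasttext
import numpy as np

model = fasttext.train_unsupervised('data.txt', model='skipgram')

text = data.iloc[0]          # first line of the corpus (assumed to be a sentence)
word = text.split()[0]       # first token of that line

# model[...] is shorthand for get_word_vector, so these two agree:
assert np.allclose(model[word], model.get_word_vector(word))

# For a whole line, the comparable quantity is the sentence vector:
sentence_vec = model.get_sentence_vector(text)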

Using a pretrained word2vec model

I am trying to use a pretrained word2vec model to create word embeddings, but I am getting the following error when I try to create the weight matrix from the word2vec gensim model:
Code:
import gensim
import numpy as np

w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
EMBEDDING_DIM = 300

# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = model[word]
    return weight_matrix

embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)
I'm getting the following error:
(The full traceback was posted as an image; it reports KeyError: "word 'didnt' not in vocabulary".)
As a note: it's better to paste a full error as formatted text than as an image of text. (See Why not upload images of code/errors when asking a question? for a full list of the reasons why.)
But regarding your question:
If you get a KeyError: word 'didnt' not in vocabulary error, you can trust that the word you've requested is not in the set-of-word-vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained & released back around 2012.)
You could check before looking it up ('didnt' in w2v_model, which would return False) and then do something else. Or you could use a Python try: ... except: ... block to let the error happen, but then do something else when it does.
But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for.
(Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization is stripping such internal-punctuation-marks from contractions, but Google chose not to when making those vectors. But your code should be ready for handling missing words in any case, unless you're sure through other steps that can never happen.)
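A minimal sketch of the membership-check option applied to the question's helper; leaving an all-zero row for missing words is an assumption for illustration, not part of the original code:
import numpy as np

def get_weight_matrix(model, vocab):
    weight_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
    for word, i in vocab.items():
        if word in model:                 # skip words the pretrained vectors lack
            weight_matrix[i] = model[word]
        # otherwise row i stays all zeros for the out-of-vocabulary word
    return weight_matrix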

Wrong length for Gensim Word2Vec's vocabulary

I am trying to train the Gensim Word2Vec model by:
X = train['text']
model_word2vec = models.Word2Vec(X.values, size=150)
model_word2vec.train(X.values, total_examples=len(X.values), epochs=10)
After the training, I get a small vocabulary (model_word2vec.wv.vocab) of length 74, containing only the alphabet's letters.
How could I get the right vocabulary?
Update
I tried this before:
tokenizer = Tokenizer(lower=True)
tokenized_text = tokenizer.fit_on_texts(X)
sequence = tokenizer.texts_to_sequences(X)
model_word2vec.train(sequence, total_examples=len(X.values), epochs=10)
but I got the same wrong vocabulary size.
Supply the model with the kind of corpus it needs: a sequence of texts, where each text is a list-of-string-tokens. If you supply it with non-tokenized strings instead, it will think each single character is a token, giving the results you're seeing.
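For example, a minimal sketch of that fix using simple whitespace tokenization (the split() tokenizer is an assumption for illustration; the column name follows the question):
from gensim.models import Word2Vec

# each text becomes a list of string tokens
sentences = [text.split() for text in train['text']]

# gensim 3.x parameter names, matching the question's size=150
model_word2vec = Word2Vec(sentences, size=150, iter=10)
print(len(model_word2vec.wv.vocab))   # a real word vocabulary, not 74 single letters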

How to convert a word to a vector using an embedding layer in Keras

I have a word embedding file as shown below (click here to see the complete file on GitHub). I would like to know the procedure for generating word embeddings, so that I can generate them for my personal dataset.
in -0.051625 -0.063918 -0.132715 -0.122302 -0.265347
to 0.052796 0.076153 0.014475 0.096910 -0.045046
for 0.051237 -0.102637 0.049363 0.096058 -0.010658
of 0.073245 -0.061590 -0.079189 -0.095731 -0.026899
the -0.063727 -0.070157 -0.014622 -0.022271 -0.078383
on -0.035222 0.008236 -0.044824 0.075308 0.076621
and 0.038209 0.012271 0.063058 0.042883 -0.124830
a -0.060385 -0.018999 -0.034195 -0.086732 -0.025636
The 0.007047 -0.091152 -0.042944 -0.068369 -0.072737
after -0.015879 0.062852 0.015722 0.061325 -0.099242
as 0.009263 0.037517 0.028697 -0.010072 -0.013621
Google -0.028538 0.055254 -0.005006 -0.052552 -0.045671
New 0.002533 0.063183 0.070852 0.042174 0.077393
with 0.087201 -0.038249 -0.041059 0.086816 0.068579
at 0.082778 0.043505 -0.087001 0.044570 0.037580
over 0.022163 -0.033666 0.039190 0.053745 -0.035787
new 0.043216 0.015423 -0.062604 0.080569 -0.048067
I was able to convert each word in a dictionary to the above format by following the steps below:
initially, represent each word in the dictionary by a unique integer
take each integer one by one, wrap it as array([[integer]]), and give it as the input array in the code below
then the word corresponding to the integer and the respective output vector can be stored to a JSON file (I used output_array.tolist() to store the vector in JSON format)
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(dictionary_size_here, sizeof_embedding_vector, input_length=input_length_here))
input_array = np.array([[integer]])  # each integer is fed one by one using a loop
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
Reference
How does Keras 'Embedding' layer work?
It is important to understand that there are multiple ways to generate an embedding for words. The popular word2vec, for example, can generate word embeddings using CBOW or skip-grams.
Hence, one could follow multiple "procedures" to generate word embeddings. One of the easier-to-understand methods (albeit with its drawbacks) is Singular Value Decomposition (SVD). The steps are briefly described below; a short sketch of them follows the links at the end.
Create a term-document matrix, i.e. terms as rows and the documents they appear in as columns.
Perform SVD.
Truncate the output vector for each term to n dimensions. In your example above, n = 5.
You can have a look at this link for a more detailed description of using word2vec's skip-gram model to generate an embedding: Word2Vec Tutorial - The Skip-Gram Model.
For more information on SVD, you can look at this and this.
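A minimal sketch of those three steps with scikit-learn; the toy documents are made up for illustration:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "google released new word vectors",
    "new embeddings for a personal dataset",
]

# 1. Term-document matrix: terms as rows, documents as columns
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs).T   # transpose so rows = terms

# 2-3. SVD, truncated to n dimensions (n = 5 in the embedding file above)
svd = TruncatedSVD(n_components=3)            # small n for this toy corpus
word_vectors = svd.fit_transform(term_doc)

# print each term next to its n-dimensional vector
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for word, vec in zip(terms, word_vectors):
    print(word, vec)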
