How to use Tensorflow embeddings in scikit learn models? - python-3.x

I am trying to use text data as input to a linear regression model. I convert my text to vectors using the Universal Sentence Encoder from TensorFlow Hub as a pretrained model, but this gives me tf.Tensors, and now I am not able to split the data into training and testing sets for a scikit-learn linear regression model (my target feature is continuous).
This gives me the embeddings (i.e. a vector of shape (1, 512) for each text in my pandas DataFrame's text column):
import tensorflow_hub as hub
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])
This is how the data looks:
id excerpt target
0 c12129c31 When the young people returned to the ballroom... -0.340259
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372
2 b69ac6792 As Roger had predicted, the snow departed as q... -0.580118
3 dd1000b26 And outside before the palace a great garden w... -1.054013
4 37c1b32fb Once upon a time there were Three Bears who li... 0.247197
This is how the embeddings look:
tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025, 0.02054032, -0.01223458, ..., 0.03468879,
-0.04216784, 0.01212691],
[-0.01053216, 0.01346854, 0.01992477, ..., 0.03078162,
-0.0226634 , 0.04429556],
[-0.10778417, 0.01735378, 0.00803178, ..., 0.00345916,
0.00552441, -0.02448413],
...,
[ 0.0364146 , 0.02996029, -0.06757646, ..., -0.00335971,
-0.01381749, -0.08319554],
[ 0.0042374 , 0.02291174, -0.04473154, ..., -0.02009053,
-0.00428826, -0.06476445],
[-0.0141812 , 0.03879716, 0.03304171, ..., 0.06709221,
-0.05016331, 0.00868828]], dtype=float32)
Now I want to use these embeddings as input to a linear regression model (or any regression model) in scikit-learn, but I am not able to split the data using train_test_split(); it gives me the error TypeError: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])
This is how I am splitting the data:
X_train,X_test,y_train,y_test = train_test_split(embeddings,train['target'],test_size =0.2, shuffle =True)

In train_test_split you are passing a tensor. Instead, you should pass a NumPy array, like this:
X_train,X_test,y_train,y_test = train_test_split(embeddings.numpy(), train['target'],test_size =0.2, shuffle =True)
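For context, here is a minimal sketch of the full flow, assuming train has the 'excerpt' and 'target' columns shown above:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = embeddings.numpy()        # (2834, 512) NumPy array
y = train['target'].values    # continuous target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))   # R^2 on the held-out split
Once converted with .numpy(), the embeddings behave like any other feature matrix, so any scikit-learn regressor can be substituted for LinearRegression.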

Related

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfIdfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    else:
        return 0
df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity's sake, the data has already been labeled as either class '0' or '1', but I am training the model to classify new instances that have not been labeled yet. In short, the data I'm working with has already been labeled; the labels are in the classification column.
Then for my train/test split I do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Where stopwords_set is the collection of English stopwords from the nltk library, and ps is a PorterStemmer from the nltk library (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfIdfTransformer applied to the text, but also retain the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what the error is due to, since I don't know what the size of your dataframe should be; I would need more information. On which line is the error thrown?
Regarding retaining the sentiment column: you could apply CountVectorizer and TfidfTransformer (by the way, you could skip a step and directly apply TfidfVectorizer) only to the text data, and then have another transformer in the pipeline that adds the original sentiment column back before you feed the data to the classifier.
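As a rough sketch of that idea (hedged, not your exact code: it reuses the 'review' and 'sentiment' column names from above and swaps in LogisticRegression, because MultinomialNB rejects the negative sentiment values):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

features = ColumnTransformer([
    # tf-idf only on the text column
    ('tfidf', TfidfVectorizer(analyzer=clean_review), 'review'),
    # keep the sentiment column as an extra feature
    ('sentiment', 'passthrough', ['sentiment']),
])
pipeline2 = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline2.fit(msg_train, label_train)
print(pipeline2.score(msg_test, label_test))
This also avoids handing the whole two-column dataframe to CountVectorizer, which is a common cause of that kind of sample-count mismatch.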

Bunch object not callable - scikit-learn rcv1 dataset

I want to split the built-in RCV1 dataset into train and test sets and apply the k-means algorithm, but while trying to split the data I get an error saying the Bunch object is not callable.
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
x_train = rcv1(subset='train')
Indeed it is not callable; nor is it a dataframe - see the docs. Some extra info is included in the DESCR attribute:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
print(rcv1.DESCR)
Result:
.. _rcv1_dataset:
RCV1 dataset
------------
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually
categorized newswire stories made available by Reuters, Ltd. for research
purposes. The dataset is extensively described in [1]_.
**Data Set Characteristics:**
============== =====================
Classes 103
Samples total 804414
Dimensionality 47236
Features real, between 0 and 1
============== =====================
:func:`sklearn.datasets.fetch_rcv1` will load the following
version: RCV1-v2, vectors, full sets, topics multilabels::
>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()
It returns a dictionary-like object, with the following attributes:
``data``:
The feature matrix is a scipy CSR sparse matrix, with 804414 samples and
47236 features. Non-zero values contains cosine-normalized, log TF-IDF vectors.
A nearly chronological split is proposed in [1]_: The first 23149 samples are
the training set. The last 781265 samples are the testing set. This follows
the official LYRL2004 chronological split. The array has 0.16% of non zero
values::
>>> rcv1.data.shape
(804414, 47236)
``target``:
The target values are stored in a scipy CSR sparse matrix, with 804414 samples
and 103 categories. Each sample has a value of 1 in its categories, and 0 in
others. The array has 3.15% of non zero values::
>>> rcv1.target.shape
(804414, 103)
``sample_id``:
Each sample can be identified by its ID, ranging (with gaps) from 2286
to 810596::
>>> rcv1.sample_id[:3]
array([2286, 2287, 2288], dtype=uint32)
``target_names``:
The target values are the topics of each sample. Each sample belongs to at
least one topic, and to up to 17 topics. There are 103 topics, each
represented by a string. Their corpus frequencies span five orders of
magnitude, from 5 occurrences for 'GMIL', to 381327 for 'CCAT'::
>>> rcv1.target_names[:3].tolist() # doctest: +SKIP
['E11', 'ECAT', 'M11']
The dataset will be downloaded from the `rcv1 homepage`_ if necessary.
The compressed size is about 656 MB.
.. _rcv1 homepage: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/
.. topic:: References
.. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004).
RCV1: A new benchmark collection for text categorization research.
The Journal of Machine Learning Research, 5, 361-397.
So, if you want to stick to the original training & test subsets, as described above, you should simply do:
X_train = rcv1.data[0:23149,]
X_train.shape
# (23149, 47236)
X_test = rcv1.data[23149:,]
X_test.shape
# (781265, 47236)
and similarly for your y_train and y_test, using rcv1.target.
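A minimal sketch of that, mirroring the slicing above (rcv1.target is a sparse multilabel matrix):
y_train = rcv1.target[0:23149,]
y_train.shape
# (23149, 103)
y_test = rcv1.target[23149:,]
y_test.shape
# (781265, 103)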
If you want to use a different training & test partition, use:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    rcv1.data, rcv1.target, test_size=0.33, random_state=42)
adjusting your test_size accordingly.

How to do Text classification using word2vec

I want to perform text classification using word2vec.
I got vectors of words.
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can I perform classification (product vs. non-product)?
You already have the array of word vectors in model.wv.syn0. If you print it, you can see an array with the corresponding vector for each word.
You can see an example here using Python 3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())
train = []
#getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)
# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model; in this example we will train a LogisticRegression classifier.
# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)
# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)
Y_dataset = []
# get the last number of each file. In this case is the department number
# this will be the 0 or 1, or another kind of classification. ( to use words you need to extract them differently, this way is to numbers)
with open("dbSubset.csv", "r") as f:
for line in f:
lastchar = line.strip()[-1]
if lastchar.isdigit():
result = int(lastchar)
Y_dataset.append(result)
else:
result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of word vectors and returns one single vector. You want to avoid the length of the document influencing what this vector represents. You could, for example, choose the mean.
def document_vector(array_of_word_vectors):
    return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play around a bit with distances (for example, cosine distance would be a nice first choice) and see how far certain documents are from each other, or - and that's probably the approach that brings faster results - you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.
The document vectors will become your matrix X and your vector y is an array of 1 and 0, depending on the binary category that you want the documents to be classified into.
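A minimal sketch of that second route, assuming documents is a list of tokenised documents and labels a parallel list of 0/1 labels (both names are placeholders; words missing from the vocabulary are skipped):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# one averaged vector per document
X = np.array([
    document_vector(np.array([model.wv[w] for w in doc if w in model.wv]))
    for doc in documents
])
y = np.array(labels)   # 1 = product, 0 = non product

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))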

How to oversample image dataset using Python?

I am working on a multiclass classification problem with an unbalanced dataset of images (different classes). I tried the imblearn library, but it is not working on the image dataset.
I have a dataset of images belonging to 3 classes, namely A, B and C. A has 1000 images, B has 300 and C has 100. I want to oversample classes B and C so that I can avoid data imbalance. Please let me know how to oversample the image dataset using Python.
Actually, it seems imblearn.over_sampling only resamples 2-D inputs. So one way to oversample your image dataset with this library is to use reshaping alongside it. You can:
reshape your images
oversample them
reshape the new dataset back to the original dims
Suppose you have an image dataset of shape (5000, 28, 28, 3) stored as a NumPy array; following the steps above, you can use the solution below:
# X : current_dataset
# y : labels
from imblearn.over_sampling import RandomOverSampler
reshaped_X = X.reshape(X.shape[0],-1)
#oversampling
oversample = RandomOverSampler()
oversampled_X, oversampled_y = oversample.fit_resample(reshaped_X , y)
# reshaping X back to the first dims
new_X = oversampled_X.reshape(-1,28,28,3)
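If you want to verify the result, a quick check (assuming y is a plain 1-D array of class labels) is to compare the class counts before and after resampling:
from collections import Counter
print(Counter(y))              # e.g. Counter({'A': 1000, 'B': 300, 'C': 100})
print(Counter(oversampled_y))  # RandomOverSampler raises every class to the majority count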
hope that was helpful!

Text Categorization Python with pre-trained data

How can I associate my tf-idf matrix with a category? For example, I have the below data set:
**ID** **Text** **Category**
1 jake loves me more than john loves me Romance
2 july likes me more than robert loves me Friendship
3 He likes videogames more than baseball Interest
Once I calculate tf-idf for each and every sentence, taking the 'Text' column as my input, how would I be able to train the system to associate that row of the matrix with my category above, so that I can reuse it for my test data?
Using the above train dataset, when I pass a new sentence 'julie is a lovely person', I would like that sentence to be categorized into a single or multiple pre-defined categories as above.
I have used the link Keep TFIDF result for predicting new content using Scikit for Python as my starting point to solve this issue, but I was not able to understand how to map the tf-idf matrix for a sentence to a category.
It looks like you have already vectorised the text, i.e. already converted the text to numbers, so that you can use scikit-learn's classifiers. The next step is to train a classifier. You can follow this link. It looks like this:
Vectorization
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(your_text)
Train classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)
Predict on new docs:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new = count_vect.transform(docs_new)
predicted = clf.predict(X_new)
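Putting it together on your three example rows (a hedged sketch: it uses TfidfVectorizer, which combines the CountVectorizer and TfidfTransformer steps, and assumes the 'Text' and 'Category' columns from your table):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'Text': ['jake loves me more than john loves me',
             'july likes me more than robert loves me',
             'He likes videogames more than baseball'],
    'Category': ['Romance', 'Friendship', 'Interest'],
})

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])
clf.fit(df['Text'], df['Category'])

print(clf.predict(['julie is a lovely person']))  # one of the trained categories
The fitted pipeline keeps the tf-idf vocabulary, so new sentences are transformed with the same mapping before prediction; for multiple labels per sentence you would need a multilabel setup (e.g. OneVsRestClassifier) instead.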
