I wonder if TfidfVectorizer keeps the order of the features when transforming documents using scikit-learn. Here is what I am doing:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['this movie is cool', 'I love this book']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
joblib.dump(vec, './vec')
doc = 'What are the coolest movies in 2015'
vec = joblib.load('./vec')
X_test = vec.transform([doc])
Now, my question is: are the feature columns in X and X_test in the same order?
Yes. When you call fit(), the vectorizer builds a vocabulary dictionary that maps each term to a column index, and transform() uses that same mapping for any later data. The mapping is part of the fitted object, so it survives serialization and deserialization.
vec.vocabulary_
> {u'book': 0, u'cool': 1, u'is': 2, u'love': 3, u'movie': 4, u'this': 5}
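A quick way to check this yourself is to round-trip a fitted vectorizer through serialization and compare the vocabulary and the column count. This is only a minimal sketch: it uses the standard-library pickle module instead of joblib (an assumption made to avoid file I/O), and the test sentence is made up.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['this movie is cool', 'I love this book']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# Serialize and restore the fitted vectorizer (in memory, no file needed).
restored = pickle.loads(pickle.dumps(vec))

# The term -> column index mapping survives the round trip unchanged.
same_vocabulary = restored.vocabulary_ == vec.vocabulary_

# transform() reuses that mapping, so the new matrix has the same columns.
X_test = restored.transform(['this book is cool'])
same_width = X_test.shape[1] == X.shape[1]

# Column index of a known term, looked up through the vocabulary.
cool_col = restored.vocabulary_['cool']
print(same_vocabulary, same_width, X_test[0, cool_col] > 0)
```

Words that were not seen during fit() are simply ignored by transform(), so they never shift the existing columns.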
I am using K-means clustering with TF-IDF using the scikit-learn library. I understand that K-means uses distance to create clusters and that the distance is computed between points such as (x-axis value, y-axis value), but TF-IDF is a single numerical value. My question is: how does K-means convert this TF-IDF value into an (x, y) value?
TF-IDF isn't a single value (i.e. a scalar). For every document it returns a vector, and each entry in that vector corresponds to one word in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
sent1 = "the quick brown fox jumps over the lazy brown dog"
sent2 = "mr brown jumps over the lazy fox"
corpus = [sent1, sent2]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.todense())
[out]:
matrix([[0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
0. , 0.25038633, 0.35190925, 0.50077266],
[0.35409974, 0. , 0.35409974, 0.35409974, 0.35409974,
0.49767483, 0.35409974, 0. , 0.35409974]])
It returns a 2-D matrix where the rows represent the sentences and the columns represent the vocabulary.
>>> vectorizer.vocabulary_
{'the': 8,
'quick': 7,
'brown': 0,
'fox': 2,
'jumps': 3,
'over': 6,
'lazy': 4,
'dog': 1,
'mr': 5}
So when K-means computes the distance/similarity between two documents, it is comparing two rows of this matrix. E.g. assuming the similarity is just the dot product between two rows:
import numpy as np
vector1 = X.toarray()[0]
vector2 = X.toarray()[1]
float(np.dot(vector1, vector2))
[out]:
0.7092938737640962
Chris Potts has a nice tutorial on how vector space models like the TF-IDF one are created: http://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf
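To tie this back to K-means: scikit-learn's KMeans works with Euclidean distance between rows, and since TfidfVectorizer L2-normalizes each row by default, squared Euclidean distance and the dot product are related by d^2 = 2 - 2 * dot. The sketch below checks that on the two sentences above; n_clusters=2 and the random_state value are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
]
X = TfidfVectorizer().fit_transform(corpus)

# Each document is one row: a point in a 9-dimensional space, not an (x, y) pair.
rows = X.toarray()
dot = float(rows[0] @ rows[1])
dist = float(np.linalg.norm(rows[0] - rows[1]))

# Rows have unit length (norm='l2' is the default), so d^2 = 2 - 2 * dot.
print(np.isclose(dist ** 2, 2 - 2 * dot))

# KMeans consumes the same rows directly; with two documents and two
# clusters, each document ends up in its own cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

So the "(x, y)" intuition carries over unchanged; there are just as many axes as there are vocabulary terms.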
I am not able to build the vocabulary and am getting an error:
TypeError: 'int' object is not iterable
Here is my code, which is based on this Medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried passing both a pandas Series and a list to the build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But it's unclear whether what you've put into the dataframe, pulled back out via bracket access, and then passed through train_test_split() is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
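The check suggested above can be sketched as a small helper. In the question's code, tags=q["tindex"] passes a bare integer, and iterating over an integer raises exactly this kind of TypeError, so that is a plausible culprit. The helper name describe_problem is hypothetical, and the namedtuple below is only a stand-in with the same two fields as gensim's TaggedDocument so that the sketch runs without gensim; with gensim installed you would pass the real objects.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument, so the
# sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def describe_problem(doc):
    """Return a description of what is wrong with doc, or None if it looks fine."""
    if not isinstance(doc.words, list) or not all(isinstance(w, str) for w in doc.words):
        return "words should be a list of strings"
    if not isinstance(doc.tags, (list, tuple)):
        return "tags should be a list, got %s" % type(doc.tags).__name__
    return None


bad = TaggedDocument(["a", "movie", "plot"], tags=1)     # bare int tag
good = TaggedDocument(["a", "movie", "plot"], tags=[1])  # tag wrapped in a list

print(describe_problem(bad))   # tags should be a list, got int
print(describe_problem(good))  # None
```

If the check flags the tags, wrapping the value in a list when building each TaggedDocument, i.e. tags=[q["tindex"]], is the usual remedy.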
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.
I want to perform text classification using word2vec.
I got vectors of words.
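import numpy as np
from gensim.models import Word2Vec
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())
# `size` is the gensim < 4.0 name; gensim 4.x renamed it to `vector_size`
model = Word2Vec(ls, min_count=1, size=4)
# gensim 4.x replaced `wv.vocab` with `wv.key_to_index`
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())
data = np.array(vectors)
data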
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can I perform classification (product vs. non-product)?
You already have the array of word vectors in model.wv.syn0 (called model.wv.vectors in newer gensim releases). If you print it, you will see an array in which each row is the vector of one word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())
train = []
#getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)
# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to train the vector model. In this example we will then fit a LogisticRegression on its vectors.
# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)
# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)
Y_dataset = []
# take the last character of each line; in this case it is the department number
# this will be the 0 or 1, or another kind of class label (this approach only works for numeric labels; word labels need to be extracted differently)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            result = int(lastchar)
            Y_dataset.append(result)
        else:
            result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid that the length of the document influences what this vector represents. You could for example choose the mean.
def document_vector(array_of_word_vectors):
    return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play around a bit with distances (cosine distance, for example, would be a nice first choice) and see how far certain documents are from each other or, and that's probably the approach that brings faster results, you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.
The document vectors will become your matrix X and your vector y is an array of 1 and 0, depending on the binary category that you want the documents to be classified into.
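Put together, the approach might look like the following sketch. The word vectors, documents, and labels here are all made up for illustration; in practice the vectors would come from the trained Word2Vec model, and this document_vector variant takes tokens rather than an array so that unknown words can be skipped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-dimensional word vectors; in practice these would come
# from the trained Word2Vec model.
word_vectors = {
    "phone": np.array([1.0, 0.1]),
    "laptop": np.array([0.9, 0.2]),
    "charger": np.array([0.8, 0.0]),
    "weather": np.array([0.0, 1.0]),
    "holiday": np.array([0.1, 0.9]),
    "sunny": np.array([0.2, 0.8]),
}


def document_vector(tokens):
    """Average the vectors of the known tokens in one document."""
    return np.array([word_vectors[t] for t in tokens if t in word_vectors]).mean(axis=0)


docs = [["phone", "charger"], ["laptop", "phone"], ["weather", "sunny"], ["holiday", "weather"]]
y = np.array([1, 1, 0, 0])  # 1 = product, 0 = non-product

# One row per document, one column per embedding dimension.
X = np.vstack([document_vector(d) for d in docs])
clf = LogisticRegression().fit(X, y)

new_doc = document_vector(["laptop", "charger"])
print(clf.predict(new_doc.reshape(1, -1)))
```

A document with no known words would produce an empty average, so real code should handle that case explicitly, for example by skipping such documents.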
I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and predict have a different amount of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness I will answer my own question with an answer from here, which doesn't answer that question but does answer mine:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english", decode_error="ignore", ngram_range=(1, 3))  # charset_error was renamed decode_error
vec = vectorizer.fit(list1) # train vec using list1
vectorized = vec.transform(list1) # transform list1 using vec
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)  # precompute_distances and n_jobs no longer exist in recent scikit-learn
km.fit(vectorized)
list2Vec = vec.transform(list2) # transform list2 using vec
km.predict(list2Vec)
The credit goes to @IrshadBhat
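Applied to the data from the question, the same idea looks roughly like this. It is a sketch only: the document list is shortened to four of the original eight entries, and n_init and random_state are arbitrary values chosen for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents_lst = [
    'a small, narrow river',
    'a continuous flow of liquid, air, or gas',
    'a group in which schoolchildren of the same age and ability are taught',
    'a natural body of running water flowing on or under the earth',
]
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Fit the vectorizer once, on the training documents only.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tfidf_matrix)

# Reuse the fitted vectorizer: transform(), not fit_transform().
new_matrix = tfidf_vectorizer.transform(predict_doc)
print(new_matrix.shape[1] == tfidf_matrix.shape[1])
print(km.predict(new_matrix))
```

The ValueError in the question came from fitting a second vectorizer on the new string, which produced a different, smaller vocabulary.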
I'm a total novice with scikit-learn.
I want to know whether I should use the same LabelEncoder instance that was used on the training dataset when I convert the same feature's categorical data in the test dataset. That is, something like the following:
from sklearn import preprocessing
# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_training[ 'BLOOD_TYPE' ] ) # labeling from string
....
1. Using same label encoder
df_test[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
2. Using different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test[ 'BLOOD_TYPE' ] = le_for_test_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
Which one is the right code?
Or does it make no difference which of the above I choose,
because the categorical values in the training dataset and in the test dataset should end up encoded the same way?
The problem is in fact the way you use it.
Because LabelEncoder maps each nominal value to an integer, you should fit it once and then only call transform on the fitted object. Don't forget that every nominal value has to be present during the fitting phase.
A good way to use it is to take your nominal feature, call fit on it once, and from then on use only the transform method.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
(From the official documentation.)
I think RPresle has already given the answer. I just wanted to put it a little more directly in terms of the situation in the question:
In general, you fit the LabelEncoder once (on the feature in the training set) and then use it to transform the feature in the testing set. But if your testing set has feature values that are not in the training set, fit the label encoder on the union of the training and testing values.
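Both cases can be sketched as follows. The blood-type values are made up and stand in for the df_training and df_test columns from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up blood-type columns standing in for df_training / df_test.
train_values = np.array(['A', 'B', 'O', 'A'])
test_values = np.array(['O', 'A', 'AB'])  # 'AB' never appears in training

# Usual case: fit on training data, then only transform elsewhere.
le = LabelEncoder().fit(train_values)
train_codes = le.transform(train_values)

# transform() raises ValueError for a label the encoder has not seen.
try:
    le.transform(test_values)
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

# Workaround from the answer: fit on the union of both sets of values.
le_union = LabelEncoder().fit(np.concatenate([train_values, test_values]))
test_codes = le_union.transform(test_values)
print(list(le_union.classes_), list(test_codes))
```

Calling fit_transform again on the test set, as in both options in the question, would re-learn the mapping from the test values alone, so the same category could receive a different integer than it did in training.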