I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
# 'analyzer' is assumed to be a VADER SentimentIntensityAnalyzer
# (e.g. from vaderSentiment.vaderSentiment or nltk.sentiment.vader)
def get_sentiment(s):
    # map the compound VADER score to 1 (positive), -1 (negative), or 0 (neutral)
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    else:
        return 0

df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity's sake, the data has already been labeled as either class '0' or '1', but I am training the model to classify new instances that have not been labeled yet. In short, the data I'm working with is already labeled; the labels are in the classification column.
Then for my train/test split I do the following:
msg_train, msg_test, label_train, label_test = train_test_split(
    df.drop('classification', axis=1), df['classification'],
    test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    # remove punctuation, lowercase and drop stopwords, then stem each remaining word
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Where stopwords_set is the collection of English stopwords from the nltk library, and ps is a PorterStemmer instance from the nltk library (for word stemming).
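For reference, a minimal sketch of how those helpers are presumably defined (the names stopwords_set and ps come from the question; the nltk calls are the standard ones):
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# helpers used by clean_review()
stopwords_set = set(stopwords.words('english'))
ps = PorterStemmer()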
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfidfTransformer applied to the text, but also retains the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what the error is due to, since I don't know what size your dataframe should be. I would need more information: on which line is the error thrown?
Regarding the fact that you want to retain the sentiment column: you could apply CountVectorizer and TfidfTransformer (by the way, you could skip a step and directly apply TfidfVectorizer) only to the text data, and then have another transformer in the pipeline which adds back the original sentiment column before you feed the dataframe to the classifier.
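For example, a minimal sketch of that idea using ColumnTransformer (not part of the original answer; it assumes the 'review' and 'sentiment' columns from the question, and note that MultinomialNB requires non-negative features, so the -1/0/1 sentiment would need to be shifted or a different classifier used):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF on the raw review text, pass the sentiment column through unchanged
features = ColumnTransformer([
    ('tfidf', TfidfVectorizer(analyzer=clean_review), 'review'),
    ('sentiment', 'passthrough', ['sentiment']),
])

pipeline2 = Pipeline([
    ('features', features),
    ('classifier', MultinomialNB()),  # needs non-negative inputs, e.g. map sentiment to {0, 1, 2}
])

pipeline2.fit(msg_train, label_train)
print(pipeline2.score(msg_test, label_test))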
I want to perform text classification using word2vec.
I got vectors of words.
from gensim.models import Word2Vec
import numpy as np

# 'lines' holds the raw text (defined earlier in my code)
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)
print(words)

vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())

data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can I perform classification (product & non-product)?
You already have the array of word vectors in model.wv.syn0. If you print it, you can see an array with the corresponding vector of each word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression

# Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
# getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)

# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model; in this example we will fit a LogisticRegression.
# method 1 - pass tokens to the Word2Vec class itself so you don't need to call train() again
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - create a Word2Vec object and build the vocabulary for training
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....

# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# get the last number of each line. In this case it is the department number,
# which will be the 0 or 1, or another kind of classification
# (to use word labels you would need to extract them differently; this way is for numbers)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            result = int(lastchar)
        else:
            result = 40  # fallback label for lines that do not end in a digit
        Y_dataset.append(result)

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(
    model.wv.syn0, Y_dataset[:max_dataset_size])

# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid having the length of the document influence what this vector represents. You could, for example, choose the mean.
def document_vector(array_of_word_vectors):
    # average the word vectors so every document maps to a vector of the same size
    return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play around a bit with distances (for example, cosine distance would be a nice first choice) and see how far certain documents are from each other, or - and that's probably the approach that brings faster results - you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.
The document vectors will become your matrix X, and your vector y is an array of 1s and 0s, depending on the binary category that you want the documents to be classified into.
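A minimal sketch of that second approach (assuming a gensim Word2Vec model like the one in the question; docs is a hypothetical list of tokenised documents and labels the matching list of 0/1 categories):
import numpy as np
from sklearn.linear_model import LogisticRegression

def document_vector(model, doc_tokens):
    # average the vectors of the words the model actually knows
    vectors = [model.wv[w] for w in doc_tokens if w in model.wv.vocab]
    return np.mean(vectors, axis=0)

# docs and labels are placeholders for your own data
X = np.array([document_vector(model, doc) for doc in docs])
y = np.array(labels)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:5]))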
I want to implement a most informative features function for binary NB in SciKit Learn. I am using Python3.
First off, I understand that the question of implementing some sort of 'informative features' function for SciKit's multinomial NB has been asked. However, I have tried the responses and have had no luck - so I think either SciKit updated, or I am doing something very wrong. I am using
tobigue's answer here for a function.
from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#Array contains a list of (headline, source) tuples where there are two sources.
#I want to classify each headline as belonging to a given source.
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1, headlines, sources)
    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print("Accuracy: {}".format(classifier.score(X_test, y_test)))

#tobigue's code:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

def main():
    scikit_naivebayes(array)

main()
#ERROR:
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'
You need to fit the CountVectorizer before calling vectorizer.get_feature_names(). In your code, you only call the other function with the class CountVectorizer, which won't lead to anything.
Independently from your pipeline, you should try to create a vectorizer with CountVectorizer, call fit on your text, and eventually use the function already provided, though you will have to further adapt it to your problem yourself.
You should easily see that the function you use needs an instantiated object, not a class. Tell me if you don't.
Edit
coef_ is an attribute only accessible on an estimator, i.e. a classifier (and not on all of them). Pipeline is an sklearn object used to combine different steps in order to feed a classifier. Typically, a bag-of-words pipeline consists of a feature extractor and a classifier (here logistic regression):
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(args)),
    ('classifier', LogisticRegression())
])
So, in your case, you should either avoid using a pipeline (which is what I recommend to begin with), or use the pipeline's get_params() method to access the classifier.
I suggest you fit_transform the text, then feed the transformed result to a logistic regression or naive Bayes classifier, and then call the function you have:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)
First try that, and if it works you'll understand better how to then use a pipeline. Note that your Pipeline should not work as you combine two feature extractors; the last step should be an estimator. If you want to stack two feature extractors, you need to look at FeatureUnion.
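For completeness, a small sketch (not from the original answer) of how you could later pull the fitted steps back out of the question's pipeline via named_steps and feed them to the same function; it reuses the imports, data and show_most_informative_features from the question, and note that newer scikit-learn replaces get_feature_names() and coef_ with get_feature_names_out() and feature_log_prob_:
# fit the question's pipeline, then fetch the fitted steps by name
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(headlines, sources)

fitted_vect = text_clf.named_steps['vect']  # the fitted CountVectorizer instance
fitted_clf = text_clf.named_steps['clf']    # the fitted MultinomialNB instance
show_most_informative_features(fitted_vect, fitted_clf)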
How can I associate my tfidf matrix with a category? For example, I have the below data set:
**ID**   **Text**                                    **Category**
1        jake loves me more than john loves me       Romance
2        july likes me more than robert loves me     Friendship
3        He likes videogames more than baseball      Interest
Once I calculate tfidf for each sentence, taking the 'Text' column as my input, how would I train the system to associate that row of the matrix with my category above, so that I can reuse it for my test data?
Using the above train dataset, when I pass a new sentence 'julie is a lovely person', I would like that sentence to be categorized into one or more of the pre-defined categories above.
I have used this link, Keep TFIDF result for predicting new content using Scikit for Python, as my starting point to solve this issue, but I was not able to understand how to map the tfidf matrix for a sentence to a category.
It looks like you already vectorised the text, i.e. already converted the text to numbers so that you can use scikit-learn's classifiers. Now the next step is to train a classifier. You can follow this link. It looks like this:
Vectorization
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(your_text)
Train classifier
from sklearn.naive_bayes import MultinomialNB
# y_train here is the list of categories corresponding to your_text
clf = MultinomialNB().fit(X_train, y_train)
Predict on new docs:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new = count_vect.transform(docs_new)
predicted = clf.predict(X_new)
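Put together for the small dataset in the question, a sketch (the sentence 'julie is a lovely person' and the category labels come from the question; with only three training rows the prediction will of course be very rough):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['jake loves me more than john loves me',
         'july likes me more than robert loves me',
         'He likes videogames more than baseball']
categories = ['Romance', 'Friendship', 'Interest']

count_vect = CountVectorizer()
X_train = count_vect.fit_transform(texts)
clf = MultinomialNB().fit(X_train, categories)

# vectorise the new sentence with the same fitted vectorizer, then predict
X_new = count_vect.transform(['julie is a lovely person'])
print(clf.predict(X_new))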
I have created a Gaussian Naive Bayes classifier on an email (spam / not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets, and then calculated the accuracy, using all the features that are present in the sklearn Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are spam or not.
For example, say I have an email. I want to feed it to my classifier and get a prediction as to whether it is spam or not. How can I achieve this? Please help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features for the training and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set (the remainder go into training)
### feature matrices changed to dense representations for compatibility with classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development -- there are
### thousands of lines of accurate and inaccurate text, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at first 200 texts when developing
        ### once everything is working, remove this line to run over the full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened file
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif (name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also, I want to know whether I can incrementally train the classifier, meaning retrain an already created model with newer data to refine the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s). A sketch of these steps follows.
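A minimal sketch of those steps, assuming the vectorizer, selector and clf objects fitted in the question's code are still in scope (new_text is the example email from the question):
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

# 2) transform with the TfidfVectorizer fitted on the training data
new_features = vectorizer.transform([new_text])

# 3) reduce dimensionality with the same fitted SelectPercentile selector
new_features = selector.transform(new_features).toarray()

# 4) predict with the trained GaussianNB classifier
pred_new = clf.predict(new_features)
print(pred_new)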
To respond to your final question about iteratively re-training your model: yes, you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, and then re-running all the steps from there: pre-processing, Tfidf vectorization, dimensionality reduction, fitting, and prediction.
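For illustration only, a hedged sketch of what such a periodic retraining script could look like; load_labeled_emails and the directory path are hypothetical placeholders, and the steps simply mirror the question's own pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def retrain(texts, labels):
    # re-run the whole pipeline on the expanded dataset: split, vectorize, select, fit
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1, random_state=42)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    selector = SelectPercentile(f_classif, percentile=5)
    selector.fit(X_train_vec, y_train)
    clf = GaussianNB().fit(selector.transform(X_train_vec).toarray(), y_train)
    print("accuracy on held-out data:", clf.score(selector.transform(X_test_vec).toarray(), y_test))
    return vectorizer, selector, clf

# hypothetical: load_labeled_emails() returns the old data plus newly labeled emails
# texts, labels = load_labeled_emails("./createdDataset/")
# vectorizer, selector, clf = retrain(texts, labels)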