I am facing a weird issue. I use NaiveBayesClassifier from nltk.classify to categorize text, and my problem is that it reports an implausibly high accuracy of 0.9966. I am sure this cannot be real, yet I see no errors in my code. My input is huge: 40,000 sentences used for training and 80,000 used for testing.
I am building a set of train features made up of all the negative/positive/neutral labeled training text
trainFeats = negFeats + posFeats + neutralFeats
and a set of test features made up of all the negative/positive/neutral labeled training text
testFeats = negFeats + posFeats + neutralFeats
Afterwards I train the classifier on the trainFeats
classifier = NaiveBayesClassifier.train(trainFeats)
and test it on all the testFeats
print 'accuracy:', nltk.classify.util.accuracy(classifier, testFeats)
Is this a normal result, or should I not take it at face value? It performs strikingly well. Thanks!
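One way to sanity-check a result like this (a minimal sketch, assuming trainFeats and testFeats are lists of (feature_dict, label) pairs as NLTK expects) is to compare the accuracy against the majority-class baseline and to check how much of the test set also appears in training:
from collections import Counter

# Majority-class baseline: if one label dominates, high accuracy is cheap to achieve.
label_counts = Counter(label for _, label in testFeats)
majority_label, majority_count = label_counts.most_common(1)[0]
print('majority-class baseline: %.4f' % (majority_count / float(len(testFeats))))

# Train/test overlap: accuracy is inflated if test sentences also appear in training.
# Assumes the feature dicts have hashable values (e.g. booleans or strings).
train_keys = set(tuple(sorted(feats.items())) for feats, _ in trainFeats)
overlap = sum(1 for feats, _ in testFeats if tuple(sorted(feats.items())) in train_keys)
print('test items also present in training: %d of %d' % (overlap, len(testFeats)))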
I am a bit confused and hope someone can help me.
I am currently experimenting with supervised learning, and I think I have a basic misunderstanding about the input and output of LSTMs.
Say I have a sequence of 10 observations,
which I split into train = 1,2,3,4,5,6,7,8
and test = 9,10.
And I transform it into a supervised problem like:
Xtrain = [(1,2), (2,3), (3,4), (4,5), (5,6)]
Ytrain = [(3,4), (4,5), (5,6), (6,7), (7,8)]
and
Xtest = [(7,8)]
So the model is made to predict the next two observations from the previous two.
prediction <- predict(Xtest)
Is this a legal train/test split? Am I correct that I can then evaluate the prediction output from Xtest against the actual test set containing [(9,10)]?
Or should I stop training at Xtrain = [(4,5)] and Ytrain = [(6,7)] to leave some space between training and testing, since the last observations of the training targets in my example are used as the prediction input?
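For reference, here is a minimal sketch of the framing described above (the helper make_windows is a hypothetical name, not from any particular library), which produces exactly the Xtrain/Ytrain/Xtest pairs listed:
import numpy as np

def make_windows(series, n_in=2, n_out=2):
    # Frame a 1-D series into (X, y) pairs: n_in past steps -> n_out future steps.
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

series = list(range(1, 11))           # observations 1..10
train, test = series[:8], series[8:]  # train = 1..8, test = 9,10

X_train, y_train = make_windows(train)  # X: (1,2)...(5,6), y: (3,4)...(7,8)
X_test = np.array([train[-2:]])         # (7,8) -> the model should predict (9,10)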
I have been trying to follow Francois Chollet's example of a binary image classifier for cats and dogs. I have attempted to apply his approach to another, similar Kaggle dataset (https://www.kaggle.com/playlist/men-women-classification), and I want to achieve the following:
Visualise the predictions that are wrong
Produce a classification report
I already have a model with around 85% accuracy on the validation set, but I want to know roughly what kind of images my model gets wrong, as well as produce a classification report with sklearn.metrics' classification_report.
However, I do not know how the image generator works, and I am having trouble figuring out how to pair the predictions with the labels of the test images.
from sklearn.metrics import classification_report
new_test_datagen = test_datagen.flow_from_directory(
    directory=test_dir,
    target_size=(150, 150),
    batch_size=1,
    class_mode='binary',
    seed=42,
)
train_image = new_train_generator.next()
plt.imshow(train_image[0].reshape(150,150,-1))
print(train_image[1])
#I want to output images but I am not sure if this is the most efficient way of doing it
predictions = model.predict(test_generator)
predictions.shape
# predictions is a numpy array of length 476, but I do not know the 'correct' labels of my test set to validate this output against.
model.evaluate(test_generator)
# [0.3109202980995178, 0.8886554837226868]
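A possible way to pair predictions with ground-truth labels (a sketch only: eval_generator is a hypothetical name, the key assumption is shuffle=False so that the generator's order matches .classes and .filenames, and 0.5 is assumed as the binary decision threshold):
import numpy as np
from sklearn.metrics import classification_report

eval_generator = test_datagen.flow_from_directory(
    directory=test_dir,
    target_size=(150, 150),
    batch_size=1,
    class_mode='binary',
    shuffle=False,  # keeps prediction order aligned with .classes and .filenames
)

probs = model.predict(eval_generator).ravel()  # one probability per image
pred_labels = (probs > 0.5).astype(int)        # threshold assumed at 0.5
true_labels = eval_generator.classes           # ground-truth labels in the same order

print(classification_report(true_labels, pred_labels,
                            target_names=list(eval_generator.class_indices.keys())))

# Filenames the model got wrong, for visual inspection
wrong = np.where(pred_labels != true_labels)[0]
print([eval_generator.filenames[i] for i in wrong[:10]])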
I have tried different approaches, such as MultinomialNB, SVM, MLPClassifier, a CNN, and an LSTM network, to train on a dataset that consists of tweets and labels (the Big Five classes: openness, conscientiousness, extraversion, agreeableness, neuroticism). But the accuracy stays at around 60%, even after using word2vec, NRC features, and MRC features. Is there something I can do to improve the accuracy?
Would you please add a few more details about the dataset you are using?
For example I would add:
Dataset size (number of samples)
Class distribution (are the classes balanced or not)
Do you do any preprocessing?
Without the above information I would just be guessing, but if I were you I would try the following:
clean the tweets of noise, e.g. usernames, garbage symbols, etc.
If the dataset is small
try a random search over simple models (Naive Bayes, SVM, logistic regression) with various vectorization strategies, e.g. bag of words or tf-idf, and do a hyper-parameter search (see the sketch after this list)
try applying transfer learning from a model trained on tweets, for example for sentiment analysis.
If the dataset is large enough
try neural network approach
Embedding (GloVe, word2vec, fastText) + RNN (LSTM, GRU) + Attention
try training your own embeddings
or use pretrained ones
Embedding + CNN + RNN
Bag of words + FNN
If classes are not balanced
use weighted loss
try to balance them
try stacking multiple models (ensemble)
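As a rough illustration of the random search mentioned above (a sketch only: tweets and labels are placeholders for your cleaned texts and Big Five labels, and logistic regression stands in for any of the simple models):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Search space over both vectorization and classifier hyper-parameters
param_distributions = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10],
}

search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=20, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(tweets, labels)  # tweets: list of strings, labels: list of class names
print(search.best_score_, search.best_params_)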
Hope it helps!
Is the main premise of your project to do personality detection? If not, I would recommend using the Google Sentiment API to calculate sentiment of Twitter data.
I'm building a model for text classification with nltk and sklearn, and training it on the 20newsgroups dataset from sklearn (each document is approximately 130 words).
My preprocessing includes removing stopwords and lemmatizing tokens.
Next, in my pipeline I pass it to TfidfVectorizer() and want to tune some of the vectorizer's input parameters to improve accuracy. I've read that adding n-grams (generally with small n) improves accuracy, but when I classify the vectorizer's output with MultinomialNB(), using ngram_range=(1,2) or ngram_range=(1,3) in the tf-idf step actually worsens the accuracy. Can someone help explain why?
EDIT:
Here's a sample datum as requested, with the code I used to fetch it and strip the header:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', remove="headers")
#example of data text (no header)
print(news.data[0])
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!
This is my pipeline and the code I run to train the model and print the accuracy:
test1_pipeline = Pipeline([('clean', clean()),
                           ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                           ('classifier', MultinomialNB())])
train(test1_pipeline, news_group_train.data, news_group_train.target)
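One way to see the effect directly (a minimal sketch that leaves out the custom clean() step and evaluates on the standard train/test split of 20newsgroups) is to compare several ngram_range settings side by side:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

train_data = fetch_20newsgroups(subset='train', remove=('headers',))
test_data = fetch_20newsgroups(subset='test', remove=('headers',))

# Fit the same pipeline with increasing n-gram ranges and compare test accuracy
for ngrams in [(1, 1), (1, 2), (1, 3)]:
    pipe = Pipeline([
        ('vectorizer', TfidfVectorizer(ngram_range=ngrams)),
        ('classifier', MultinomialNB()),
    ])
    pipe.fit(train_data.data, train_data.target)
    preds = pipe.predict(test_data.data)
    print(ngrams, accuracy_score(test_data.target, preds))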
I'm running the example code from the spark docs for logistic regression using pyspark and the attendant training summary code:
from pyspark.ml.classification import LogisticRegression
# Load training data
training = spark.read.format("libsvm").load("/user/tim/sample_svm/sample_libsvm_data.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
# Fit the model
mlrModel = mlr.fit(training)
# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))
# Extract the summary from the returned LogisticRegressionModel instance trained
# in the earlier example
trainingSummary = lrModel.summary
# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)
# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show(500)
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
and get:
areaUnderROC: 1.0
which I wouldn't expect. Perhaps it overfit and simply memorized the data, but I've done a train/test split, even randomized the labels, and tweaked all the hyperparameters, and they all led to the same thing: AUC = 1.0. I tried the sample code for the SVC models, which uses the same dataset, and I get the same thing.
I'd normally post the code, but I literally ran the example code, changing only the path to the data file. I've searched and searched and can find no example of anyone having run this example and examined the results. What's odd is that this dataset, sample_libsvm_data.txt, is used throughout the docs, yet I can find neither an analysis of it nor even an explanation of what the data actually is.
As a result, I've switched to using the RDD-based API of MLlib because I can't make sense of the results of the sample code. I hope someone can tell me what I'm doing wrong.
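For what it's worth, here is a quick sketch of how I could at least summarise the file myself (reusing the same load call and path as above) to see the label balance and feature dimensionality:
training = spark.read.format("libsvm").load("/user/tim/sample_svm/sample_libsvm_data.txt")
training.groupBy("label").count().show()                  # label balance
print(training.count(), training.first().features.size)   # row count and feature dimensionality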
EDIT:
As requested, here's the entire datafile.
I've been using pyspark with dataframes for quite some time, and I've had positive experiences using dataframes over RDDs, as they resemble pandas dataframes. So I don't consider switching to the RDD-based API a good solution to your problem.
The hypothesis that it "simply memorized the data" is not a valid explanation for your problem. Check it yourself: simply change the names of the objects and variables and you will see the same output.
Looking at the specific piece of code you posted, you say you got AUC_ROC = 1.0. My first intuition is that this is because you are analysing the summary statistics of the training set, not the test set. You are most likely correct that the model overfits the training set; however, with a test set I doubt you would get the same results. So I went and evaluated it myself:
result = lrModel.transform(training)
result.prediction
result.show()
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
AUC_ROC = evaluator.evaluate(result,{evaluator.metricName: "areaUnderROC"})
print('AUC ROC:' + str(AUC_ROC))
Final result:
AUC ROC:1.0
In other words, you are correct that the model overfits... but these are the results for the training set, and the evaluation is working correctly: the AUC really is 1.0 for the piece of code you provided.
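To check this on data the model has not seen, a minimal sketch (assuming the same training DataFrame and LogisticRegression settings as in your code) would hold out part of the data with randomSplit and evaluate only on the held-out portion:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hold out 20% of the rows so the AUC is measured on unseen data
train_df, test_df = training.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(train_df)

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")
print("test areaUnderROC: " + str(evaluator.evaluate(model.transform(test_df))))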
Bottom line: it is a bad example to include in the Spark documentation. Try this code on other datasets.
Checking the API documentation, here is another example, but this time with the expected results... sadly 1.0 again... a really bad choice of examples from Spark, I must admit: https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression