The method below is the implementation of the predict function in an email spam detection project.
# Predict the class for a given row (mail)
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        # print(classValue, '->', probability)
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel
I am unable to understand how a particular data item is classified as spam or ham by the above function.
The calculateClassProbabilities function is doing all the real work (it probably has a spam-or-not score for every word in the dictionary, and sums the scores for the email's vocabulary). That function returns a mapping from each possible category ("spam", "legit", "mass email you did sign up for but don't actually want to read") to its associated probability; since the code calls .items() on it, it is a dict rather than a list. The loop here just finds the category with the highest probability and returns it.
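To make the loop concrete, here is a minimal sketch in which a hand-written dict stands in for the output of calculateClassProbabilities. The numbers and the helper name pick_best_label are made up for illustration; only the loop logic comes from the question.

```python
def pick_best_label(probabilities):
    """Return the key with the largest value, like the loop in predict()."""
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label


# Hypothetical scores for one email; real values would come from
# calculateClassProbabilities(summaries, inputVector).
example = {"spam": 0.07, "ham": 0.02}

print(pick_best_label(example))
# The same selection in one line with the built-in max():
print(max(example, key=example.get))
```

So the email is labelled with whichever class received the larger score; everything that decides those scores happens inside calculateClassProbabilities.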
Related
I'm classifying news text by type (sports, politics, business, entertainment) using logistic regression. Text that doesn't belong to any of these categories is still predicted as one of them. How can I prevent this? And how can I assign out-of-category text to an other_category label?
The predict method gives you the prediction with the highest probability. The predict_proba method gives you the probability score for each category. So you can use the max() function to get the maximum probability, then use an if statement: if the probability is greater than the required value, print the prediction; otherwise print "others".
If that isn't clear, look at the sample code.
model.fit(text, tags)
textToBeClassified = ["Google's shares are going down"]  # must be a list, since that is what predict expects; add more texts separated by commas to classify several at once.
prediction = model.predict(textToBeClassified)  # returns the tag: politics, sports, business, etc.
predictionConfidence = model.predict_proba(textToBeClassified)  # returns an array with one row of confidence scores (probabilities) per input.
maxConfidence = max(predictionConfidence[0])  # only one text is being classified, so take the maximum of the first row.
if maxConfidence > 0.8:  # output only if the model is more than 80% confident; change 0.8 to 0.9 to require 90% confidence.
    print(prediction)
else:
    print("Sorry the text doesn't fall under any of the categories")
Try adding print statements here and there so you know what is happening:
model.fit(text, tags)
textToBeClassified = ["Google's shares are going down"]
prediction = model.predict(textToBeClassified)
print("Predicted as:", prediction)
predictionConfidence = model.predict_proba(textToBeClassified)
print("The confidence scores:", predictionConfidence)
maxConfidence = max(predictionConfidence[0])
print("maximum confidence score is:", maxConfidence)
if maxConfidence > 0.8:
    print(prediction)
else:
    print("Sorry the text doesn't fall under any of the categories")
Like this :)
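For several texts at once, the same threshold idea can be sketched with NumPy. The function name predict_with_fallback, the "other_category" label, and the probability values below are assumptions made for illustration; with a fitted scikit-learn classifier you would pass model.predict_proba(texts) and model.classes_ instead of the toy arrays.

```python
import numpy as np


def predict_with_fallback(proba, classes, threshold=0.8, fallback="other_category"):
    """Pick the most probable class per row, or `fallback` if it is not confident enough."""
    proba = np.asarray(proba)
    best_idx = proba.argmax(axis=1)   # column index of the top class in each row
    best_prob = proba.max(axis=1)     # probability of that top class
    labels = np.asarray(classes, dtype=object)[best_idx]
    # Same rule as the if/else above: keep the label only when prob > threshold.
    labels[best_prob <= threshold] = fallback
    return labels.tolist()


# Made-up probabilities for two texts over four classes.
classes = ["business", "entertainment", "politics", "sports"]
proba = [
    [0.91, 0.03, 0.04, 0.02],  # confident, keeps "business"
    [0.30, 0.25, 0.25, 0.20],  # not confident, falls back
]
print(predict_with_fallback(proba, classes))
```

The threshold is a tuning knob, not a guarantee: a model can be confidently wrong on out-of-category text, so it is worth checking the value against some held-out examples.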
Currently, I am trying to predict the top 5 or 10 subjects for a statistics exercise based on the exercise's description. The subjects and exercises (with the ground truth label, as an integer) are provided in CSV format. The ground truth label is also present in the subjects' CSV, where it is called "id".
My current model produces a tuple for every exercise: the first element is the ground truth label, and the second element is a list of the predicted labels.
Then my question: how do I compute (Accuracy,) Precision, Recall, and F1 (and, if possible, also MRR and MAR)?
Also, all exercises and subjects are converted to vectors. Furthermore, I calculate accuracy by counting all instances for which the ground truth is present in the top 5/10, and dividing this by the total number of exercises.
*note: in the code exercise = question, and subject = kc
My variables are as follows:
question_data = df[['all_text_clean', 'all_text_as_vector', 'groud_truth_id']].values
kc_data = subject_df[['id', 'all_text_as_vector']].values
Then, I loop over every exercise-subject pair:
from operator import itemgetter
from scipy.spatial import distance  # assuming `distance` is scipy.spatial.distance

question_candidates = []
for qtext, qvec, gt_id in question_data:
    scores = []
    for kc_id, kc_vec in kc_data:
        score = distance.cosine(qvec, kc_vec)  # cosine distance (1 - cosine similarity); smaller means more similar
        scores.append((kc_id, score))  # store kc_id with its distance
    scores = sorted(scores, key=itemgetter(1))  # sort ascending, so the closest subjects come first
    candidates = [kc_id for kc_id, score in scores][:5]  # only the id is relevant; these are the suggestions
    question_candidates.append((gt_id, candidates))
Accuracy is moderate: around 0.59. I don't expect anything higher since this is just a baseline model.
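One possible way to get ranking metrics out of tuples shaped like question_candidates is sketched below. It assumes exactly one correct subject per exercise, so recall@k coincides with the top-k hit rate described above and precision@k is that hit rate divided by k. The function name ranking_metrics and the toy data are made up for illustration.

```python
def ranking_metrics(question_candidates, k=5):
    """Compute hit rate, precision@k and MRR for (gt_id, ranked_candidates) tuples."""
    n = len(question_candidates)
    if n == 0:
        raise ValueError("need at least one exercise")
    hits = 0
    reciprocal_ranks = []
    for gt_id, candidates in question_candidates:
        top_k = list(candidates)[:k]
        if gt_id in top_k:
            hits += 1
            # Rank is 1-based: first position gives 1.0, second gives 0.5, ...
            reciprocal_ranks.append(1.0 / (top_k.index(gt_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate": hits / n,              # equals recall@k with one relevant label
        "precision_at_k": hits / (n * k),  # at most one of the k suggestions is correct
        "mrr": sum(reciprocal_ranks) / n,
    }


# Toy data: (ground truth id, ranked predicted ids)
toy = [
    (3, [3, 7, 1, 9, 4]),  # hit at rank 1
    (5, [2, 5, 8, 1, 0]),  # hit at rank 2
    (6, [1, 2, 3, 4, 5]),  # miss
    (8, [4, 1, 0, 8, 2]),  # hit at rank 4
]
print(ranking_metrics(toy, k=5))
```

MRR is often the more informative number for a ranked list of suggestions, because it rewards placing the correct subject near the top rather than anywhere in the top k.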
I am reading up on TF-IDF so that I can filter out common words from my corpus. It appears to me that you get a TF-IDF score for each (word, document) pair.
Which score do you pay attention to? Do you combine the scores across all documents for a word?
TF-IDF example:
from sklearn.feature_extraction.text import TfidfVectorizer

doc1 = "This is doc1"
doc2 = "This is a different document"
corpus = [doc1, doc2]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
X.toarray()
# returns:
# array([[0.        , 0.70490949, 0.        , 0.50154891, 0.50154891],
#        [0.57615236, 0.        , 0.57615236, 0.40993715, 0.40993715]])
vec.get_feature_names_out()  # column labels; older scikit-learn versions used get_feature_names()
So you have one row (a 1-D array) for each doc in the corpus, and each row has a length equal to the total vocabulary of your corpus (so it can get quite sparse). Which score you pay attention to depends on what you're doing. To find the most important word in a doc, look for the highest TF-IDF in that doc's row. To find the most important word in the corpus, look at the entire array. If you're trying to identify stop words, you could consider taking the X words with the lowest TF-IDF scores. However, I wouldn't really recommend using TF-IDF to find stop words in the first place: it lowers the weight of stop words, but they still occur frequently, which could offset the weight loss. You'd probably be better off finding the most common words and then filtering them out. Either way, you'd want to review the resulting set manually.
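The "find the most common words" suggestion can be sketched without TF-IDF at all, by counting in how many documents each word appears. The function name common_words, the 0.8 cutoff, and the simple regex tokenizer are assumptions for illustration, not part of any library.

```python
import re
from collections import Counter


def common_words(corpus, min_doc_fraction=0.8):
    """Return words that appear in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in corpus:
        # Count each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z0-9]+", doc.lower())))
    cutoff = min_doc_fraction * len(corpus)
    return sorted(word for word, count in doc_freq.items() if count >= cutoff)


corpus = ["This is doc1", "This is a different document"]
candidates = common_words(corpus)
print(candidates)  # review this list by hand before using it as a stop list
```

On a real corpus the cutoff needs tuning, and the resulting list can be passed to a vectorizer's stop word option once it has been checked manually.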
With the script below, I find the highest probability and its corresponding category in a multi-class text classification problem. How do I find the top 3 predicted probabilities and their corresponding categories efficiently, without using loops?
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order = np.argsort(probabilities, axis=1)
classification = classifier.classes_[order[:, -1:]]
print(accuracy_score(classification,y_test))
Thanks in advance.
(I have around 50 categories. I want to extract the 3 most relevant categories out of those 50 for each of my narrations and display them in a dataframe.)
You've done most of the hard work here; you're just missing a bit of NumPy-fu to finish it off. Your line
order = np.argsort(probabilities, axis=1)
contains the indices of the sorted probabilities, so [[lowest_prob_class_1, ..., highest_prob_class_1], ...] for each of your samples. You have already used this to get your classification with order[:, -1:], i.e. the index of the highest-probability class. So to get the top three classes (in ascending order of probability, best last) we can just make a simple change:
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities we can use:
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
                                    order[:, -3:].flatten()].reshape(order.shape[0], 3)
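An alternative sketch of the same lookup uses np.take_along_axis, and reverses the sorted indices so the best class comes first. The class names and probabilities below are made up; with a fitted classifier you would use classifier.classes_ and classifier.predict_proba(X_test) instead.

```python
import numpy as np

# Made-up probabilities for two samples over four classes.
classes = np.array(["business", "politics", "sports", "tech"])
probabilities = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.40, 0.05, 0.20, 0.35],
])

# Indices of the three largest probabilities per row, best first.
top_3_idx = np.argsort(probabilities, axis=1)[:, ::-1][:, :3]
top_3_classes = classes[top_3_idx]
# Pick the matching probabilities row by row, without an explicit loop.
top_3_probabilities = np.take_along_axis(probabilities, top_3_idx, axis=1)

print(top_3_classes)
print(top_3_probabilities)
```

Both arrays have shape (n_samples, 3), so they can be placed side by side as columns of a dataframe.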
I am using the RandomForestClassifier from pyspark.ml.classification
I run the model on a binary class dataset and display the probabilities.
I have the following in the probability column:
+-----+----------+---------------------------------------+
|label|prediction|probability |
+-----+----------+---------------------------------------+
|0.0 |0.0 |[0.9005918461098429,0.0994081538901571]|
|1.0 |1.0 |[0.6051335859900139,0.3948664140099861]|
+-----+----------+---------------------------------------+
Each row has a list of 2 elements, which obviously correspond to class probabilities.
My question: does probability[0] always correspond to the value of prediction? The Spark documentation is not clear on this.
I am interpreting your question as asking: does the first element in the array under the column 'probability' always correspond to the "predicted class", by which you mean the label the Random Forest Classifier predicted the observation should have?
If I have that correct, the answer is Yes.
The items in the arrays in both probability rows can be read as the model telling you:
['My confidence that the predicted label = the true label',
'My confidence that the label != the true label']
In the case of multiple labels being predicted, then you would have the model telling you:
['My confidence that the label I predict = specific label 1',
'My confidence that the label I predict = specific label 2',
...'My confidence that the label I predict = specific label N']
This is indexed by the N labels you are trying to predict (which means you have to be careful about the way the labels are structured).
Perhaps it would help to take a look at this answer. You could do something like:
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
predictions.show(10)  # show() prints the rows itself, so no print is needed
(Using the relevant pipeline and data from your examples.)
This will show you the probabilities for each class.
I posted almost the same question here, and I think the answer might help you:
Scala: how to know which probability correspond to which class?
The answer lies in what happens before the model is fit.
To fit the model we use a labelIndexer on the target. This label indexer transforms the target into an index ordered by descending frequency.
e.g. if my target has 20% "aa" and 80% "bb", the label indexer will create a column "label" that takes the value 0 for "bb" and 1 for "aa" (because "bb" is more frequent than "aa").
When we fit a random forest, the probabilities follow this frequency order.
In binary classification:
first proba = probability that the class is the most frequent class in the train set
second proba = probability that the class is the least frequent class in the train set
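The frequency-based indexing described above can be imitated in plain Python to see which position of the probability vector belongs to which original label. This is only an illustration of the ordering rule, not Spark's own code; the function name frequency_index, the alphabetical tie-breaking, and the probability values are assumptions.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index: 0 for the most frequent, 1 for the next, and so on."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; that tie rule is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


target = ["bb"] * 8 + ["aa"] * 2  # 80% "bb", 20% "aa"
mapping = frequency_index(target)
print(mapping)

# A probability vector is then read position by position using that mapping.
probability = [0.9, 0.1]  # made-up values
by_label = {label: probability[index] for label, index in mapping.items()}
print(by_label)
```

In a real pipeline the fitted indexer model exposes the ordered label list, which is the safer way to recover the mapping than recomputing frequencies by hand.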