Evaluation with ground truth label and list of predicted labels - python-3.x

Currently, I am trying to predict the top 5/10 subjects for a statistics exercise based on the exercise's description. The subjects and exercises (with the ground truth label, as an integer) are provided in CSV format. The ground truth label is also present in the subjects' CSV, where it is called "id".
My current model produces a tuple for every exercise: the first element is the ground truth label, and the second element is a list of the predicted labels.
My question: how do I compute (Accuracy,) Precision, Recall, and F1 (and, if possible, MRR and MAR)?
Also, all exercises and subjects are converted to vectors. Furthermore, I calculate accuracy by counting all instances for which the ground truth is present in the top 5/10, and dividing this by the total number of exercises.
*Note: in the code, exercise = question and subject = kc.
My variables are as follows:
question_data = df[['all_text_clean', 'all_text_as_vector', 'groud_truth_id'] ].values
kc_data = subject_df[['id', 'all_text_as_vector']].values
Then, I loop over every exercise-subject pair:
from operator import itemgetter
from scipy.spatial import distance
question_candidates = []
for qtext, qvec, gt_id in question_data:
    scores = []
    for kc_id, kc_vec in kc_data:
        score = distance.cosine(qvec, kc_vec)  # cosine distance (1 - cosine similarity): smaller = closer
        scores.append((kc_id, score))  # store kc_id together with its distance
    scores = sorted(scores, key=itemgetter(1))  # ascending distance, so the most similar subject comes first
    candidates = [kc_id for kc_id, _ in scores][:5]  # only the id is relevant; these are the suggestions
    question_candidates.append((gt_id, candidates))
Accuracy is moderate: around 0.59. I don't expect anything higher since this is just a baseline model.
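With exactly one ground truth label per exercise, the ranked-list metrics become fairly simple. The following is a minimal sketch, assuming a list of (gt_id, candidates) tuples like question_candidates above; the function name evaluate_top_k and the toy data are made up for illustration. On a hit, precision@k is 1/k and recall@k is 1; the reciprocal rank is 1 divided by the position of the ground truth, and 0 on a miss.

```python
def evaluate_top_k(question_candidates):
    """Average hit rate, precision@k, recall@k, F1@k and MRR over all exercises.

    Assumes every exercise has exactly one relevant label (gt_id).
    """
    n = len(question_candidates)
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "mrr": 0.0}
    for gt_id, candidates in question_candidates:
        if gt_id not in candidates:
            continue  # a miss contributes 0 to every metric
        k = len(candidates)
        rank = candidates.index(gt_id) + 1  # 1-based position of the ground truth
        precision = 1.0 / k  # one relevant item among k suggestions
        recall = 1.0         # the single relevant item was retrieved
        totals["accuracy"] += 1.0
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += 2 * precision * recall / (precision + recall)
        totals["mrr"] += 1.0 / rank
    return {name: value / n for name, value in totals.items()}


# Hypothetical toy data: (ground truth id, ranked list of predicted ids)
example = [(1, [1, 2, 3, 4, 5]), (2, [3, 2, 1, 4, 5]), (9, [1, 2, 3, 4, 5])]
metrics = evaluate_top_k(example)
print(metrics)
```

In this single-label setting the "accuracy" described in the question coincides with recall@k, so MRR is probably the more informative addition, since it rewards placing the ground truth higher in the list.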

Related

In ML text classification, what if the text doesn't belong to any category?

I'm using logistic regression to classify news text into types such as sports, politics, business and entertainment. Text that doesn't belong to any of these categories is still predicted as one of them. How can I prevent this? And how can I assign out-of-category text to an other_category label?
The predict method gives you the prediction with the highest probability. The predict_proba method gives you the probability score for each category, so you can use the max() function to get the maximum probability and then an if statement to check whether it exceeds the required value: if so, print the prediction; otherwise, print "others".
If that is unclear, look at the sample code.
model.fit(text, tags)
# The input must be a list, because that is what predict expects; add more texts separated by commas.
textToBeClassified = ["Google's shares are going down"]
prediction = model.predict(textToBeClassified)  # returns the tag, e.g. politics, sports or business
predictionConfidence = model.predict_proba(textToBeClassified)  # one row of class probabilities per input text
maxConfidence = max(predictionConfidence[0])  # only one text here, so take the maximum of the first row
if maxConfidence > 0.8:  # output only if the model is more than 80% confident; use 0.9 for 90%
    print(prediction)
else:
    print("Sorry the text doesn't fall under any of the categories")
Try adding print statements here and there so you know what is happening:
model.fit(text, tags)
textToBeClassified = ["Google's shares are going down"]
prediction = model.predict(textToBeClassified)
print("Predicted as:", prediction)
predictionConfidence = model.predict_proba(textToBeClassified)
print("The confidence scores:", predictionConfidence)
maxConfidence = max(predictionConfidence[0])
print("maximum confidence score is:", maxConfidence)
if maxConfidence > 0.8:
    print(prediction)
else:
    print("Sorry the text doesn't fall under any of the categories")
Like this :)
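The same thresholding idea can be applied to many texts at once and mapped to an explicit fallback label. This is a minimal sketch, assuming a probability matrix shaped like the output of predict_proba and a class list like model.classes_; the helper predict_with_fallback, the label "other" and the numbers are made up for illustration.

```python
import numpy as np


def predict_with_fallback(probabilities, classes, threshold=0.8, fallback="other"):
    """Return the most probable class per row, or `fallback` when confidence is too low."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = probabilities.argmax(axis=1)     # index of the top class in each row
    confidence = probabilities.max(axis=1)  # probability of that top class
    labels = np.asarray(classes, dtype=object)[best]
    labels[confidence <= threshold] = fallback  # keep a label only if confidence > threshold
    return labels.tolist()


# Made-up values standing in for model.predict_proba(texts) and model.classes_
proba = [[0.90, 0.05, 0.05],
         [0.40, 0.35, 0.25]]
classes = ["business", "politics", "sports"]
labels = predict_with_fallback(proba, classes)
print(labels)
```

The threshold is a tuning knob rather than a fixed rule: a value that works for four categories may be too strict when there are many more.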

nltk latent semantic analysis copies the first topics over and over

This is my first attempt at Natural Language Processing, so I started with Latent Semantic Analysis and used this tutorial to build the algorithm. After testing it, I see that it only classifies the first semantic words and repeats the same terms over and over for the other documents.
I also tried feeding it the documents found here, and it does exactly the same: it repeats the values of the same topic several times in the other ones.
Could anyone help explain what is happening? I've been searching all over, and everything seems exactly like in the tutorials.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer as tfdv
testDocs = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''
# First we apply the standard SKLearn algorithm to compare with.
# Lowercase every document; rebinding the loop variable alone would leave testDocs unchanged.
testDocs = [element.lower() for element in testDocs]
print(testDocs)
# Vectorize the features.
vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)  # , ngram_range=(1,3))
# Store the values in matrix X.
X = vectorizer.fit_transform(testDocs)
# Apply LSA.
lsa = TruncatedSVD(n_components=3, n_iter=100)
lsa.fit(X)
# Get a list of the terms in the order it was decomposed.
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
print("Terms decomposed from the document: " + str(terms))
print()
# Prints the matrix of concepts. Each number represents how important the term is to the concept and the position relates to the position of the term.
print("Number of components in element 0 of matrix of components:")
print(lsa.components_[0])
print("Shape: " + str(lsa.components_.shape))
print()
for i, comp in enumerate(lsa.components_):
    # Stick each of the terms to the respective components. Zip command creates a tuple from 2 components.
    termsInComp = zip(terms, comp)
    # Sort the terms by their weight in this component, highest first.
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
    print("Concept %d" % i)
    for term in sortedTerms:
        print(term[0], end="\t")
    print()
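One thing worth checking in the loop above: it prints every term for every concept, and with max_features=8 all concepts share the same eight terms, only in a different order, which may be part of the repetition. Below is a minimal sketch of printing only the few highest-weighted terms per concept. The components matrix and terms list are made-up stand-ins for lsa.components_ and the vectorizer's feature names, and top_terms is a hypothetical helper.

```python
import numpy as np


def top_terms(components, terms, n=3):
    """For each component, return the n terms with the largest weights."""
    components = np.asarray(components, dtype=float)
    result = []
    for weights in components:
        top_idx = np.argsort(weights)[::-1][:n]  # indices of the n largest weights
        result.append([terms[i] for i in top_idx])
    return result


# Made-up values standing in for the vectorizer's terms and lsa.components_
terms = ["book", "dummies", "estate", "guide"]
components = [[0.1, 0.7, 0.2, 0.5],
              [0.6, -0.1, 0.4, 0.0]]
concepts = top_terms(components, terms, n=2)
for i, concept in enumerate(concepts):
    print("Concept %d: %s" % (i, ", ".join(concept)))
```

Whether the sign of a weight should matter (sorting by absolute value instead) depends on how the concepts are to be interpreted; the sketch keeps the question's choice of sorting by the raw weight.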

Some diverging issues of Word2Vec in Gensim using high alpha values

I am implementing word2vec in gensim on a corpus of nested lists (a list of sentences, each a list of tokenized words) with 408226 sentences (lists) and a total of 3150546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with size 200, window 15, min_count 5, iter 10 and alpha 0.5. All words are lemmatized, and the resulting model vocabulary has 32716 entries.
The results obtained with the default alpha value, size, window and dimensions are meaningless to me, based on the data used to compute the similarity values. A higher alpha of 0.5, however, gives me meaningful similarity scores between two words. Yet when I calculate the top n similar words, the results are again meaningless. Do I need to change all the parameters used in the initial training process?
I still cannot work out why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, yet gives meaningless results when computing the top n similar words with scores for an input word. Why is this the case?
Is the model diverging from the optimal solution? How can I check this?
Any idea why this is the case is deeply appreciated.
Note: I'm using Python 3.7 on a Windows machine with the Anaconda prompt, and the model input is read from a file.
This is what I have tried:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast
path = "F:/Folder/"
def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt', 'data_d2.txt']:
        with open(path + file, 'r', encoding='utf-8') as f1:
            # literal_eval receives the unpacked lines, so this works only if each file is a single line
            Sentences.extend(ast.literal_eval(*f1.readlines()))
load_data()
def initialize_word_embedding():
    # gensim < 4.0 argument names; gensim 4+ uses vector_size= and epochs= instead of size= and iter=
    model = Word2Vec(Sentences, size=200, window=15, min_count=5, iter=10, workers=4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1='structure', w2='_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word, score)
initialize_word_embedding()
The example of Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it to these files, and now feed them in as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most similar words for a given input word. I am getting meaningful scores from the model.wv.similarity() method, whereas when calculating the most_similar() words of a word (say, system, as shown above) I am not getting the desired results.
I am guessing that the model is diverging from the global minimum because of the high alpha value.
I am unsure what the dimension size and window should be to get meaningful results, as there are no rules on how to choose the size and window.
Any suggestion is appreciated. The total numbers of sentences and words are given above in the question.
Edit in response to a recent comment:
Results I get without setting alpha = 0.5:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
This is meaningless to me, as the similarity is very low. However, I get 71% when setting alpha to 0.5, which seems meaningful to me, since the word set is the same in both domains.
Explanation: the word set should be the same in both domains (I am comparing data from two domains that share the word). Don't be confused by _set_: it is the same word as set; I added the character _ at the start and end to distinguish the two domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity about 0.00 for the word set across the two datasets?
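For context, model.wv.similarity and most_similar both rank by cosine similarity between word vectors, so a pairwise score near 0 means the two vectors are almost orthogonal. Here is a minimal numpy sketch with made-up 3-dimensional vectors; the words and values are illustrative only and are not taken from the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration
vectors = {
    "set": [1.0, 0.0, 0.0],
    "_set_": [0.0, 1.0, 0.0],       # orthogonal to "set"
    "cardinality": [0.9, 0.1, 0.0],
}
sim = cosine_similarity(vectors["set"], vectors["_set_"])
neighbours = most_similar("set", vectors)
print(sim, neighbours)
```

Because both calls rely on the same measure, a pair that scores well with similarity but never shows up in most_similar simply has other words that score higher.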

Finding the top three relevant category and its corresponding probabilities

With the script below, I find the highest probability and its corresponding category in a multi-class text classification problem. How do I find the top 3 predicted probabilities and their corresponding categories efficiently, without using loops?
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order=np.argsort(probabilities, axis=1)
classification=(classifier.classes_[order[:, -1:]])
print(accuracy_score(classification,y_test))
Thanks in advance.
(I have around 50 categories. I want to extract the 3 most relevant categories out of the 50 for each of my narrations and display them in a dataframe.)
You've done most of the hard work here; you're just missing a bit of numpy foo to finish it off. Your line
order = np.argsort(probabilities, axis=1)
contains the indices of the sorted probabilities, i.e. [[lowest_prob_class_1, ..., highest_prob_class_1], ...] for each of your samples. You used it to get your classification with order[:, -1:], i.e. the index of the highest-probability class. So to get the top three classes we can just make a simple change:
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities we can use
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
                                    order[:, -3:].flatten()].reshape(order.shape[0], 3)
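Note that order[:, -3:] lists the three classes in ascending order of probability, so the best class comes last. An alternative sketch, assuming NumPy 1.15 or later for np.take_along_axis, returns them best first; the probabilities and class names are made-up stand-ins for classifier.predict_proba(X_test) and classifier.classes_.

```python
import numpy as np

# Made-up values standing in for classifier.predict_proba(X_test) and classifier.classes_
probabilities = np.array([[0.10, 0.50, 0.15, 0.25],
                          [0.40, 0.05, 0.35, 0.20]])
classes = np.array(["a", "b", "c", "d"])

# Column indices of the three largest probabilities per row, best first
top_3_idx = np.argsort(probabilities, axis=1)[:, ::-1][:, :3]
top_3_classes = classes[top_3_idx]
# Pick the matching probabilities row by row, without an explicit loop
top_3_probabilities = np.take_along_axis(probabilities, top_3_idx, axis=1)

print(top_3_classes)
print(top_3_probabilities)
```

Both arrays have shape (n_samples, 3), so they can be placed side by side in a dataframe, one column per rank.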

Random Forest Classifier: to which class do the probabilities correspond?

I am using the RandomForestClassifier from pyspark.ml.classification.
I run the model on a binary class dataset and display the probabilities.
I have the following in the probability column:
+-----+----------+---------------------------------------+
|label|prediction|probability |
+-----+----------+---------------------------------------+
|0.0 |0.0 |[0.9005918461098429,0.0994081538901571]|
|1.0 |1.0 |[0.6051335859900139,0.3948664140099861]|
+-----+----------+---------------------------------------+
I have a list of 2 elements, which obviously correspond to the probabilities of the predicted class.
My question: does probability[0] always correspond to the value of prediction? The Spark documentation is not clear on this.
I am interpreting your question as asking: does the first element in the array under the column 'probability' always correspond to the "predicted class", by which you mean the label the Random Forest Classifier predicted the observation should have?
If I have that correct, the answer is yes.
The items in the arrays in both probability rows can be read as the model telling you:
['My confidence that the predicted label = the true label',
'My confidence that the label != the true label']
In the case of multiple labels being predicted, then you would have the model telling you:
['My confidence that the label I predict = specific label 1',
'My confidence that the label I predict = specific label 2',
...'My confidence that the label I predict = specific label N']
This is indexed by the N labels you are trying to predict (which means you have to be careful about the way the labels are structured).
Perhaps it would help to take a look at this answer. You could do something like:
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
predictions.show(10)  # show() prints the rows itself, so no print is needed
(Using the relevant pipeline and data from your examples.)
This will show you the probabilities for each class.
I posted almost the same question here, and I think the answer might help you:
Scala: how to know which probability correspond to which class?
The answer lies in what happens before the model is fitted.
To fit the model we use a labelIndexer on the target. This label indexer transforms the target into an index ordered by descending frequency.
E.g., if my target has 20% "aa" and 80% "bb", the label indexer will create a column "label" that takes the value 0 for "bb" and 1 for "aa" (because "bb" is more frequent than "aa").
When we fit a random forest, the probabilities follow this order of frequency.
In binary classification:
first proba = probability that the class is the most frequent class in the train set
second proba = probability that the class is the less frequent class in the train set
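The frequency-based indexing described above can be mimicked in plain Python. This is only a sketch of the ordering rule and does not call Spark: frequency_desc_index is a hypothetical helper, it assumes the indexer orders labels by descending frequency, and the alphabetical tie-break is an assumption as well.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy target column: 80% "bb" and 20% "aa", as in the example above
train_labels = ["bb"] * 8 + ["aa"] * 2
mapping = frequency_desc_index(train_labels)
print(mapping)
```

Under this mapping, position i of the probability vector refers to the label whose index is i, so here the first probability would belong to "bb" and the second to "aa".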
