I have a file of lines, each containing a Persian sentence, a tab, a Persian word (tag), another tab, and an English word (tag). The English word indicates the class of each sentence. There are 2 classes in this file, "passion" and "salty". I classified the sentences with the naive Bayes algorithm and now I have to calculate precision and recall, so I need to build a confusion matrix, but I don't know how. I wrote a small piece of code and assumed that "passion" is the positive class and "salty" is the negative class; the code returned the output for this case. But if I instead take "salty" as positive and "passion" as negative, the numbers are totally different from the first case, and consequently the precision and recall I calculate don't come out right. Should I calculate tp, tn, fp and fn separately for the 2 classes (once for passion and once for salty), then average them, and then calculate precision and recall from that average?
(Hint 1: argmax is the output of the NB algorithm; it is the tag the code predicted for each test sentence.
Hint 2: I also have some other files with more than 2 classes.)
#t = line.strip().split("\t")
if t[2] == "passion" and argmax == "passion":
    tp += 1
elif t[2] == "passion" and argmax != "passion":
    fn += 1
elif t[2] == "salty" and argmax != "salty":
    fp += 1
elif t[2] == "salty" and argmax == "salty":
    tn += 1
print("tp", tp, "tn", tn, "fp", fp, "fn", fn)
You should use scikit-learn, which already provides a confusion matrix and a classification report. A sample:
from sklearn.metrics import confusion_matrix, classification_report
# suppose your predictions are stored in a variable called preds
# and the true values are stored in a variable called y
print(confusion_matrix(y, preds))
print(classification_report(y, preds))
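Note that classification_report already prints precision and recall for each class separately, plus macro and weighted averages, so you don't have to decide which class is "positive". If you want those numbers programmatically, something like this should work (same y and preds as above):

from sklearn.metrics import precision_score, recall_score

# one value per class (works for 2 classes or more)
print(precision_score(y, preds, average=None))
print(recall_score(y, preds, average=None))

# unweighted mean over the classes (macro average)
print(precision_score(y, preds, average='macro'))
print(recall_score(y, preds, average='macro'))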
(By the way, even if the rest of your pipeline is on an older Python such as 2.7, it is probably safe to use just these metric functions, since you already have the model built.)
Also, since you are working in the NLP domain, you could use the facilities that the nltk library provides. I'm not an expert in it, but I suppose they should be useful.
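For instance, I believe nltk has a ConfusionMatrix class that works directly on the two tag lists:

from nltk.metrics import ConfusionMatrix

# y = list of true tags, preds = list of predicted tags
cm = ConfusionMatrix(y, preds)
print(cm)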
Gensim's Word2Vec model has a great method which allows you to find the top n most similar words in the model's vocabulary given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
negative=['word4','word5'], topn=10)
What I am looking to do is create a word vector that represents the averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.
Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But as an open-source project, you can look at its exact Python code for that operation, and use it as a model for your own calculations.
For the current code defining that function, see:
https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687
Following the advice above, I chose to look at the Gensim source code and copy its method for averaging the vectors. Here is the code in case it helps anyone else.
Note: this code is copied from gensim and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL
KEY_TYPES = (str, int, np.integer)
'''
FUNCTION : meanVector(...)

INPUT :
    keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
    positive     : list of words or vectors to be applied positively [default = list()]
    negative     : list of words or vectors to be applied negatively [default = list()]

OUTPUT :
    averaged word vector, [type = numpy.ndarray]

DESCRIPTION :
    allows for simple averaging of positive and negative words and vectors given a
    gensim model's word vector library.
'''
def meanVector(keyedVectors, positive=list(), negative=list()):
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]

    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.
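A quick usage sketch, assuming model is a trained gensim Word2Vec model and the placeholder words are in its vocabulary:

# model = gensim.models.Word2Vec(...)  # your trained model
vec = meanVector(model.wv,
                 positive=['word1', 'word2', 'word3'],
                 negative=['word4', 'word5'])
print(vec.shape)                                   # (vector_size,), unit-normed
print(model.wv.similar_by_vector(vec, topn=10))    # compare against the rest of the vocabulary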
I have trained my word2vec model from gensim and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:
top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777
It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because it includes many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix the issue?
Note I am using model.similarity(t1, t2)
This is how I trained my Word2Vec Model:
documents = list()
tokenize = lambda x: gensim.utils.simple_preprocess(x)
t1 = time.time()
docs = read_files(TEXT_DIRS, nb_docs=5000)
t2 = time.time()
print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
print('Number of documents: %i' % len(docs))
# Training the model
model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
model.save(os.path.join(MODEL_DIR, 'word2vec'))
weights = model.wv.vectors
index_words = model.wv.index2word
vocab_size = weights.shape[0]
embedding_dim = weights.shape[1]
print('Shape of weights:', weights.shape)
print('Vocabulary size: %i' % vocab_size)
print('Embedding size: %i' % embedding_dim)
Below is the read_files function I defined:
def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                    if count % 100 == 0:
                        print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break
    return documents
I tried this thread, but it doesn't help me because my text is Arabic and contains many misspellings.
Update
I tried the following (computing the similarity of the exact same word with itself):
print(model.similarity('الاحتلال','الاحتلال'))
and it gave me the following result:
1.0000001
Definitionally, the cosine-similarity measure should max at 1.0.
But in practice, floating-point number representations in computers have tiny imprecisions in the deep-decimals. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-distance), those will sometimes lead to slight deviations from what the expected maximum or exactly-right answer "should" be.
(Similarly: sometimes calculations that, mathematically, should result in the exact same answer no matter how they are reordered/regrouped deviate slightly when done in different orders.)
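A tiny plain-Python illustration of that regrouping effect:

print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6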
But, as these representational errors are typically "very small", they're usually not of practical concern. (They are especially small in the range of numbers around -1.0 to 1.0, but can become quite large when dealing with giant numbers.)
In your original case, the deviation is just 0.000000119209289. In the word-to-itself case, the deviation is just 0.0000001. That is, about one-ten-millionth off. (Your other sub-1.0 values have similar tiny deviations from perfect calculation, but they aren't noticeable.)
In most cases, you should just ignore it.
If you find it distracting to you or your users in numerical displays/logging, simply choosing to display all such values to a limited number of after-the-decimal-point digits – say 4 or even 5 or 6 – will hide those noisy digits. For example, using a Python 3 format-string:
sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6f}")
(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision – see numpy.set_printoptions – though that shouldn't affect the raw Python floats you're examining.)
If for some reason you absolutely need the values to be capped at 1.0, you could add extra code to do that. But, it's usually a better idea to choose your tests & printouts to be robust to, & oblivious with regard to, such tiny deviations from perfect math.
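For example, a one-line clamp (purely cosmetic):

sim = min(model.similarity('الاحتلال', 'الاحتلال'), 1.0)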
I am doing a project on multiclass semantic segmentation. I have built a model that outputs pretty decent segmented images as the loss value decreases. However, I cannot evaluate the model's performance with metrics such as mean IoU or the Dice coefficient.
In the case of binary semantic segmentation it was easy to just set a threshold of 0.5 to classify the outputs as object or background, but that does not work for multiclass semantic segmentation. Could you please tell me how to obtain the model's performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in literature). So, consider that before using the implementation for any formal publication.
Basically, the other method suggested by the issue-poster is to separately accumulate the intersections and unions over the entire dataset and divide them at the final step. The method in the below original answer computes intersection and union for a batch of images, then divides them to get IoU for the current batch, and then takes a mean of the IoUs over the entire dataset.
However, the original method given below is problematic because the final mean IoU varies with the batch size. The mIoU from the method mentioned in the issue, on the other hand, does not vary with the batch size, since the separate accumulation ensures that batch size is irrelevant (though a higher batch size can definitely help speed up the evaluation).
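A rough sketch of that accumulation idea (my own paraphrase of the issue, not its exact code; it assumes model(image) returns per-class logits of shape (N, C, H, W) and loader yields (image, label) batches):

import numpy as np
import torch
import torch.nn.functional as F

def dataset_mIOU(loader, model, num_classes=19):
    # running per-class totals, accumulated over the whole dataset
    inter_totals = np.zeros(num_classes)
    union_totals = np.zeros(num_classes)
    with torch.no_grad():
        for image, label in loader:
            pred = torch.argmax(F.softmax(model(image), dim=1), dim=1).view(-1)
            label = label.view(-1)
            for c in range(num_classes):
                pred_inds = (pred == c)
                target_inds = (label == c)
                inter = (pred_inds & target_inds).long().sum().item()
                union = pred_inds.long().sum().item() + target_inds.long().sum().item() - inter
                inter_totals[c] += inter
                union_totals[c] += union
    # divide only once at the end; classes that never appear (union == 0) are skipped
    ious = [i / u for i, u in zip(inter_totals, union_totals) if u > 0]
    return float(np.mean(ious))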
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).squeeze(1)
    iou_list = list()
    present_iou_list = list()

    pred = pred.view(-1)
    label = label.view(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
Your model's prediction will have one channel of scores per class, so first take the softmax (if your model doesn't already apply it), followed by argmax, to get the index of the class with the highest probability at each pixel. Then, we calculate the IoU for each class (and take the mean over them at the end).
We can reshape both the prediction and the label as 1-D vectors (I read that it makes the computation faster). For each class, we first identify the indices of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class while 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This will make that class's IoU calculation invalid as it is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we divide the intersection by the union to get the IoU of that particular class and append it to the list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
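For the Dice coefficient, only the final ratio changes: per class, Dice = 2 * intersection / (|pred| + |target|). A minimal sketch along the same lines (same assumptions as the mIoU function above):

import numpy as np
import torch
import torch.nn.functional as F

def mean_dice(label, pred, num_classes=19):
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).view(-1)
    label = label.view(-1)
    dice_list = []
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            continue  # class absent in the target, skip it (like the NaN case above)
        intersection = (pred_inds & target_inds).long().sum().item()
        denom = pred_inds.long().sum().item() + target_inds.long().sum().item()
        dice_list.append(2.0 * intersection / denom)
    return np.mean(dice_list)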
I did a PCA on my 3D image datasets, and used the first n PCs as my features in a linear SVM. I have SVM weights for each PC. Now, I want to project the PC weights into original image space to find what regions of the image were more discriminative in the classification process. I used the inverse_transform PCA method on the weight vector. However, the resulting image only has positive values, whereas the SVM weights were both positive and negative. This makes me think if my approach is a valid one. Does anybody have any suggestions?
Thanks in advance.
I have a program that does this projection in image space. The thing to realise is that the weights themselves do not define the 'discrimination' weights (as also termed in this paper). You need the sum of the inputs weighted by their kernel coefficients.
Consider this toy example:
Class A has 2 vectors: a1=(1,1) and a2=(2,2)
Class B has 2 vectors: b1=(2,4) and b2=(4,2).
If you draw this, you can construct the decision boundary by hand: it's the line of points (x,y) where x+y == 5. My SVM program finds the solution where w_a1 == 0 (no support vector), w_a2 == -1, and w_b1 == w_b2 == 1/2, with bias == -5.
Now you can construct the projection vector p = a2*w_a2 + b1*w_b1 + b2*w_b2 = -1*(2,2) + 1/2*(2,4) + 1/2*(4,2) = (1,1).
In other words, every point should be projected onto the line y == x, and for a new vector v the inner product <v,p> is below 5 for class A vectors, and above 5 for class B vectors. You can centre the result around 0 by adding the bias.
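If it helps, this toy construction can be checked with scikit-learn's linear SVC: dual_coef_ holds the signed kernel coefficients (alpha_i * y_i) of the support vectors, and the coefficient-weighted sum of the support vectors is exactly the projection vector p (a small sketch, using a large C to approximate the hard-margin solution):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 4], [4, 2]])   # a1, a2, b1, b2
y = np.array([0, 0, 1, 1])                        # class A = 0, class B = 1

clf = SVC(kernel='linear', C=1e6).fit(X, y)
p = clf.dual_coef_ @ clf.support_vectors_         # sum of support vectors * signed coefficients
print(p, clf.intercept_)                          # approximately [[1. 1.]] and [-5.]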
I'm doing a project on liver tumor classification. I initially used the region growing method for liver segmentation, and from that I segmented the tumor using FCM.
I then obtained the texture features using the Gray Level Co-occurrence Matrix. My output for that was
stats =
autoc: [1.857855266614132e+000 1.857955341199538e+000]
contr: [5.103143332457753e-002 5.030548650257343e-002]
corrm: [9.512661919561399e-001 9.519459060378332e-001]
corrp: [9.512661919561385e-001 9.519459060378338e-001]
cprom: [7.885631654779597e+001 7.905268525471267e+001]
Now, how should I give this as an input to the SVM program?
function [itr] = multisvm( T,C,tst )
%MULTISVM(2.0) classifies the class of given training vector according to the
% given group and gives us result that which class it belongs.
% We have also to input the testing matrix
%Inputs: T=Training Matrix, C=Group, tst=Testing matrix
%Outputs: itr=Resultant class(Group,USE ROW VECTOR MATRIX) to which tst set belongs
%----------------------------------------------------------------------%
% IMPORTANT: DON'T USE THIS PROGRAM FOR CLASS LESS THAN 3,              %
%            OTHERWISE USE svmtrain,svmclassify DIRECTLY or             %
%            add an else condition also for that case in this program.  %
%            Modify required data to use Kernel Functions and Plot also %
%----------------------------------------------------------------------%
% Date:11-08-2011(DD-MM-YYYY)
% This function for multiclass Support Vector Machine is written by
% ANAND MISHRA (Machine Vision Lab. CEERI, Pilani, India)
% and this is free to use. email: anand.mishra2k88@gmail.com
% Updated version 2.0 Date:14-10-2011(DD-MM-YYYY)

u=unique(C);
N=length(u);
c4=[];
c3=[];
j=1;
k=1;
if(N>2)
    itr=1;
    classes=0;
    cond=max(C)-min(C);
    while((classes~=1)&&(itr<=length(u))&& size(C,2)>1 && cond>0)
        %This while loop is the multiclass SVM Trick
        c1=(C==u(itr));
        newClass=c1;
        svmStruct = svmtrain(T,newClass);
        classes = svmclassify(svmStruct,tst);

        % This is the loop for Reduction of Training Set
        for i=1:size(newClass,2)
            if newClass(1,i)==0;
                c3(k,:)=T(i,:);
                k=k+1;
            end
        end
        T=c3;
        c3=[];
        k=1;

        % This is the loop for reduction of group
        for i=1:size(newClass,2)
            if newClass(1,i)==0;
                c4(1,j)=C(1,i);
                j=j+1;
            end
        end
        C=c4;
        c4=[];
        j=1;
        cond=max(C)-min(C); % Condition for avoiding group
                            % to contain similar type of values
                            % and then reduce them to process

        % This condition can select the particular value of iteration
        % based on classes
        if classes~=1
            itr=itr+1;
        end
    end
end
end
Kindly guide me.
You have to take all the feature values you get and concatenate them into a feature vector. Then for the SVM the features should be normalized so that the values in each dimension vary between -1 and 1, if I remember correctly. I think libsvm has a function for doing the normalization.
So assuming your feature vector ends up having N dimensions, and you have M training instances, your training set should be an M x N matrix. Then if you have P test instances, your test set should be a P x N matrix.
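As a rough illustration of that layout (a Python/NumPy sketch; the field names follow the stats struct in the question, and the second image's numbers are made up for the demo):

import numpy as np

# hypothetical GLCM stats for two training images; each field holds two values
# (one per offset), so one image yields a 10-dimensional feature vector
train_stats = [
    {'autoc': [1.858, 1.858], 'contr': [0.051, 0.050], 'corrm': [0.951, 0.952],
     'corrp': [0.951, 0.952], 'cprom': [78.86, 79.05]},
    {'autoc': [1.700, 1.710], 'contr': [0.060, 0.061], 'corrm': [0.940, 0.941],
     'corrp': [0.940, 0.941], 'cprom': [70.10, 70.30]},
]

def to_feature_vector(stats):
    # concatenate all fields into one flat feature vector of length N
    return np.concatenate([stats[k] for k in ('autoc', 'contr', 'corrm', 'corrp', 'cprom')])

X_train = np.vstack([to_feature_vector(s) for s in train_stats])   # M x N matrix
print(X_train.shape)                                                # (2, 10)

# scale each column to [-1, 1] (fit the scaling on the training data only)
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = 2 * (X_train - col_min) / (col_max - col_min) - 1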
May I also suggest a very popular implementation of SVM called SVMlight: http://svmlight.joachims.org/.
You can find examples on the website of how to use it. A MEX/MATLAB wrapper for it is also available.
As pointed out by Dima, you need to concatenate the features.
By the way, can you tell me which dataset you are using for liver tumor classification?
Is it publicly available for download?