How to calculate tf-idf value?


My calculation is:
TF(d) = 3/5
IDF(d) = ln(4/3)
TF * IDF = 0.17, but that is not one of the answer choices, and 0.00 is not the correct answer either.
Question:
The following question will ask you about a corpus with the following documents.
Document 1: a a b c
Document 2: a c c c d e f
Document 3: a c d d d
Document 4: a d f
What is the tf-idf value for "d" in Document 3?
Round answers to two decimal places. Use the natural logarithm (log base e) when taking a logarithm.
select:
0.00
0.57
0.69
0.86
2.07
3.47
6.00

TF = (number of times the term appears in the document) / (total number of terms in the document)
IDF = log( (number of documents in the corpus) / (number of documents in the corpus that contain the term) )
The TF-IDF of a term is calculated by multiplying the TF and IDF scores.
TF-IDF = TF * IDF
You can calculate it that way. However, some libraries use the natural logarithm. In addition, one can be added to the denominator, as follows, in order to avoid division by zero.
IDF = log( (number of documents in the corpus) / (number of documents in the corpus that contain the term + 1) )
Example: imagine the term x appears 20 times in a document that contains a total of 100 words. Then TF is:
TF = 20/100 = 0.2
Assume the collection contains 10,000 documents. If 100 of those 10,000 documents contain the term x, then IDF (using log base 10) is:
IDF = log10(10000/100) = 2
By multiplying these two quantities, we can calculate the TF-IDF score of the term x for the document.
TF-IDF = 0.2 * 2 = 0.4
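The formulas above can be applied to the corpus from the question. The following is a minimal sketch, not a reference implementation: the helper names (`idf`, `tf_relative`, `tf_raw`) are made up for illustration, and the raw-count variant of TF is an assumption about what the quiz might expect. With relative frequency the result is 0.17, as the asker found; with the raw count it is 3 * ln(4/3), about 0.86, which is one of the listed choices.

```python
import math

# Corpus from the question; each document is a list of tokens.
corpus = {
    "Document 1": "a a b c".split(),
    "Document 2": "a c c c d e f".split(),
    "Document 3": "a c d d d".split(),
    "Document 4": "a d f".split(),
}

def idf(term, docs):
    """Natural-log inverse document frequency, without smoothing.

    Raises ZeroDivisionError if no document contains the term.
    """
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing)

def tf_relative(term, tokens):
    """Term count divided by the document length."""
    return tokens.count(term) / len(tokens)

def tf_raw(term, tokens):
    """Plain term count (assumed alternative definition of TF)."""
    return tokens.count(term)

doc3 = corpus["Document 3"]
tfidf_relative = tf_relative("d", doc3) * idf("d", corpus)
tfidf_raw = tf_raw("d", doc3) * idf("d", corpus)

print(round(tfidf_relative, 2))  # 0.17
print(round(tfidf_raw, 2))       # 0.86
```

Which TF definition applies depends on the course or library in use, so it is worth checking that definition before comparing numbers.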

Related

Algorithm / function to find the weighting of parameters that influence a result

I'm trying to find a way to predict high Result values based on statistical data, which is why I need to find the weights of the parameters.
I have a data sample with the following structure:
Price A   Price B   Price C   Result
5         4         9         80
2         3         0         30
Based on that structure I would like to calculate weights for Price A, Price B and Price C to predict the highest values of the Result column.
Based on the data above, Price C seems to be the most influential parameter, so the weight for Price C should be the highest one.
So for this data:
weightA * priceA + weightB * priceB + weightC * priceC = should generate a Result value of around 80 for the first row, but for the second row ~30 should be predicted
I've tried something with the Pearson algorithm (in Excel) but I didn't get 'good' results :(.

Why is the scikit-learn confusion matrix reversed?

I have 3 questions:
1)
The confusion matrix for sklearn is as follows:
TN | FP
FN | TP
While when I'm looking at online resources, I find it like this:
TP | FP
FN | TN
Which one should I consider?
2)
Since the above confusion matrix for scikit-learn is different from the one I find in other resources, what will the structure be for a multiclass confusion matrix? I'm looking at this post here:
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
In that post, @lucidv01d posted a graph to explain the categories for multiclass. Are those categories the same in scikit-learn?
3)
How do you calculate the accuracy of a multiclass? for example, I have this confusion matrix:
[[27 6 0 16]
[ 5 18 0 21]
[ 1 3 6 9]
[ 0 0 0 48]]
In that same post I referred to in question 2, he has written this equation:
Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
but isn't that just for binary? I mean, for which class do I take TP?

The reason sklearn shows its confusion matrix as
TN | FP
FN | TP
is that its code treats 0 as the negative class and 1 as the positive class. sklearn sorts the class labels, so the smaller value is treated as negative and the larger value as positive. By number, I mean the class value (0 or 1). The order depends on your dataset and classes.
The accuracy is the sum of the diagonal elements divided by the sum of all the elements. The diagonal elements are the numbers of correct predictions.
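The layout described above can be reproduced without scikit-learn. The following is a small sketch in plain Python; the function name `confusion_matrix_rows_true` and the label lists are hypothetical, made up for illustration. It places true labels on rows and predictions on columns and sorts the labels, so classes 0 and 1 end up as TN | FP over FN | TP.

```python
def confusion_matrix_rows_true(y_true, y_pred):
    """Count matrix with true labels on rows and predictions on columns.

    Labels are sorted, so with classes 0 and 1 the layout is
    [[TN, FP], [FN, TP]].
    """
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels, invented for this example.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 1]

matrix = confusion_matrix_rows_true(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)  # [[2, 1], [1, 3]]

# Accuracy: diagonal sum divided by the sum of all elements.
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)
```

Swapping the two index positions in `matrix[index[t]][index[p]]` would give the transposed convention that some other references use.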

As the sklearn guide says: "(Wikipedia and other references may use a different convention for axes)"
What does it mean? When building the confusion matrix, the first step is to decide where to put the predictions and the real values (true labels). There are two possibilities:
put predictions in the columns and true labels in the rows
put predictions in the rows and true labels in the columns
Which one you choose is purely a matter of convention. From the scikit-learn documentation it is clear that scikit-learn's convention is to put predictions in the columns and true labels in the rows.
Thus, according to scikit-learn's convention:
the first column contains negative predictions (TN and FN)
the second column contains positive predictions (TP and FP)
the first row contains negative labels (TN and FP)
the second row contains positive labels (TP and FN)
the diagonal contains the number of correctly predicted labels.
Based on this information I think you will be able to solve part 1 and part 2 of your questions.
For part 3, you just sum the values on the diagonal and divide by the sum of all elements, which is
(27 + 18 + 6 + 48) / (27 + 18 + 6 + 48 + 6 + 16 + 5 + 21 + 1 + 3 + 9) = 99 / 160 ≈ 0.62
or you can just use the classifier's score() method.
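For the multiclass case, the per-class TP, FP, FN and TN values can be read off the matrix one class at a time (one-vs-rest). The sketch below uses the matrix from the question and assumes scikit-learn's layout, with true classes on rows and predicted classes on columns; `per_class_counts` is a hypothetical helper name.

```python
# Multiclass confusion matrix from the question: rows are true classes,
# columns are predicted classes (scikit-learn's convention).
matrix = [
    [27, 6, 0, 16],
    [5, 18, 0, 21],
    [1, 3, 6, 9],
    [0, 0, 0, 48],
]

def per_class_counts(matrix, k):
    """One-vs-rest (TP, FP, FN, TN) for the class at index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # true class k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually another class
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

# Overall accuracy: diagonal sum over the sum of all elements.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
overall_accuracy = correct / total
print(correct, total, overall_accuracy)  # 99 160 0.61875

for k in range(len(matrix)):
    print(k, per_class_counts(matrix, k))
```

The binary formula (TP+TN)/(TP+FP+FN+TN) applied to one class gives that class's one-vs-rest accuracy, which is a different quantity from the overall accuracy computed from the diagonal.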

The scikit-learn convention is to place predictions in columns and real values in rows.
The scikit-learn convention is to put 0 by default for the negative class (top) and 1 for the positive class (bottom). The order can be changed by passing labels=[1, 0].
You can calculate the overall accuracy in this way:
import numpy as np

M = np.array([[27, 6, 0, 16], [5, 18, 0, 21], [1, 3, 6, 9], [0, 0, 0, 48]])
# Sum of the diagonal (the correct predictions)
w = M.diagonal()
w.sum()    # 99
# Sum of all elements of the matrix
M.sum()    # 160
ACC = w.sum() / M.sum()
ACC        # 0.61875

moving average - stddev for multiple features in single vector column

I need to calculate a moving average and stddev for 20 fields. I came up with the following windowed query (see example).
val w = Window.partitionBy("id").orderBy("cykle").rowsBetween(0, windowRange)
val x = withrul.select('*,
mean($"s1").over(w).as("a1"),
sqrt( sum(pow($"s1" - mean($"s1").over(w),2)).over(w) / 5).as("sd1"),
... repeat 19 times more
Is there a way of doing this with a single vector column (feature vector)?

How do I use Cosine similarity for this use case?

If I have a query vector A and an item vector B, it would be great if someone could guide me on how to weight/normalize the vectors (and on strategies for doing so).
Vector A would have the following components: property1 (binary), property2 (binary), property3 (int in the range 0 to 50), property4 (int in the range 0 to 10).
Vector B would have the same properties.
I know that the angle between these 2 vectors, via cosine similarity, would give me the distance between them. I want to create a recommendation based on the similarity.
But I am not clear on how to normalize the properties and/or the vectors in this case, since the components are binary + binary + int range + int range. Also, if I want to give a higher weight to one property than to another, how do I do so? What options do I have?
I find examples of cosine similarity online with documents, but in this case the vectors A and B are not documents, so I am not using tf-idf.
Please advise,
Thanks

If you want to use the traditional cosine similarity between the two vectors for tf-idf, then each term is a dimension in your vector. That is, you need to form two new vectors A' and B' and compute the similarity between these two.
These vectors have a dimension for each term, and you have 66 terms (2 + 2 + 51 + 11):
property 1: true and false
property 2: true and false
property 3: 0 through 50
property 4: 0 through 10
So A' and B' will be vectors of length 66 and each element will be either 0 or 1:
A'(0) = 1 if A(0) = true, and 0 otherwise
A'(1) = 1 if A(0) = false, and 0 otherwise
etc.
Clearly, you can see that this is inefficient. You don't actually need to calculate A' or B' to use cosine similarity with tf-idf; you can just pretend you calculated them and perform the calculation on A and B. Note that length(A') = length(B') = sqrt(4) because there will be exactly 4 ones in A' and B'.
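To make the shortcut concrete, here is a minimal sketch, assuming each item is a tuple (property1, property2, property3, property4) and that every slot of the expanded vector gets a plain 0/1 weight. The names `one_hot`, `cosine` and `cosine_shortcut` are hypothetical. With exactly four ones per expanded vector, the cosine reduces to the number of matching properties divided by 4.

```python
import math

# Number of possible values per property: two booleans, 0..50, 0..10.
SIZES = (2, 2, 51, 11)

def one_hot(item):
    """Expand (bool, bool, int 0..50, int 0..10) into a 0/1 vector."""
    p1, p2, p3, p4 = item
    positions = (0 if p1 else 1, 0 if p2 else 1, p3, p4)
    vec = []
    for size, pos in zip(SIZES, positions):
        block = [0] * size
        block[pos] = 1
        vec.extend(block)
    return vec

def cosine(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_shortcut(a, b):
    """Same value without building the expanded vectors."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical items, invented for this example.
a = (True, False, 40, 3)
b = (True, True, 41, 3)

print(cosine(one_hot(a), one_hot(b)))  # 0.5
print(cosine_shortcut(a, b))           # 0.5
```

Here property 3 values 40 and 41 count as a plain mismatch, which is the limitation of treating every value as a separate term.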
tf-idf may not be your best bet though, if you want to take care of similarities within properties 3 and 4. That is, with tf-idf, a property 3 value of 40 is different from a property 3 value of 41 and different from a property 3 value of 12. However, 12 is not considered "farther away" from 40 than 41 is; they are all just different terms.
So, if you want properties 3 and 4 to incorporate a distance (1 is really close to 2 and 50 is far from 2) then you have to define a distance metric. And if you want to weigh the Boolean values more or less than properties 3 and 4, you will have to define a different distance metric too. If these are things you want to do, forget about cosine and just come up with a value.
Here's an example:
distance = abs(A.property1 - B.property1) * 5 +
           abs(A.property2 - B.property2) * 5 +
           abs(A.property3 - B.property3) / 50 * 1 +
           abs(A.property4 - B.property4) / 10 * 2
And then similarity = (the maximum of all distances) - distance.
Or, if you like, similarity = 1 / distance (for distance > 0).
You can really define it however you like. And if you need the similarity to be between 0 and 1, then normalize by dividing by the maximum possible distance.
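The weighted distance and its normalization to a 0..1 similarity can be sketched as follows. This is one possible reading, not the only one: the weights (5, 5, 1, 2) are taken from the example, while dividing each numeric difference by its largest possible value (50 and 10) is an assumption made here so that every term lies between 0 and its weight. The names `WEIGHTS`, `MAX_DIFF`, `distance` and `similarity` are hypothetical.

```python
# Weight per property, taken from the example above.
WEIGHTS = (5, 5, 1, 2)
# Largest possible absolute difference per property (assumed normalization):
# booleans differ by at most 1, property 3 by 50, property 4 by 10.
MAX_DIFF = (1, 1, 50, 10)

def distance(a, b):
    """Weighted sum of per-property differences, each scaled to 0..1."""
    return sum(
        w * abs(int(x) - int(y)) / m
        for x, y, w, m in zip(a, b, WEIGHTS, MAX_DIFF)
    )

# Reached when every property differs by its largest possible amount.
MAX_DISTANCE = sum(WEIGHTS)

def similarity(a, b):
    """1.0 for identical items, 0.0 for maximally different ones."""
    return 1 - distance(a, b) / MAX_DISTANCE

# Hypothetical items, invented for this example.
query = (True, False, 40, 3)
item = (True, False, 41, 3)
opposite_1 = (True, True, 0, 0)
opposite_2 = (False, False, 50, 10)

print(similarity(query, query))
print(similarity(query, item))
print(similarity(opposite_1, opposite_2))
```

Raising a weight makes a mismatch on that property cost more, which is how one property can be given priority over another.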

how to show that NDCG score is significant

Suppose the NDCG score for my retrieval system is .8. How do I interpret this score? How do I tell the reader that this score is significant?

To understand this, let's work through an example of Normalized Discounted Cumulative Gain (nDCG).
For nDCG we need DCG and Ideal DCG (IDCG).
Let's first understand what Cumulative Gain (CG) is.
Example: Suppose we have [Doc_1, Doc_2, Doc_3, Doc_4, Doc_5]
Doc_1 is 100% relevant
Doc_2 is 70% relevant
Doc_3 is 95% relevant
Doc_4 is 20% relevant
Doc_5 is 100% relevant
So our Cumulative Gain (CG) is
CG = 100 + 70 + 95 + 20 + 100 ### (the position of each doc doesn't matter)
= 385
and
Discounted cumulative gain (DCG) is
DCG = SUM( relevanceAt(index) / log2(index + 1) ) ### where index runs from 1 to 5
Doc_1 is 100 / log2(2) = 100.00
Doc_2 is 70 / log2(3) = 44.17
Doc_3 is 95 / log2(4) = 47.50
Doc_4 is 20 / log2(5) = 8.61
Doc_5 is 100 / log2(6) = 38.69
DCG = 100 + 44.17 + 47.5 + 8.61 + 38.69
DCG = 238.97
and the Ideal DCG (IDCG) is the DCG of the same documents sorted by decreasing relevance:
Ideal order = Doc_1, Doc_5, Doc_3, Doc_2, Doc_4
Doc_1 is 100 / log2(2) = 100.00
Doc_5 is 100 / log2(3) = 63.09
Doc_3 is 95 / log2(4) = 47.50
Doc_2 is 70 / log2(5) = 30.15
Doc_4 is 20 / log2(6) = 7.74
IDCG = 100 + 63.09 + 47.5 + 30.15 + 7.74
IDCG = 248.48
nDCG(5) = DCG / IDCG
= 238.97 / 248.48
= 0.96
Conclusion:
In the given example nDCG is 0.96. This is not a prediction accuracy; it measures how effective the ranking of the documents is compared with the ideal ranking. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Wiki reference
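The same calculation can be written as a short Python sketch. It uses the log2(index + 1) discount from the example and the five relevance values given for Doc_1 to Doc_5 (with Doc_2 at 70 in both orderings); the function names `dcg` and `ndcg` are chosen here for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains, start=1)
    )

def ndcg(gains):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Relevance of Doc_1 .. Doc_5 in the order the system returned them.
gains = [100, 70, 95, 20, 100]

print(round(dcg(gains), 2))
print(round(dcg(sorted(gains, reverse=True)), 2))
print(round(ndcg(gains), 2))
```

A list that is already sorted by decreasing relevance gives an nDCG of exactly 1, which is a quick sanity check for any implementation.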

NDCG is a ranking metric. In information retrieval you predict a sorted list of documents and then compare it with a list of relevant documents. Imagine that you predicted a sorted list of 1000 documents and there are 100 relevant documents: an NDCG of 1 is reached when the 100 relevant docs occupy the 100 highest ranks in the list.
So an NDCG of .8 means the ranking achieves 80% of the score of the best possible ranking.
This is an intuitive explanation; the real math includes some logarithms, but it is not far from this.

If you have a relatively big sample, you can use bootstrap resampling to compute confidence intervals, which will show you whether your NDCG score is significantly better than zero.
Additionally, you can use pairwise bootstrap resampling to test whether your NDCG score differs significantly from another system's NDCG score.
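A percentile bootstrap along these lines could look like the sketch below. It assumes you have one NDCG value per query; the scores shown are invented for illustration, and `bootstrap_ci` is a hypothetical helper, not a library function. For the pairwise comparison, the same function is applied to the per-query differences between the two systems, and a difference is treated as significant when the interval excludes zero.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-query NDCG values for two systems on the same queries.
system_a = [0.82, 0.75, 0.91, 0.64, 0.88, 0.79, 0.93, 0.70, 0.85, 0.77]
system_b = [0.78, 0.71, 0.90, 0.60, 0.80, 0.74, 0.89, 0.69, 0.80, 0.72]

# Confidence interval for system A's mean NDCG.
ci_a = bootstrap_ci(system_a)
print(ci_a)

# Paired bootstrap: resample the per-query differences.
differences = [a - b for a, b in zip(system_a, system_b)]
ci_diff = bootstrap_ci(differences)
print(ci_diff)
```

Resampling the differences rather than each system separately keeps the pairing by query, which usually gives a tighter interval when both systems find the same queries easy or hard.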
