I am trying to calculate precision at a given recall value (e.g. 0.9) from a precision-recall curve. The way I do it is to find the index (idx) that minimizes abs(recall - 0.9) and then take precision[idx]; I can interpolate between the points on either side of that minimum to improve accuracy. However, I think there must be a better way. Is there a function to look up or interpolate the precision from a recall value, or vice versa, on a precision-recall curve?
Below is my code; I would like a better way of doing this.
from sklearn.metrics import precision_recall_curve
y_scores_lr = m.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
idx = abs(recall - 0.9).argmin()
prec = precision[idx] # use interpolation to get a better result
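One convenient alternative is to let numpy interpolate for you. This is a minimal sketch using the precision and recall arrays returned by precision_recall_curve above; note that np.interp expects increasing x-values, while precision_recall_curve returns recall in decreasing order, so both arrays are reversed first. The result is only a piecewise-linear approximation between neighboring curve points, since precision is not a smooth function of recall.
import numpy as np

# recall is monotonically decreasing, np.interp needs increasing x-values,
# so reverse both arrays before interpolating.
prec_at_recall_09 = np.interp(0.9, recall[::-1], precision[::-1])
print(prec_at_recall_09)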
I'm trying to make an ROC curve for my model while using a Naive Bayes classifier. To do this, I need to change the value of the threshold for my classifier. The way I interpreted it, a list must be passed with the threshold value for each category. So if I had two categories, and t is the threshold I want to set (0 <= t <= 1), then I would have to pass a list like this: [1-t, t].
Anyway, when I tried plotting the ROC curve, the result did not look right.
Given the result, I figured my idea about the threshold might have been wrong, so I went to check the documentation for the Naive Bayes classifier. But when I finally found an example, I didn't understand what the criteria for the parameter is:
nb = nb.setThresholds([0.01, 10.00])
Does anyone know what must be passed as the thresholds? Suppose I want the threshold to be set at 0.7 (if the probability is over 0.7 I want the prediction to be 1); what should I pass to the thresholds parameter?
As it says in pyspark.ml's documentation for NaiveBayes under the thresholds parameter:
The class with largest value p/t is predicted, where p is the original
probability of that class and t is the class's threshold.
Therefore, thresholds can be thought of as handicaps on the probabilities. To keep it simple, in the case of binary classification, you can set the thresholds as values in the range [0, 1] such that they sum to 1. This will get you the desired rule of "classify as True if the probability is over threshold T, otherwise classify as False".
For your specific ask of a 0.7 probability threshold, this would look like:
nb = nb.setThresholds([0.3, 0.7])
assuming that the first entry is the threshold for False and the second is the threshold for True. Using these thresholds, the model classifies an example with False and True probabilities p_false and p_true by taking the greater value out of [p_false/0.3, p_true/0.7].
You can technically set the thresholds to any value. Just remember that the probability for class X will be divided by its respective threshold and compared against the other adjusted probabilities for the other classes.
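For reference, here is a minimal PySpark sketch of that rule in action. The toy DataFrame, feature values, and column names are made up for illustration; only the setThresholds call reflects the answer above.
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy training data; the feature values are arbitrary.
train = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 0.0])),
     (1.0, Vectors.dense([0.0, 1.0]))],
    ["label", "features"])

# "Predict 1 only when its probability exceeds 0.7" expressed as
# per-class thresholds that sum to 1: [1 - 0.7, 0.7].
nb = NaiveBayes(modelType="multinomial").setThresholds([0.3, 0.7])
model = nb.fit(train)
model.transform(train).select("probability", "prediction").show()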
I use PyTorch's cosine similarity function as follows. I have two feature vectors, and my goal is to make them dissimilar to each other, so I thought I could minimize their cosine similarity. I have some doubts about the way I have coded it and would appreciate your suggestions on the following questions.
I don't know why there are some negative values in val1.
I have done three steps to convert val1 to a scalar. Am I doing it the right way? Is there any other way?
To minimize the similarity, I have used 1/val1. Is this a standard way to do it? Would it be correct to use 1 - val1 instead?
def loss_func(feat1, feat2):
    cosine_loss = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    val1 = cosine_loss(feat1, feat2).tolist()
    # 1. calculate the absolute values of each element,
    # 2. sum all values together,
    # 3. divide it by the number of values
    val1 = 1/(sum(list(map(abs, val1)))/int(len(val1)))
    val1 = torch.tensor(val1, device='cuda', requires_grad=True)
    return val1
Do not convert your loss to a list. This breaks autograd, so you won't be able to optimize your model parameters with PyTorch.
A loss function is already something to be minimized. If you want to minimize the similarity, then you probably just want to return the average cosine similarity. If instead you want to minimize the magnitude of the similarity (i.e. encourage the features to be orthogonal), then you can return the average absolute value of the cosine similarity.
It seems like what you've implemented will attempt to maximize the similarity. But that doesn't appear to be in line with what you've stated. Also, to turn a minimization problem into an equivalent maximization problem you would usually just negate the measure. There's nothing wrong with a negative loss value. Taking the reciprocal of a strictly positive measure does convert it from minimization to a maximization problem, but also changes the behavior of the measure and probably isn't what you want.
Depending on what you actually want, one of these is likely to meet your needs:
import torch.nn.functional as F

def loss_func(feat1, feat2):
    # minimize average magnitude of cosine similarity
    return F.cosine_similarity(feat1, feat2).abs().mean()

def loss_func(feat1, feat2):
    # minimize average cosine similarity
    return F.cosine_similarity(feat1, feat2).mean()

def loss_func(feat1, feat2):
    # maximize average magnitude of cosine similarity
    return -F.cosine_similarity(feat1, feat2).abs().mean()

def loss_func(feat1, feat2):
    # maximize average cosine similarity
    return -F.cosine_similarity(feat1, feat2).mean()
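As a quick sanity check that gradients flow with these losses (unlike the tolist() version), here is a minimal usage sketch with made-up feature tensors:
import torch
import torch.nn.functional as F

# Hypothetical feature batches; in practice these come from your model.
feat1 = torch.randn(8, 128, requires_grad=True)
feat2 = torch.randn(8, 128, requires_grad=True)

loss = F.cosine_similarity(feat1, feat2).abs().mean()
loss.backward()  # works because the loss was never detached from the graph
print(feat1.grad.shape)  # torch.Size([8, 128])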
Is there any way I can extract the phase from the Lomb Scargle periodogram? I'm using the LombScargle implementation from gatspy.
import matplotlib.pyplot as plt
from gatspy.periodic import LombScargleFast

# t: array of observation times, y: observed values
model = LombScargleFast().fit(t, y)
periods, power = model.periodogram_auto()
frequencies = 1 / periods

fig, ax = plt.subplots()
ax.plot(frequencies, power)
plt.show()
Power gives me an absolute value. Is there any way I can extract the phase for each frequency, as I can for a discrete Fourier transform?
The Lomb-Scargle method produces a periodogram, i.e., powers at each frequency. It is designed that way for performance, compared to directly least-squares fitting a sinusoidal model at every frequency. I don't know about gatspy, but astropy does allow you to compute the best-fit phase for a specific frequency of interest; see http://docs.astropy.org/en/stable/stats/lombscargle.html#the-lomb-scargle-model . I imagine doing this for many frequencies would be much slower than computing the periodogram.
-EDIT-
The docs have since moved to:
https://docs.astropy.org/en/stable/timeseries/lombscargle.html
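For illustration, here is a library-independent sketch of what that per-frequency fit looks like: at a chosen frequency f0, fit y(t) ≈ offset + A·sin(2πf0·t) + B·cos(2πf0·t) by linear least squares and read the amplitude and phase off A and B. The arrays t and y are the ones from the question; f0 would typically be the frequency of the strongest periodogram peak.
import numpy as np

def phase_at_frequency(t, y, f0):
    """Least-squares fit of a single sinusoid at frequency f0.

    Returns (amplitude, phase) such that the best-fit model is
    offset + amplitude * sin(2*pi*f0*t + phase).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(t),
                         np.sin(2 * np.pi * f0 * t),
                         np.cos(2 * np.pi * f0 * t)])
    offset, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    amplitude = np.hypot(a, b)
    phase = np.arctan2(b, a)
    return amplitude, phase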
Let's say you're looking at a specific frequency fo. Then the corresponding period is given by P = 1/fo.
We can define a function as below:
def phase_plot(t, period):
    # t is the array of timesteps
    phases = (t / period) % 1.
    return phases
This will give you the phase of each observation for that particular frequency of interest.
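For example, a phase-folded plot at the strongest periodogram peak might look like this (a sketch; periods, power, t and y follow the question's snippet):
import matplotlib.pyplot as plt

# Pick the period at the strongest periodogram peak.
best_period = periods[power.argmax()]

phases = phase_plot(t, best_period)

plt.scatter(phases, y, s=5)
plt.xlabel("phase")
plt.ylabel("y")
plt.show()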
I am trying to implement the Expectation-Maximization algorithm (Gaussian Mixture Model) on a data set data = [[x, y], ...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate cluster responsibilities. But after recomputing the covariance matrices for 6-7 iterations, the covariance matrix becomes singular, i.e., the determinant of cov is 0 (a very small value), and hence it gives errors:
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
import copy
from scipy.stats import multivariate_normal as mv_norm

# E-step: compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0], cov=[[cov_m[0][0][0], cov_m[0][0][1]], [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1], cov=[[cov_m[1][0][0], cov_m[1][0][1]], [cov_m[1][1][0], cov_m[1][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2], cov=[[cov_m[2][0][0], cov_m[2][0][1]], [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach
    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance, since the expected number of clusters is 3.
The problem is that your data lies on a manifold of dimension strictly smaller than the dimension of the input data. For example, your data might lie on a circle while you have 3-dimensional data. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step, since the problem is with computing the covariance:
A simple solution: instead of doing something like cov = np.cov(X), add a regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps (see the sketch after this list).
Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
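A minimal sketch of both options in the M step (assuming X holds one sample per row; in a real GMM M step you would use the responsibility-weighted covariance, omitted here for brevity):
import numpy as np
from sklearn.covariance import LedoitWolf

def regularized_cov(X, eps=1e-6):
    # Option 1: sample covariance plus a small ridge on the diagonal,
    # which keeps the matrix positive definite even for degenerate data.
    return np.cov(X, rowvar=False) + eps * np.identity(X.shape[1])

def shrinkage_cov(X):
    # Option 2: Ledoit-Wolf shrinkage estimator from scikit-learn.
    return LedoitWolf().fit(X).covariance_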
Initially, I've passed [[3,0],[0,3]] for each cluster covariance, since the expected number of clusters is 3.
This makes no sense: the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.
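For instance, a reasonable initialization is the sample covariance of the whole data set (or simply an identity matrix), one copy per cluster. A small sketch, assuming data is the (n_samples, 2) array from the question:
import numpy as np

data_arr = np.asarray(data)   # shape (n_samples, 2)
n_clusters = 3

# One copy of the overall sample covariance per cluster (identity also works).
init_cov = [np.cov(data_arr, rowvar=False) for _ in range(n_clusters)]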
I have a logistic regression model where I explicitly set the threshold to 0.5.
model.setThreshold(0.5)
I train the model and then I want to get basic stats -- precision, recall etc.
This is what I do when I evaluate the model:
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold is: $t, Precision is: $p")
}
I get results with only 0.0 and 1.0 as values of threshold and 0.5 is completely ignored.
Here is the output of the above loop:
Threshold is: 1.0, Precision is: 0.8571428571428571
Threshold is: 0.0, Precision is: 0.3005181347150259
When I call metrics.thresholds() it also returns only two values, 0.0 and 1.0.
How do I get the precision and recall values with threshold as 0.5?
You need to clear the model threshold before you make predictions. Clearing the threshold makes your predictions return a score rather than the classified label. Otherwise you will only have two thresholds, i.e. your labels 0.0 and 1.0.
model.clearThreshold()
A tuple in predictionAndLabels should then look like (0.6753421, 1.0) rather than (1.0, 1.0).
Take a look at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala
You probably still want to set numBins to control the number of points if the input is large.
I think what happens is that all the predictions are 0.0 or 1.0. Then the intermediate threshold values make no difference.
Consider the numBins argument of BinaryClassificationMetrics:
numBins:
if greater than 0, then the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". If 0, no down-sampling will occur. This is useful because the curve contains a point for each distinct score in the input, and this could be as large as the input itself -- millions of points or more, when thousands may be entirely sufficient to summarize the curve. After down-sampling, the curves will instead be made of approximately numBins points instead. Points are made from bins of equal numbers of consecutive points. The size of each bin is floor(scoreAndLabels.count() / numBins), which means the resulting number of bins may not exactly equal numBins. The last bin in each partition may be smaller as a result, meaning there may be an extra sample at partition boundaries.
So if you don't set numBins, then precision will be calculated at all the different prediction values. In your case this seems to be just 0.0 and 1.0.
First, try adding more bins like this (here numBins is 10):
val metrics = new BinaryClassificationMetrics(probabilitiesAndLabels, 10)
If you still only have the two thresholds of 0 and 1, then check how you have defined your predictionAndLabels. You may be having this problem if you have accidentally provided (label, prediction) instead of (prediction, label).