How to understand Dedupe library? - python-dedupe

Two questions:
How should I interpret the 'confidence score' when there is a cluster with 3 rows and 3 confidence scores (0.98, 0.45, 0.45)? Where do these confidence scores come from: from the logistic regression, or somehow from the hierarchical clustering?
10,000 of my 16 million records are labeled as duplicates. Should I use all of them as training data, or will only 10 positive and 10 negative examples be enough? Which number is better for quality and execution time?

The confidence score is 1 minus the square root of the average squared distance between the record and the other records in its cluster, where the distance between a pair of records is 1 minus the predicted probability that the pair are coreferent.
See https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster for more details
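As a rough sketch of that formula (this is not dedupe's internal code; the pairwise probabilities would come from dedupe's learned pairwise classifier):

import numpy as np

def cluster_confidence(pair_probs):
    # pair_probs: predicted probabilities that this record is coreferent
    # with each of the other records in its cluster
    distances = 1.0 - np.asarray(pair_probs)        # distance = 1 - P(coreferent)
    return 1.0 - np.sqrt(np.mean(distances ** 2))   # 1 - RMS distance to the rest

print(cluster_confidence([0.98, 0.97]))   # ~0.97, a confident cluster member
print(cluster_confidence([0.50, 0.40]))   # ~0.45, a weak member (like the 0.45 scores above)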

Related

90% Confidence ellipsoid of 3 dimensional data

I learned about confidence ellipses at university (but that was some semesters ago).
In my current project, I'd like to calculate a 3-dimensional confidence ellipse/ellipsoid for which I can set the probability of success to e.g. 90%. The center of the data is shifted away from zero.
At the moment I am calculating the variance-covariance matrix of the dataset and, from it, its eigenvalues and eigenvectors, which I then represent as an ellipsoid.
Here, however, I am missing the probability of success, which I cannot specify.
What is the correct way to calculate a confidence ellipsoid with e.g. 90% probability of success?
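For what it's worth, here is a sketch of the approach described above in Python, assuming approximately Gaussian data; the usual way to attach a coverage probability such as 90% is to scale the axes by the square root of a chi-squared quantile with 3 degrees of freedom, which is stated here as an assumption to check rather than a definitive answer:

import numpy as np
from scipy.stats import chi2

X = np.random.multivariate_normal([5.0, 2.0, -1.0],
                                  [[3.0, 1.0, 0.0],
                                   [1.0, 2.0, 0.5],
                                   [0.0, 0.5, 1.0]], size=500)   # stand-in 3-D data

center = X.mean(axis=0)                    # ellipsoid center (data are shifted from zero)
cov = np.cov(X, rowvar=False)              # variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # squared axis lengths and axis directions

scale = np.sqrt(chi2.ppf(0.90, df=3))      # ~90% coverage assumed for 3-D Gaussian data
semi_axes = scale * np.sqrt(eigvals)       # semi-axis lengths of the 90% ellipsoid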

How to model normal distribution curve based on a few datapoints in Excel?

I can't believe I'm not finding a simple answer on Google for this noob question.
I have a handful of datapoints (let's say 10) on scores and their respective percentile ranks, which are normally distributed; for example, see below:
Scores   Percentile rank
846      96.5
809      91.0
729      67.8
592      27.7
...      ...
I now want to use those datapoints to calculate the percentile ranks for scores for which I don't have datapoints. E.g. what would be the percentile rank for a score of 650?
I know how to do a linear regression in Excel, but for a normally distributed dataset this obviously doesn't work.
You have

p = NORM.DIST(x, mean, sd, TRUE), i.e. NORM.S.INV(p) = (x - mean) / sd

where x is the value and p is the corresponding probability (percentile value/100),
so if you plotted NORM.S.INV(p) against x you would get a straight line with

slope = 1 / sd

and

intercept = -mean / sd

so you could estimate

sd = 1 / slope

and

mean = -intercept / slope
I simulated some data with a mean of 100 and an SD of 10, and it works reasonably well, but that is with a fair number of points spread between -3 and +3 standard deviations, so it might not be very good with just a small number of points.
The estimated mean was 99.5 and the SD 9.75.
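In Excel this amounts to applying SLOPE and INTERCEPT to NORM.S.INV of the percentile column against the scores. A minimal sketch of the same idea in Python (scipy's norm.ppf playing the role of NORM.S.INV; the four data points are the ones from the question):

import numpy as np
from scipy.stats import norm

scores = np.array([846, 809, 729, 592])        # x values from the question
pct = np.array([96.5, 91.0, 67.8, 27.7])       # their percentile ranks

z = norm.ppf(pct / 100)                        # same role as Excel's NORM.S.INV
slope, intercept = np.polyfit(scores, z, 1)    # fit z = slope * x + intercept

sd = 1 / slope                                 # estimated SD
mean = -intercept / slope                      # estimated mean

print(norm.cdf((650 - mean) / sd) * 100)       # estimated percentile rank for a score of 650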

Gaussian Mixture model log-likelihood to likelihood-Sklearn

I want to calculate the likelihoods instead of the log-likelihoods. I know that score gives the per-sample average log-likelihood and that I need to multiply score by the sample size, but the log-likelihoods are very large negative numbers, such as -38567258.1157, and when I take np.exp(scores) I get zero. Any help is appreciated.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(covariance_type="diag", n_components=2)
y_pred = gmm.fit_predict(X_test)   # fit the mixture and assign each sample to a component
scores = gmm.score(X_test)         # per-sample average log-likelihood
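To illustrate the numbers in the question (a sketch assuming the gmm and X_test above): the total log-likelihood is the per-sample average times the sample size, and exponentiating a number that large and negative underflows to 0.0 in double precision, which is exactly the zero reported above.

import numpy as np

total_loglik = gmm.score(X_test) * X_test.shape[0]   # total log-likelihood of the test set
# equivalently: gmm.score_samples(X_test).sum(), i.e. the per-sample log-likelihoods summed
print(np.exp(total_loglik))                          # underflows to 0.0 for values like -38567258.1157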

How does probability come in to play in a kNN algorithm?

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy plane graph). When you want to classify a new data point, you put it into the same feature space, find the nearest k neighbors, and see what their labels are, ultimately taking the label(s) with the most votes.
So where does probability come into play here? All I am doing is calculating distances between points and taking the label(s) of the closest neighbors.
For a new test sample you look at the K nearest neighbors and their labels.
You count how many of those K samples fall in each class and divide the counts by K.
For example, let's say that you have 2 classes in your classifier and you use K=3 nearest neighbors, and the labels of those 3 nearest samples are (0,1,1): the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
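A sketch of that vote counting with scikit-learn, whose KNeighborsClassifier exposes the same fractions through predict_proba (the toy points below are made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [0, 1], [1, 1], [5, 5], [5, 6]])   # toy 2-D feature space
y_train = np.array([0, 1, 1, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict_proba([[0.4, 0.8]]))   # nearest 3 labels are (0, 1, 1) -> [[1/3, 2/3]]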

K-means text document clustering. How to calculate intra- and inter-cluster similarity?

I am clustering thousands of documents whose vector components are calculated from tf-idf, and I use cosine similarity. I did a frequency analysis of words in the clusters to check the difference in top words, but I'm not sure how to calculate the similarity numerically for this sort of document collection.
I compute the intra-cluster similarity of a cluster as the average similarity of each document to the cluster centroid (if I averaged over pairs of documents instead, it would be based on only a small number of pairs).
The inter-cluster similarity is calculated as the average similarity over all pairs of cluster centroids.
Am I computing this right? My inter-cluster similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer-science topics in the documents. The intra-cluster values range from 0.3 to 0.7. Can the results look like that? On the Internet I found various ways of measuring and I don't know which one to use other than the one I came up with. I am quite desperate.
Thank you so much for your advice!
Using k-means with anything but squared Euclidean distance is risky. It may stop converging, because the convergence proof relies on both the mean update and the assignment step optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
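A sketch of the two measures the question describes, using cosine similarity on tf-idf vectors (scikit-learn is assumed here, since the thread already uses it; the definitions mirror the asker's own rather than a standard from the literature):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def intra_cluster_similarity(doc_vectors, centroid):
    # average cosine similarity of each document in one cluster to that cluster's centroid
    return cosine_similarity(doc_vectors, centroid.reshape(1, -1)).mean()

def inter_cluster_similarity(centroids):
    # average cosine similarity over all pairs of distinct cluster centroids
    sims = cosine_similarity(centroids)
    upper = np.triu_indices_from(sims, k=1)   # each centroid pair counted once
    return sims[upper].mean()

With scikit-learn's KMeans fitted on the tf-idf matrix, doc_vectors would be the rows assigned to one cluster and centroid the corresponding row of kmeans.cluster_centers_.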
