Nearest Neighbour in SMOTE - Azure

What is a good value for the number of nearest neighbours in SMOTE? I know that when nearest neighbours = 1 in kNN, the model is prone to overfitting. But does that apply to SMOTE too? Because SMOTE aims to create synthetic samples from the dataset, isn't a lower number of nearest neighbours better for getting data similar to your original?
I then split the data into training and test sets and train a boosted decision tree. Typically, a lower nearest-neighbour value in SMOTE gives a higher AUC. But is this due to overfitting?
I tried to find the optimal value for the nearest-neighbour parameter in Azure. However, nearest neighbour = 1 always gives the highest AUC.
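To see why k = 1 keeps synthetic samples close to the originals, here is a minimal sketch of SMOTE's interpolation step in Python. This is an illustration only, not Azure's implementation; `smote_sample` is a made-up name and the sketch ignores the majority class entirely.

```python
import numpy as np

def smote_sample(X_min, k, n_new, rng=None):
    """Generate n_new synthetic minority samples (minimal SMOTE sketch).

    X_min : (n, d) array of minority-class points; k : number of neighbours.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise squared distances within the minority class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude each point itself
    nn = np.argsort(d2, axis=1)[:, :k]           # k nearest neighbours per point
    base = rng.integers(0, n, n_new)             # pick a base point per new sample
    nbr = nn[base, rng.integers(0, k, n_new)]    # pick one of its k neighbours
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

With k = 1, every synthetic point lies on the segment to a point's single nearest neighbour, so the synthetic data hugs the original distribution; a larger k interpolates towards more distant neighbours and spreads the synthetic points more widely.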

Related

90% Confidence ellipsoid of 3-dimensional data

I learned about confidence ellipses at university (but that was some semesters ago).
In my current project, I'd like to calculate a 3-dimensional confidence ellipse/ellipsoid for which I can set the coverage probability to e.g. 90%. The centre of the data is shifted away from zero.
At the moment I am calculating the variance-covariance matrix of the dataset, and from it the eigenvalues and eigenvectors, which I then represent as an ellipsoid.
Here, however, I am missing the coverage probability, which I cannot specify.
What is the correct way to calculate a confidence ellipsoid with e.g. 90% coverage probability?
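One common approach, assuming the data are roughly multivariate Gaussian: the quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ) follows a chi-squared distribution with d degrees of freedom, so scaling the eigen-decomposition by the chi-squared quantile gives the desired coverage. A sketch (the function name is my own):

```python
import numpy as np
from scipy.stats import chi2

def confidence_ellipsoid(X, level=0.90):
    """Centre, axis directions and semi-axis lengths of the `level`
    confidence ellipsoid, assuming X (n x d) is roughly Gaussian."""
    centre = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # axis lengths^2 and directions
    # (x - mu)^T Sigma^{-1} (x - mu) ~ chi^2 with d degrees of freedom,
    # so the ellipsoid boundary sits at the chi^2 quantile of `level`.
    r2 = chi2.ppf(level, df=X.shape[1])
    semi_axes = np.sqrt(eigvals * r2)
    return centre, eigvecs, semi_axes
```

For d = 3 and 90% coverage the scaling factor is chi2.ppf(0.90, 3) ≈ 6.25, i.e. each eigenvalue is multiplied by about 6.25 before taking the square root.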

How does probability come in to play in a kNN algorithm?

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy-plane graph). When you want to classify a new data point, you put it into the same feature space, find the k nearest neighbours, and see what their labels are, ultimately taking the label with the most votes.
So where does probability come into play here? All I am doing is calculating the distance between two points and taking the label(s) of the closest neighbours.
For a new test sample you look at the K nearest neighbours and at their labels.
You count how many of those K samples fall into each class, and divide the counts by K.
For example, say you have 2 classes in your classifier and you use K = 3 nearest neighbours, and the labels of those 3 nearest samples are (0, 1, 1): the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
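That counting rule is a few lines of Python (a sketch; `knn_probabilities` is an illustrative name, and the input is the list of labels of the K nearest neighbours already found):

```python
from collections import Counter

def knn_probabilities(neighbor_labels):
    """Class probabilities from the labels of the K nearest neighbours:
    count per class, divided by K."""
    k = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return {label: c / k for label, c in counts.items()}

knn_probabilities([0, 1, 1])   # {0: 1/3, 1: 2/3}, as in the example above
```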

Lua - Finding the best match for a string

I was curious if anyone had a good method of choosing the best matching case between strings. For example, say I have a table with keys “Hi there”, “Hello”, “Hiya”, “hi”, “Hi”, and “Hey there”, and I want to find the closest match for “Hi”. It would then match “Hi” first. If that wasn't found, then “hi”, then “Hiya”, and so on: prioritizing perfect matches, then matches differing only in upper/lower case, then whichever has the fewest character differences or the smallest length difference.
My current method seems unwieldy: first checking for a perfect match, then looping around with string.match, saving any candidate with the closest string.len.
If you're not looking for a perfect match only, you need to use some metric as a measure of similarity and then look for the closest match.
As McBarby suggested in his comment, you can use the Levenshtein distance, which is the minimum number of single-character edits necessary to get from string 1 to string 2. Just research which metrics are available and which one suits your needs best. Of course you can also define your own metric.
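For illustration, here is the standard two-row dynamic-programming implementation of the Levenshtein distance (shown in Python rather than Lua, purely as a sketch of the algorithm; porting it to Lua is mechanical):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]
```

Ranking the table keys by `levenshtein(query, key)` (with an exact-match and case-insensitive check first) implements the priority order described in the question.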
https://en.wikipedia.org/wiki/String_metric lists a number of other string metrics:
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Jaro–Winkler distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunters metric (SFS)
Maximal matches
Grammar-based distance
TFIDF distance metric

How can I do performance evaluation for aspect-based opinion mining?

I have calculated a value for every aspect and identified its polarity using SentiWordNet.
For example, in "the movie is great", movie is an aspect; I identified its value using some metric (e.g. movie = 1.5677) and its polarity is positive. After this, how can I compute precision and recall?
Since you do not have a discrete classifier, the precision would be how close your calculated scores are to the truth scores (something like sum squared error or sum absolute error would work). If you had a discrete classifier you could just calculate the number of correct classifications.
The recall would be the percentage of aspects that you were able to successfully extract. So for your example you extracted the only aspect, giving you a score of 1.0. If the input was "The pizza and the movie were amazing" and you only extracted "movie", then your recall score would be 0.5.
Normally you could combine your precision and recall scores into an F-measure, but, since you do not have a discrete classifier, you probably wouldn't be able to use the F-measure.
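A hypothetical sketch of the two measures this answer describes (the function names and the choice of mean absolute error are my own):

```python
def aspect_recall(gold_aspects, extracted_aspects):
    """Fraction of annotated (gold) aspects that were actually extracted."""
    gold = set(gold_aspects)
    return len(gold & set(extracted_aspects)) / len(gold)

def score_error(truth_scores, predicted_scores):
    """Mean absolute error between truth and predicted polarity scores
    for aspects present in both: a stand-in for 'precision' when the
    output is a continuous score rather than a class label."""
    common = truth_scores.keys() & predicted_scores.keys()
    return sum(abs(truth_scores[a] - predicted_scores[a]) for a in common) / len(common)

aspect_recall({"pizza", "movie"}, {"movie"})   # 0.5, as in the example above
```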
To evaluate your models in NLP you can use precision (P) and recall (R), as well as F-score and accuracy.
1. Evaluating the aspect-extraction model:
TP (true positive): number of aspects that are annotated and correctly extracted by the algorithm
FP (false positive): number of aspects that are extracted by the algorithm but not annotated
FN (false negative): number of aspects that are annotated but not extracted by the algorithm
TN (true negative): number of aspects that are neither annotated nor extracted by the algorithm
2. Evaluating the sentiment-classification model:
TP: number of sentiment polarity scores that are calculated correctly by the algorithm
FP: number of sentiment polarity scores that are calculated incorrectly by the algorithm
FN: number of aspects that have no sentiment assigned but for which the algorithm calculated one
TN: number of aspects that have no sentiment assigned and for which the algorithm calculated none
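From those counts, precision, recall and F1 follow by the standard definitions (a minimal sketch; `prf` is an illustrative name):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from the confusion counts defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

prf(8, 2, 2)   # (0.8, 0.8, 0.8)
```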

K-means clustering of text documents: how do I calculate intra- and inter-cluster similarity?

I cluster thousands of documents whose vector components are calculated with tf-idf, and I use cosine similarity. I did a frequency analysis of words in the clusters to check the difference in top words, but I'm not sure how to evaluate the clustering numerically on this sort of document collection.
I compute the internal (intra-cluster) similarity of a cluster as the average similarity of each document to the centroid of its cluster, though the average for some clusters is based on only a small number of documents.
I compute the external (inter-cluster) similarity as the average similarity over all pairs of cluster centroids.
Am I computing this correctly? My intra-cluster similarity averages range from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the documents covering a wide range of computer-science topics; the inter-cluster values range from 0.3 to 0.7. Can the result look like that? On the internet I found various ways of measuring this and I don't know which one to use rather than my own idea. I am quite desperate.
Thank you so much for your advice!
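The two measures described in the question can be sketched as follows (illustrative names, dense NumPy vectors standing in for the tf-idf rows):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def intra_similarity(X, labels, centroids):
    """Average cosine similarity of each document to its cluster centroid."""
    sims = [cosine(x, centroids[l]) for x, l in zip(X, labels)]
    return float(np.mean(sims))

def inter_similarity(centroids):
    """Average cosine similarity over all pairs of distinct centroids."""
    k = len(centroids)
    pairs = [cosine(centroids[i], centroids[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(pairs))
```

A clustering that separates the topics well should show high intra-cluster similarity and comparatively low inter-cluster similarity.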
Using k-means with anything but squared Euclidean distance is risky. It may stop converging, as the convergence proof relies on both the mean step and the assignment step optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
