Improving accuracy of nearest neighbours algorithm - unsupervised learning problem - python-3.x

I have a situation where I am trying to find the 3 nearest neighbours for a given ID in my dataframe. I am using the NN algorithm (not KNN) to achieve this. The code below gives me the three nearest neighbours; for the top ones the results are fine, but for the middle and bottom ones only 1 of the 3 neighbours is correct, whereas I am aiming for at least 2 of 3 correct neighbours for every ID. My dataset has 47 features and 5000 points.
import numpy as np
from sklearn.neighbors import KDTree

# Build the tree once on the full feature matrix X.
kdt = KDTree(X, leaf_size=40, metric='euclidean')

def findsuccess(sso_id):
    # k=4 returns the point itself plus its 3 nearest neighbours.
    neighbors_f_sso_id = kdt.query([X[sso_id]], k=4, return_distance=False)[0]
    print('Neighbors of id', sso_id, ':', neighbors_f_sso_id)
The above code returns the ID itself plus the 3 nearest neighbours, hence k=4.
I have read that, due to the curse of dimensionality, this NN algorithm might not work well since my dataset has about 47 features, but it is the only option I can think of for a dataframe without a target variable. There is an article available here that says a KD-tree is not the best algorithm to use.
What would be the best way to achieve the maximum accuracy, i.e. achieving the minimum distances?
Do I need to perform scaling before passing the data to the KD-tree algorithm? Is there anything else I need to take care of?
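In case it helps, this is roughly the scaling step I have in mind before building the tree (just a sketch; I picked StandardScaler, but that is an assumption rather than a requirement, and the random X here is only a stand-in for my real 5000 x 47 matrix):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KDTree

X = np.random.default_rng(0).normal(size=(5000, 47))   # stand-in for my real features
X_scaled = StandardScaler().fit_transform(X)

kdt_scaled = KDTree(X_scaled, leaf_size=40, metric='euclidean')
# k=4: drop the first column, which is (usually) each point itself.
neighbours = kdt_scaled.query(X_scaled, k=4, return_distance=False)[:, 1:]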

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis on Kaggle in Python.
I am a beginner, and I'm trying to figure out whether it is still necessary to one-hot encode or LabelEncode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but that makes no sense to me. Could you confirm whether using only a StandardScaler would be enough?
When you apply StandardScaler, the columns end up with values in the same range. That helps the model keep its weights bounded, and gradient descent will not shoot off while converging, so the model converges faster.
Independently, to decide between ordinal values and one-hot encoding, consider whether the distance between column values reflects how similar or different those values actually are. If it does, choose ordinal values. If you know the hierarchy of the categories, you can assign the ordinal values manually; otherwise, use LabelEncoder. It seems the heart attack data is already given with manually assigned ordinal values, for example higher chest pain = 4.
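As a sketch only, here is one way to combine both steps with a ColumnTransformer. The split of columns below is purely illustrative (cp and thal are treated as nominal here just to show the one-hot branch; whether they should be one-hot encoded is exactly the judgement call described above):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal_cols = ['cp', 'thal']                # illustrative choice, not a recommendation
numeric_cols = ['age', 'thalach', 'oldpeak']

preprocess = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), nominal_cols),
        ('scale', StandardScaler(), numeric_cols),
    ],
    remainder='passthrough',   # binary columns such as sex and exang pass through unchanged
)

# X = pd.read_csv('heart.csv').drop(columns='target')   # hypothetical file name
# X_prepared = preprocess.fit_transform(X)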
Also, it is important to refer to notebooks that perform better. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
from sklearn import mixture

cv_types = ['spherical', 'tied', 'diag', 'full']   # the four covariance types
n_components_range = [2]                            # later 2 through 5

for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)   # data3: (59 samples, 3 features)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then I take the GMM with the lowest AIC or BIC of the four, saved as best_eitherAB (not shown). I want to see whether the label assignments of the predictions are stable across time (I want to run 1000 iterations), so I need to calculate the entropy, which requires the class-assignment probabilities. So I predict the class-assignment probabilities via the GMM's method:
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each containing 59 rows (the sample size) by 2 columns (for the 2 classes). Each inner row of two values sums to 1 to form the probability.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers: for each of my samples I get a 2-item numpy array. I could feed in just one of the 1000 iterations and get one small array of two items, or I could feed in a single column and get a single value back. But I don't know what these numbers are, and they are between 1 and 3.
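For what it's worth, here is a small sketch of how scipy.stats.entropy treats the axes (random numbers stand in for one iteration's predict_proba output; the axis keyword assumes a reasonably recent SciPy):

import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
probabilities = rng.dirichlet([1, 1], size=59)   # stand-in: 59 samples x 2 classes

# Default behaviour: each *column* is treated as one distribution over the 59
# samples, so you get back 2 numbers (one per class column).
print(entropy(probabilities))                  # shape (2,)

# Row-wise entropy: one value per sample; 0 = certain, log(2) = maximally unsure.
print(entropy(probabilities, axis=1))          # shape (59,), in nats
print(entropy(probabilities, base=2, axis=1))  # same, bounded by 1 bit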
So my questions are -- am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? If I'm not, what's the best way to find a single number entropy that tells me how good my model selection is?

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression that clustering works better with this scaling. However, this scaling "standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed in a square area, say 100x100, I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. Doesn't that deteriorate the clustering, or am I misunderstanding something?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
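To make the concern concrete, here is a small sketch with made-up data contrasting per-axis standardization with a single global scale factor; my worry is about the former:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform([0, 0], [800, 200], size=(1000, 2))   # samples spread over an 800x200 rectangle

X_per_axis = StandardScaler().fit_transform(X)   # each axis scaled separately
X_global = (X - X.mean(axis=0)) / X.std()        # one scale factor for both axes

print(X_per_axis.std(axis=0))   # ~[1, 1]: the rectangle is squeezed into a square-ish cloud
print(X_global.std(axis=0))     # unequal: the original aspect ratio is preserved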
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances, whereas a uniform, non-distorting scaling is equivalent to just using a different epsilon value.
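For example, a minimal sketch of keeping epsilon in physical units instead of normalizing, assuming latitude/longitude input and the haversine metric (the coordinates below are made up):

import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([[48.8566, 2.3522],     # three points a few hundred metres apart
                       [48.8570, 2.3530],
                       [48.8575, 2.3519],
                       [40.7128, -74.0060]])  # one far-away point
coords_rad = np.radians(coords_deg)           # haversine works on radians

earth_radius_m = 6371000.0
eps_m = 500.0                                 # "neighbours within 500 metres"

db = DBSCAN(eps=eps_m / earth_radius_m, min_samples=2,
            metric='haversine', algorithm='ball_tree').fit(coords_rad)
print(db.labels_)                             # e.g. [0, 0, 0, -1]: the far point is noise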
Note that in the first example, apparently a similarity and not a distance matrix is processed. S = 1 - (D / np.max(D)) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum dissimilarity observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, and not a distance matrix. IMHO that is an ugly hack, to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't "fit" a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and it needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance means (e.g. geographic data; standardizing it obviously does not make sense, nor does Euclidean distance).

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. Therefore I am just using feature_importances_, which works well for me.
However, I would like to know how they are calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.
There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, Olshen, and Stone, "Classification and Regression Trees", 1984.
The usual way to compute the feature importance values of a single tree is as follows:
you initialize an array feature_importances of all zeros with size n_features.
you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, entropy, MSE, ...). It is the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
It is important to note that these values are relative to a specific dataset (both the error reduction and the number of samples are dataset-specific), so they cannot be compared between different datasets.
As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
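For illustration, here is a rough Python transcription of that procedure against a fitted scikit-learn tree (a sketch only; the library's own Cython implementation is quoted in a later answer):

import numpy as np

def tree_feature_importances(fitted_tree):
    # Weighted impurity decrease summed per feature, as described above.
    t = fitted_tree.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                        # leaf node: no split, nothing to add
            continue
        importances[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right])
    importances /= t.weighted_n_node_samples[0]   # relative to the samples at the root
    return importances / importances.sum()        # normalize to sum to 1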
It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.
Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.
Edit: this description is only partially correct; Gilles's and Peter's answers give the correct explanation.
As @GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting: you randomly permute the values for each of your features one by one and see how much worse your out-of-bag performance becomes.
Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.
If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:
https://github.com/pjh2011/rf_perm_feat_import
Edit: This works for Python 2.7, not 3
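As a present-day alternative to that package, recent scikit-learn versions ship sklearn.inspection.permutation_importance; here is a sketch on a held-out split (note this permutes on test data rather than on OOB data, which is what the R implementation uses):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permute each feature in turn on the held-out data and measure the score drop.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)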
code:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = DecisionTreeClassifier()
clf.fit(X, y)
Decision tree plot: (image not shown here)
We get:
compute_feature_importances: [0., 0.01333333, 0.06405596, 0.92261071]
Check source code:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances
Try calculating the feature importances by hand:
print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))
We get feature_importance: np.array([0, 1.332, 6.418, 92.30]).
After normalizing, we get array([0., 0.01331334, 0.06414793, 0.92253873]), which is the same as clf.feature_importances_.
Be careful: all classes are assumed to have weight one.
For those looking for a reference to scikit-learn's documentation on this topic, or a reference to the answer by @GillesLouppe:
In RandomForestClassifier, the estimators_ attribute is a list of DecisionTreeClassifier (as mentioned in the documentation). To compute feature_importances_ for the RandomForestClassifier, scikit-learn's source code averages the feature_importances_ attributes of all the estimators (all the DecisionTreeClassifiers) in the ensemble.
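As a quick sanity check (a sketch; it ignores edge cases such as trees with no splits), averaging the per-tree importances by hand should reproduce the ensemble attribute:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

manual = np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0)
manual /= manual.sum()                                 # the ensemble renormalizes the mean
print(np.allclose(manual, rf.feature_importances_))    # expected: True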
In DecisionTreeClassifer's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."
Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.
[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Feature Importance in Random Forest
Random forest uses many trees, and thus, the variance is reduced
Random forest allows far more exploration of feature combinations as well
Decision trees give a variable importance, which is higher when there is a larger reduction in impurity (reduction in Gini impurity)
Each tree has a different Order of Importance
Here is what happens in the background!
We take an attribute and check all the trees in which it is present, and take the average of the change in homogeneity at the splits on this attribute. This average change in homogeneity gives us the feature importance of the attribute.

'Probability' of a K-nearest neighbor like classification

I have a small set of data points (around 10) in a 2D space, each with a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' of it belonging to any particular label class.
Is it appropriate to label the new point based on the label of its nearest neighbour (like a K-nearest neighbour with K=1)? To get the probability, I am thinking of permuting all the labels, calculating the minimum distance between the unknown point and the rest each time, and finding the fraction of cases in which the minimum distance is less than or equal to the distance that was used to label it.
Thanks
The nearest-neighbour method already uses Bayes' theorem to estimate the probability, using the points in a ball containing your chosen K points. There is no need to transform anything, since the number of points in that ball belonging to each label, divided by the total number of points in the ball, is already an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained by applying Bayes' rule to a density estimated from a subset of the data. In particular, using:
V * P(z) ≈ K/N (the probability mass of a ball of volume V around z)
P(z) ≈ K/(N*V) (from above)
P(z|label) ≈ K(label)/(N(label)*V) (where K(label) and N(label) are the number of points of that class in the ball and in the whole sample, respectively)
and
P(label) = N(label)/N.
Therefore, just pick a K, calculate the distances, count the points and by checking their labels and recounting you will have your probability.
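In scikit-learn terms, KNeighborsClassifier.predict_proba returns exactly this K(label)/K fraction; a small sketch with made-up 2D points:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6], [10, 0]])
y = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'c'])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Fraction of the 3 nearest neighbours belonging to each class.
print(knn.classes_)                     # ['a' 'b' 'c']
print(knn.predict_proba([[0.5, 0.5]]))  # e.g. [[1. 0. 0.]]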
Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Component Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2),
where d_ij is the euclidean distance between point i and j.
There are no probabilities for such a K-nearest-neighbour classification method because, like an SVM, it is a discriminative classifier. A post-processing step should be used to learn probabilities on unseen data with a model such as logistic regression:
1. Learn the K-nearest-neighbour classifier.
2. Train a logistic regression on the distance and the average distance to the K nearest neighbours of the validation data.
See the LibSVM article for details.
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances to a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p makes the hills around the centres steeper or flatter.)
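For instance, a tiny sketch of that combination with made-up distances (the result is a single score, not a calibrated probability):

import numpy as np

d = np.array([1., 5., 6., 7., 7., 8., 8., 9., 9., 10.])   # distances to the 10 centres
p = 2                                                      # larger p = steeper hills
score = 1 - (d.min() / d.mean()) ** p
print(score)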
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking on stats.stackexchange.com.
The answer is: it depends.
Imagine your labels are the surname of a person, and the X, Y coordinates represent some essential characteristics of the person's DNA sequence. Clearly, a closer DNA description increases the probability of sharing the same surname.
Now suppose X, Y are the latitude/longitude of that person's work office. Working close together isn't related to sharing a label (surname).
So, it depends on the semantics of your tags and axes.
HTH!
