Multi-feature clustering using OPTICS - NLP

While working with the OPTICS clustering algorithm, I am facing issues with outliers.
I have used the default eps and min_samples, and for 2 datasets I am getting 80 percent of the data points labelled as outliers/noise.
How can I reduce the number of outliers and capture more of the data to get optimal clusters?
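For reference, these are the main knobs scikit-learn's OPTICS exposes for this; a minimal sketch with placeholder data and parameter values to experiment with, not recommended settings:

import numpy as np
from sklearn.cluster import OPTICS

# X stands in for the real feature matrix
X = np.random.rand(1000, 10)

# Lowering min_samples and xi, and capping max_eps, changes how many
# points end up labelled -1 (noise); tune these against your own data.
optics = OPTICS(min_samples=5, xi=0.01, max_eps=2.0, metric="euclidean")
labels = optics.fit_predict(X)
print("noise fraction:", np.mean(labels == -1))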

Related

Which Algorithms are used for Drift Magnitude and Top Drifting Features by Azure in Data Drift Detection

What is the Exact Algorithm used below to derive the Drift magnitude as a percentage? And how did they get these percentages for Top Drifting Features?
This is a sample dashboard for Azure Drift Detection in Data:
Azure has specified these algorithms below for each categorical and numerical feature:
But none of them return a percentage. And mathematically the Wasserstein distance (Earth-Mover Distance) can be any number from 0 to infinity. So how do they derive a percentage out of it?
There was a mention of the Matthews correlation coefficient (MCC) used for Drift magnitude. If so how does that work exactly?
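For what it's worth, the raw Wasserstein distance is easy to compute with SciPy, and one hypothetical way to turn it into a bounded, percentage-like number is to normalize it by the spread of the baseline feature; this is only a guess at the kind of rescaling involved, not Azure's documented method:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)   # training-period feature
target = rng.normal(loc=0.5, scale=1.0, size=5000)     # scoring-period feature

raw = wasserstein_distance(baseline, target)            # unbounded, >= 0

# Hypothetical normalization: divide by the baseline standard deviation
# and cap at 1, so the result can be reported as a percentage.
drift_pct = min(raw / baseline.std(), 1.0) * 100
print(f"raw EMD = {raw:.3f}, drift ~ {drift_pct:.1f}%")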
Data drift detection here operates on a time-series dataset and relies on a distance-based methodology; those distances are in turn what the clustering works with.
When the data drift monitor is created, it runs on a compute instance cluster. Internally, machine-learning techniques such as k-nearest neighbours, LSTM, and dynamic time warping (DTW) are involved in the drift computation.
When uploading the dataset, the relevant column's properties must be set to Timestamp.
The data drift factors are measured using Min, Max, Mean, and other distance-based statistics, with a root-mean-squared methodology at the front line for taking up the distance factors in the evaluation.
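Since dynamic time warping is mentioned above, here is a minimal NumPy sketch of the classic DTW recurrence, purely as an illustration of the technique and not Azure's internal implementation:

import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three possible alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # small distance: shapes align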

How to correlate noise data of a sklearn DBSCAN result with other clusters?

I am using sklearn's DBSCAN to cluster my text data.
I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000*300.
When I passed the matrix to DBSCAN with a few possible values of eps (0.2 to 3) and min_samples (5 to 100), with the other parameters left at their defaults, I got between 200 and 10 clusters.
Analysing all of the clusters, the noise points make up approximately 75-80% of my data.
Is there any way to reduce the noise, or some other parameters (distances) I can use to reduce it?
I also checked a case where the Euclidean distance between 2 vectors is 0.6, yet they end up in different clusters; how can I manage to bring them into the same cluster?
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize the 10000 x 300 sentence-vector matrix
X_scaled = StandardScaler().fit_transform(sentence_vectors)

ep = 0.3
min_sam = 10
# Sweep min_samples from 10 to 100 in steps of 10
for itr in range(1, 11):
    dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
    clusters = dbscan.fit_predict(X_scaled)  # label -1 marks noise points
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). With an eps of at least 0.6, points at that distance can end up in the same cluster, provided the min_samples density condition is met.
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as a similarity and/or the cosine distance.
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
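As a concrete version of the cosine-distance suggestion, scikit-learn's DBSCAN accepts metric="cosine", so the sweep from the question can be rerun on the raw (unstandardized) sentence vectors; the eps values below are just starting points, and the caveat above about averaged sentence vectors still applies:

import numpy as np
from sklearn.cluster import DBSCAN

# sentence_vectors: the 10000 x 300 averaged word2vec matrix from the question
sentence_vectors = np.random.rand(10000, 300)  # placeholder for illustration

# Cosine distance ranges from 0 to 2, so eps is explored on a much smaller scale.
for eps in (0.05, 0.1, 0.2, 0.3):
    labels = DBSCAN(eps=eps, min_samples=10, metric="cosine").fit_predict(sentence_vectors)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(eps, n_clusters, "noise fraction:", np.mean(labels == -1))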

Interpreting clustering metrics

I'm doing clustering with k-means in scikit-learn on 398 samples and 306 features. The feature matrix is sparse, and the number of clusters is 4.
To improve the clustering, I tried two approaches:
After clustering, I used ExtraTreesClassifier() to classify and compute feature importances (using the cluster assignments as labels), keeping the top features.
I used PCA to reduce the feature dimension to 2.
I have computed the following metrics (SS, CH, SH)
   Method                sum_of_squares   Calinski_Harabasz   Silhouette
1  kmeans                31.682           401.3               0.879
2  kmeans+top-features   5989230.351      75863584.45         0.977
3  kmeans+PCA            890.5431893      58479.00277         0.993
My questions are:
As far as I know, a smaller sum of squares means the clustering is better, while a Silhouette close to 1 means the clustering is better. But, for instance, in the last row both the sum of squares and the Silhouette have increased compared to the first row, so the two metrics point in opposite directions.
How can I choose which approach has better performance?
Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.
To see why, simply multiply every feature by 0.5: your SSQ will drop to a quarter of its value, because squared distances scale with the square of the factor. So to "improve" your data set, you just need to scale it down to a tiny size...
These metrics must only be used on the exact same input and parameters. You can't even use sum-of-squares to compare k-means runs with different k, because the larger k will always win. All you can do is run multiple random restarts and keep the best minimum you find this way.
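A quick way to convince yourself of the scaling argument: run k-means on the same data before and after multiplying every feature by 0.5 and compare the inertia (sum of squares). This is just an illustration of the point with synthetic data, not part of the original answer:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(398, 10))

km = KMeans(n_clusters=4, n_init=10, random_state=0)
ssq_original = km.fit(X).inertia_
ssq_scaled = km.fit(0.5 * X).inertia_

# Squared distances scale with the square of the factor: 0.5**2 == 0.25
print(ssq_scaled / ssq_original)  # ~0.25, even though the clustering is no "better"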
With 306 features you are under the curse of dimensionality. Clustering in 306 dimensions is not meaningful. Therefore I wouldn't select features after clustering.
To get interpretable results, you need to reduce dimensionality. For 398 samples you need low dimension (2, 3, maybe 4). Your PCA with dimension 2 is good. You can try 3.
An approach with selecting important features before clustering may be problematic. Anyway, are 2/3/4 "best" features meaningful in your case?
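A minimal sketch of the reduce-then-cluster pipeline suggested above, assuming X stands in for the (densified) feature matrix from the question:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(398, 306)  # placeholder for the real (densified) feature matrix

# Reduce to a low dimension first (2 or 3 as suggested), then cluster there.
X_low = PCA(n_components=3, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
print(np.bincount(labels))  # cluster sizes in the reduced space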

Statistical test for samples that follow a normal distribution, with each sample having multiple measurements?

I have a set of samples (i = 1 : n), each measured for a specific metric 10 times.
The 10 measurements for sample i have a mean mu(i).
I've run DBSCAN clustering on all the mu(i) to find the outlier samples. Now I want to test whether a given outlier is statistically different from the core samples.
The samples appear to follow a normal distribution. For each sample, the 10 measurements also appear to follow a normal distribution.
If I just use mu(i) as the metric for each sample, I can easily calculate a Z-score and p-value based on the normal distribution. My question is: how do I make use of the 10 measurements per sample to add to my statistical power (is that possible)?
I'm not very good at statistics, so anything would help. Thanks in advance...
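A minimal sketch of the Z-score calculation described in the question, comparing one outlier's mean against the distribution of the core samples' means; the variable names and values are placeholders:

import numpy as np
from scipy import stats

# mu_core: per-sample means of the core (non-outlier) samples; mu_out: one outlier's mean
mu_core = np.random.normal(10.0, 1.0, size=50)   # placeholder data
mu_out = 14.2                                     # placeholder outlier mean

z = (mu_out - mu_core.mean()) / mu_core.std(ddof=1)
p = 2 * stats.norm.sf(abs(z))                     # two-sided p-value under normality
print(f"z = {z:.2f}, p = {p:.4f}")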

K-means metrics

I have read through the scikit-learn documentation and Googled to no avail. I have 2000 data sets, clustered as the picture shows. Some of the clusters, as shown, are wrong; here it is the red cluster. I need a metric or method to validate all 2000 cluster sets. Almost every metric in scikit-learn requires the ground-truth class labels, which I do not think I have, or can have for that matter.
I have the hourly traffic flow for 30 days and I am clustering it using k-means; the lines are the cluster centers. The horizontal axis is the hour, 0 to 23, and the vertical axis is the traffic flow, so the data points represent the traffic flow in each hour over the 30 days, and k=3.
What should I do? Am I even on the right track?
To my knowledge, scikit-learn has no methods for internal evaluation apart from the silhouette coefficient; we can implement the Davies-Bouldin index and the Dunn index ourselves for such problems. The article here provides good metrics for k-means:
http://www.iaeng.org/publication/IMECS2012/IMECS2012_pp471-476.pdf
Both the Silhouette coefficient and the Calinski-Harabasz index are implemented in scikit-learn nowadays and will help you evaluate your clustering results when there is no ground truth.
More details here:
http://scikit-learn.org/stable/modules/clustering.html
And here:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html#sklearn.metrics.silhouette_samples
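A small sketch of computing both scores for a k-means labelling in a recent scikit-learn; the data here is only a stand-in for the 30-day hourly traffic matrix:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.random.rand(30, 24)  # 30 days x 24 hourly flows, placeholder values

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both are internal metrics: no ground-truth labels needed.
print("silhouette:", silhouette_score(X, labels))
print("calinski-harabasz:", calinski_harabasz_score(X, labels))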
Did you look at Agglomerative Clustering, and in particular the subsection "Varying the metric":
http://scikit-learn.org/stable/modules/clustering.html#varying-the-metric
To me it seems very similar to what you are trying to do.