Improving my clustering for my dataset and changing my clusters to better reflect my data - scikit-learn

Image of clusters
So I think this is somewhat self-explanatory, but my clusters don't really make sense. I am new to clustering and just want these data points to be grouped together sensibly. The problem I keep seeing is that the black cluster spans two distinct lines, and I want those to be separate. I am plotting weather radar over the United States and want each line of storms to be its own cluster, and eventually to remove any small random groups that are not near the lines. I tried just adding more clusters and adjusting the algorithm, but nothing has been working.
Here is my code:
from sklearn.cluster import KMeans
nc = 8  # number of clusters
features = X[X.columns[1:3]]  # use the same feature columns for fitting and predicting
kmeans = KMeans(n_clusters=nc, init='k-means++', n_init=1, algorithm='full')
X['cluster_label'] = kmeans.fit_predict(features)  # compute k-means clustering and label each point
centers = kmeans.cluster_centers_  # coordinates of cluster centers
labels = X['cluster_label'].values  # labels of each point
maxcluster = X['cluster_label'].max()
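For the later goal of dropping the small stray groups, one option is to filter clusters by size after fitting. A minimal sketch, assuming X['cluster_label'] has been filled in as above (the size threshold below is an arbitrary placeholder):
counts = X['cluster_label'].value_counts()  # how many points fell into each cluster
min_size = 20  # placeholder threshold for "super small" groups
big_clusters = counts[counts >= min_size].index
X_filtered = X[X['cluster_label'].isin(big_clusters)]  # keep only points in the larger clusters
For the storm lines themselves, note that k-means favours compact, roughly round clusters, so a density-based method such as DBSCAN may follow elongated, line-shaped groups more naturally.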

Related

Getting position within K-Means cluster

I'm clustering a set of 5000 images using K-Means, and moving them to folders according to cluster number. Is there a way to leverage the clustering to order them by similarity within each directory?
One solution I thought of is calculating the distance (with annoy) between a random image and all others in a cluster/folder, and adding an incremental index to the filename so they are sorted by visual proximity.
I wonder if there's already a byproduct of the clustering I could use for this. Distance from the cluster centroid alone wouldn't work, since two samples could be at similar distances but in different directions, so I'm thinking of something like the distance to the origin in a 2D scatterplot?
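One byproduct that is readily available is KMeans.transform, which returns each sample's distance to every cluster center; sorting each folder by the distance to its own centroid gives a cheap similarity ordering. A minimal sketch, assuming kmeans is the fitted estimator, X is the image feature matrix and labels are the cluster assignments (all of these names are placeholders):
import numpy as np
dists = kmeans.transform(X)                       # shape (n_samples, n_clusters): distance to every centroid
own_dist = dists[np.arange(len(labels)), labels]  # distance of each sample to its own cluster's centroid
for k in range(kmeans.n_clusters):
    idx = np.where(labels == k)[0]
    ordered = idx[np.argsort(own_dist[idx])]      # indices in cluster k, nearest-to-centroid first
    # use `ordered` to build the incremental index in the filenames
It has exactly the limitation noted above (two samples can be equally far from the centroid in different directions), but it comes for free from the fitted model without any extra distance computations.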

How to do clustering on a set of paraboloids and on a set of planes?

I am performing cluster analysis in two parts: in part (a) I am clustering a set of paraboloids, and in part (b) a set of planes. The parts are separate, but in both I started from one set of images; on every image I detected points to which I fit (a) a paraboloid and (b) a plane. I obtained the equations of the surfaces (paraboloids and planes), so I now have two datasets: for (a), an array of arrays of size 6 (the 6 coefficients of the paraboloid equation), and for (b), an array of arrays of size 3 (the 3 coefficients of the plane equation).
I want to cluster both groups based on the similarities of (a) paraboloids and (b) planes. I am not sure which features of the surfaces (paraboloids and planes) are suitable for clustering.
For (b) I have tried using the angle between the fitted plane and the plane z = 0 -- so only 1 feature for every object in the sample.
I have also tried simply treating these 3 (or 6) coefficients as separate variables, but I believe that this way I am not using the fact that these coefficients are connected with each other.
I would be really grateful to hear if there is a better approach for which features to use beyond merely the set of coefficients. Also, I am performing hierarchical (agglomerative) clustering.
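If the angle idea for (b) is worth pursuing further, here is a minimal sketch of computing that single feature directly from the plane coefficients and feeding it to agglomerative clustering, assuming each plane was fitted as z = a*x + b*y + c and `planes` holds the (a, b, c) rows (the variable name and n_clusters are placeholders):
import numpy as np
from sklearn.cluster import AgglomerativeClustering
a, b = planes[:, 0], planes[:, 1]
# normal of the fitted plane is (a, b, -1); normal of z = 0 is (0, 0, 1)
tilt = np.arccos(1.0 / np.sqrt(a**2 + b**2 + 1.0))  # tilt relative to the horizontal, in radians
features = tilt.reshape(-1, 1)                      # one feature per plane
labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)  # n_clusters is a placeholder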

Spatstat: Export estimated cluster center coordinate points in fitted ThomasCluster model

If you fit e.g. a Thomas cluster model (using kppm, for example), it will fit a model with X number of clusters. Is there a way to extract where the center of each of the X clusters is estimated to be? E.g. if the best-fit model on a ppp with 500 points has a mean of 250 points per cluster, we would expect 2 clusters to be estimated from the data. What are the center coordinates of these two clusters?
Many thanks
kppm does not estimate the number of cluster centres or the locations of the cluster centres. It fits a clustered point process model to the point pattern, essentially by matching the K function of the model to the K function of the data. The fitted model only describes the probability distribution of the cluster centres (number and location) and the probability distribution of offspring points relative to their parents.
Estimation/prediction of the actual cluster locations is a much harder task (belonging to the class of missing data problems). You could try the R package mclust for this purpose. You can expect it to take a much longer time to compute.
The fitted model parameters obtained from kppm could be used to determine the cluster parameters in the mclust package to simplify the task.

how to correlate noise data of sklearn-DBSCAN result with other clusters?

I am using sklearn-DBSCAN to cluster my text data.
I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000×300.
When I passed this matrix to DBSCAN with a few possible values of eps (0.2 to 3) and min_samples (5 to 100), leaving the other parameters at their defaults, I got anywhere from 200 down to 10 clusters.
Across all of these runs, the noise points make up approximately 75-80% of my data.
Is there any way to reduce the noise, or other parameters (or distance measures) I could use to reduce it?
I even checked a pair of vectors whose Euclidean distance is 0.6, yet they end up in different clusters; how can I get them into the same cluster?
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
scaler = StandardScaler()
X_scaled = scaler.fit_transform(sentence_vectors)
ep = 0.3
min_sam = 10
for itr in range(1, 11):  # try min_samples from 10 up to 100
    dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
    clusters = dbscan.fit_predict(X_scaled)
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). With epsilon at 0.6 or larger they should end up in the same cluster.
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as similarity and/or cosine distance.
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
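A minimal sketch of the cosine-distance variant, assuming sentence_vectors is the 10000×300 matrix from above (the eps and min_samples values are placeholders that still need tuning):
from sklearn.preprocessing import normalize
from sklearn.cluster import DBSCAN
X_norm = normalize(sentence_vectors)                       # unit-length vectors, so cosine similarity equals the dot product
dbscan = DBSCAN(eps=0.4, min_samples=10, metric="cosine")  # eps is now a cosine-distance threshold
labels = dbscan.fit_predict(X_norm)                        # -1 marks noise points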

How to remove duplicates from a dataframe and create a new one with a weight for each sample?

I'm working on a classification problem where I know the label. I'm comparing 2 different algorithms, K-Means and DBSCAN. However, the latter has the well-known memory problem when computing the distance matrix. But if there are a lot of duplicated samples in my dataset, can I delete them, count their occurrences, and then use that count as a weight in the algorithm? All of this to save memory.
I do not know how to do it. This is my code:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
df = dimensionality_reduction(dataframe=df_balanced_train)
train = np.array(df.iloc[:, 1:])
### DBSCAN
# DBSCAN does not produce centroids
y_dbscan, centroidi = Cluster(data=train, algo="DBSCAN")
err, colori = error_Cluster(y_dbscan, df)
# These are the relevant parts of the Cluster() function
# (the commented lines estimate a reasonable eps from nearest-neighbour distances):
# nbrs = NearestNeighbors(n_neighbors=1500).fit(data)
# distances, indices = nbrs.kneighbors(data)
# print("The mean distance is about: " + str(np.mean(distances)))
# np.median(distances)
dbscan = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean", n_jobs=1)
y_result = dbscan.fit_predict(data)
centroidi = "In DBSCAN there are no centroids"
For a sample of 30k elements everything is OK, but for 800k I always have problems with the memory. Could deleting duplicates and counting their occurrences solve my problem?
DBSCAN should take only O(n) memory - just as k-means.
But apparently the sklearn implementation does a version that first computes all neighbors, and thus uses O(n²) memory, and hence is less scalable. I'd consider this a bug in sklearn, but apparently they are well aware of this limitation, which seems to be mostly a problem when you choose bad parameters. To guarantee O(n) memory it may be enough to just implement the standard DBSCAN yourself.
Merging duplicates is certainly an option, but A) that usually means you are using data that is inappropriate for these algorithms or for this distance, and B) you'll also need to implement the algorithms yourself to add support for weights, because in DBSCAN you need to use weight sums instead of point counts etc.
Last but not least: if you have labels and a classification problem, these seem to be the wrong choice. They are clustering, not classification. Their job is not to recreate the labels you have, but to find new labels from the data.
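For the deduplication step itself, a minimal pandas sketch, assuming the rows that should be merged are exact duplicates across the feature columns of df; note that, depending on the scikit-learn version, DBSCAN's fit_predict accepts a sample_weight argument, which would let the counts stand in for the removed rows:
from sklearn.cluster import DBSCAN
features = df.iloc[:, 1:]  # same feature columns used for `train` above
dedup = features.groupby(list(features.columns)).size().reset_index(name="weight")  # collapse identical rows, count occurrences
train_dedup = dedup.drop(columns="weight").to_numpy()
weights = dedup["weight"].to_numpy()
# a row with weight w counts as w points towards min_samples
labels = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean").fit_predict(train_dedup, sample_weight=weights)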
