Spatstat: Export estimated cluster center coordinate points in fitted ThomasCluster model - spatstat

If you fit, for example, a Thomas cluster model (using kppm), it fits a model with some number X of clusters. Is there a way to extract where the center of each of the X clusters is estimated to be? E.g. if the best-fitting model on a ppp with 500 points has a mean cluster size of 250 points, we would expect 2 clusters to be estimated from the data. What are the center coordinates of these two clusters?
Many thanks

kppm does not estimate the number of cluster centres or the locations of the cluster centres. It fits a clustered point process model to the point pattern, essentially by matching the K function of the model to the K function of the data. The fitted model only describes the probability distribution of the cluster centres (number and location) and the probability distribution of offspring points relative to their parents.
Estimation/prediction of the actual cluster locations is a much harder task (belonging to the class of missing data problems). You could try the R package mclust for this purpose. You can expect it to take a much longer time to compute.
The fitted model parameters obtained from kppm could be used to determine the cluster parameters in the mclust package to simplify the task.
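mclust is an R package that fits Gaussian mixture models. As a rough illustration of the same idea in Python (this is not spatstat's method, and the data below are synthetic, not a fitted Thomas process), scikit-learn's GaussianMixture recovers cluster centers from a point pattern:

```python
# Sketch: estimating cluster centers with a Gaussian mixture model,
# analogous to what mclust does in R. The data here are synthetic:
# two "clusters" of offspring points around parents at (0, 0) and (5, 5).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(250, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(250, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(pts)
print(gmm.means_)  # estimated cluster centers, close to (0, 0) and (5, 5)
```

In practice the fitted kppm parameters (e.g. the parent intensity) could suggest the number of mixture components to try.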


Which Algorithms are used for Drift Magnitude and Top Drifting Features by Azure in Data Drift Detection

What is the Exact Algorithm used below to derive the Drift magnitude as a percentage? And how did they get these percentages for Top Drifting Features?
This is a sample dashboard for Azure Drift Detection in Data:
Azure has specified these algorithms below for each categorical and numerical feature:
But none of them return a percentage. And mathematically the Wasserstein distance (Earth-Mover Distance) can be any number from 0 to infinity. So how do they derive a percentage out of it?
There was a mention of the Matthews correlation coefficient (MCC) used for Drift magnitude. If so how does that work exactly?
The data drift detection works on time-series datasets and uses a distance-based methodology, much like the distance measures used in clustering: the clusters are formed from patterns in the distances.
When creating the data drift monitor, it runs on a compute cluster. Internally, the drift computation uses models such as K-nearest neighbours, LSTMs, and dynamic time warping (DTW).
When uploading the dataset, the relevant column's property must be set to Timestamp.
The drift factors are measured using Min, Max, Mean, and other distance statistics; a root-mean-squared methodology is the front-line measure used to evaluate the distance factors.
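For reference, the base metric the question asks about, the Wasserstein (earth-mover) distance, can be computed with SciPy. Note this says nothing about how Azure normalizes it into a percentage; that step is not pinned down by the answer above:

```python
# Wasserstein (earth-mover) distance between two 1-D samples.
# This is only the raw metric mentioned in the question; any normalization
# to a percentage is Azure-internal and not reproduced here.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)
current = rng.normal(loc=0.5, scale=1.0, size=10_000)

d = wasserstein_distance(baseline, current)
print(d)  # roughly 0.5: the two distributions differ by a shift of 0.5
```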

how to correlate noise data of sklearn-DBSCAN result with other clusters?

I am using sklearn's DBSCAN to cluster my text data.
I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000x300.
When I passed this matrix to DBSCAN with a few possible values of eps (0.2 to 3) and min_samples (5 to 100), leaving the other parameters at their defaults, I got between 200 and 10 clusters.
Across all these clusterings, noise points make up approximately 75-80% of my data.
Is there any way to reduce the noise, or some other parameters (distances) that would reduce it?
I also found a case where the Euclidean distance between two vectors is 0.6, yet they end up in different clusters; how can I bring them into the same cluster?
# Assumes sentence_vectors is the 10000x300 matrix described above
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

scaler = StandardScaler()
X_scaled = scaler.fit_transform(sentence_vectors)
ep = 0.3
min_sam = 10
for itr in range(1, 11):
    dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
    clusters = dbscan.fit_predict(X_scaled)
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). At 0.6 they should be in the same cluster.
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as similarity and/or cosine distance.
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
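The cosine-distance suggestion above can be sketched with scikit-learn, which accepts metric='cosine' directly (the vectors below are synthetic stand-ins for averaged word2vec embeddings):

```python
# DBSCAN with cosine distance instead of Euclidean, as suggested above.
# Synthetic data: two bundles of directions, one near e1, one near e2.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
a = rng.normal(0, 0.01, size=(50, 300)); a[:, 0] += 1.0
b = rng.normal(0, 0.01, size=(50, 300)); b[:, 1] += 1.0
X = np.vstack([a, b])

# eps is now a cosine-distance threshold (0 = same direction, 1 = orthogonal)
labels = DBSCAN(eps=0.1, min_samples=5, metric="cosine").fit_predict(X)
print(sorted(set(labels.tolist())))  # → [0, 1]: two clean clusters, no noise
```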

Clustering of facebook-users with k-means

I got a Facebook list of user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user connections (edges). So, for instance, user 0 is connected to users 1, 2, 3 and so on.
Now my task is to find clusters in the dataset.
In the first step I used node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for node.js to cluster the data:
Cluster-Plugin
But I don't know if the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov clustering implementation for node.js. The Markov Plugin, however, needs an adjacency matrix to build clusters. I implemented an algorithm in Java to save the matrix to a file.
Maybe you have another suggestion for how I could get clusters out of edges.
K-means assumes your input data lies in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need
One row per datapoint (not an edge list)
A fixed dimensionality data space where the mean is a useful center (usually, you should have continuous attributes, on binary data the mean does not make too much sense) and where least-squares is a meaningful optimization criterion (again, on binary data, least-squares does not have a strong theoretical support)
On your Facebook data, you could try some embedding, but I'd have doubts about its trustworthiness.
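One way to get clusters of users rather than edges is to build the adjacency matrix from the edge list and use a graph-aware method such as spectral clustering. A sketch with scikit-learn, on a tiny made-up edge list (not the Stanford data):

```python
# Turn an edge list into an adjacency matrix and cluster the *users*,
# not the edges. The toy edge list stands in for facebook_combined.
import numpy as np
from sklearn.cluster import SpectralClustering

edges = [(0, 1), (0, 2), (1, 2),   # one tightly knit group
         (3, 4), (3, 5), (4, 5),   # another group
         (2, 3)]                   # a single bridge edge between them
n = 6
adj = np.zeros((n, n))
for a, b in edges:
    adj[a, b] = adj[b, a] = 1

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(adj)
print(labels)  # one label per user; users 0-2 and 3-5 end up in different groups
```

For the full Stanford graph a sparse adjacency matrix would be needed, but the shape of the solution is the same: one row/column per user, one label per user.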

how to calculate distance between any two elements in more than 10^8 data to Clustering them using spark?

I have more than 10^8 records stored in Elasticsearch. Now I want to cluster them by writing a hierarchical algorithm or using PIC from Spark MLlib.
However, I can't use an efficient algorithm like k-means, because every record is stored in the form
{mainID:[subId1,subId2,subId3,...]}
which obviously is not in Euclidean space.
I need to calculate the distance of every pair of records, which I guess will take a very long time (10^8 * 10^8 pairs). I know Spark's cartesian product can do such computation, but it produces duplicated pairs like (mainID1, mainID2) and (mainID2, mainID1), which is not suitable for PIC.
Does anyone know a better way to cluster these records? Or any method to remove the duplicates from the resulting RDD of the cartesian product?
Thanks A lot!
First of all, don't take the full Cartesian product:
select where a.MainID > b.MainID
This doesn't reduce the complexity, but it does save about 2x in generation time.
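The deduplication predicate above can be illustrated in plain Python; in Spark, the same condition would go into a filter applied after the cartesian product (this is a sketch of the idea, not PySpark code):

```python
# Keeping only one of (a, b) / (b, a): the predicate a < b halves the pairs
# and also drops self-pairs. In Spark, the same condition becomes a filter
# on the cartesian RDD.
from itertools import product

ids = ["m1", "m2", "m3", "m4"]

all_pairs = [(a, b) for a, b in product(ids, ids) if a != b]
deduped   = [(a, b) for a, b in product(ids, ids) if a < b]

print(len(all_pairs), len(deduped))  # → 12 6: mirrored duplicates removed
```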
That said, consider your data "shape" and select the clustering algorithm accordingly. K-means, HC, and PIC have three different applications. You know K-means already, I'm sure.
PIC basically finds gaps in the distribution of distances. It's great for well-defined sets (clear boundaries), even when those curl around each other or nest. However, if you have a tendril of connecting points (like a dumbbell with a long, thin bar), PIC will not separate the obvious clusters.
HC is great for such sets, and is a good algorithm in general. Most HC algorithms have an "understanding" of density, and tend to give clusterings that fit human cognition's interpretation. However, HC tends to be slow.
I strongly suggest that you consider a "seeded" algorithm: pick a random subset of your points, perhaps
sqrt(size) * dim
points, where size is the quantity of points (10^8) and dim is the number of dimensions. For instance, your example has 5 dimensions, so take 5*10^4 randomly selected points. Run the first iterations on those alone, which will identify centroids (K-means), eigenvectors (PIC), or initial hierarchy (HC). With those "seeded" values, you can now characterize each of the candidate clusters with 2-3 parameters. Classifying the remaining 10^8 - 5*10^4 points against 3 parameters is a lot faster, being O(size) time instead of O(size^2).
Does that get you moving toward something useful?
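The seeding strategy above can be sketched with scikit-learn's KMeans (synthetic numeric data here; the ID-list records in the question would first need an embedding into a vector space):

```python
# "Seeded" clustering: fit on a small random subset, then classify the rest.
# The data are synthetic vectors standing in for embedded records.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
size, dim = 100_000, 5
X = np.vstack([rng.normal(c, 0.5, size=(size // 2, dim)) for c in (0.0, 4.0)])

n_seed = int(np.sqrt(size)) * dim                 # sqrt(size) * dim seed points
seed = X[rng.choice(size, n_seed, replace=False)]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(seed)  # expensive step on seeds only
labels = km.predict(X)                            # O(size) classification of the rest
print(km.cluster_centers_.round(1))
```

The expensive fitting touches only ~1,600 points; assigning the remaining records is a single linear pass.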

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression clustering works better with this scaling. However, this scaling "Standardizes features by removing the mean and scaling to unit variance". I try to find 2d clusters. If my clusters are distributed in a square area, let's say 100x100, I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. This deteriorates the clustering, doesn't it? Or am I understanding something wrong?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
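The geographic case above can be sketched with scikit-learn's haversine metric, which expects (lat, lon) in radians, so an epsilon in meters is converted by dividing by the Earth's radius (a sketch with made-up coordinates):

```python
# DBSCAN on geographic points with the threshold expressed in meters.
# metric="haversine" expects (lat, lon) in *radians*; eps is converted
# from meters to radians via the Earth's radius.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000
coords_deg = np.array([
    [52.5200, 13.4050], [52.5201, 13.4052], [52.5199, 13.4049],  # Berlin
    [48.8566, 2.3522],  [48.8567, 2.3524],  [48.8565, 2.3521],   # Paris
])
coords_rad = np.radians(coords_deg)

eps_m = 500  # neighborhood radius: 500 meters, no normalization involved
db = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=3,
            metric="haversine", algorithm="ball_tree").fit(coords_rad)
print(db.labels_)  # Berlin points form one cluster, Paris points another
```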
And yes, in particular a non-uniform scaling does distort distances. While a non-distorting scaling is equivalent to just using a different epsilon value!
Note that in the first example, apparently a similarity and not a distance matrix is processed. S = 1 - (D / np.max(D)) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum dissimilarity observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, and not a distance matrix. IMHO that is an ugly hack, to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't fit a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance is (e.g. geographic data; standardizing it obviously does not make sense, nor does Euclidean distance!)
