How can I extrapolate a clustering of a 10,000-row sample to the entire population (100,000 rows) in Python? I ran agglomerative clustering on the sample in Python, but I am stuck on extrapolating the result to the entire population.
There is no general rule.
For hierarchical clustering, this very much depends on your linkage, and the clustering of a different sample or the whole population may be very different. (For a starter, try a different sample and compare!)
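One way to "try a different sample and compare" is to cluster two overlapping samples and measure how well the labels agree on the shared points. The sketch below is only an illustration: the synthetic blobs, the sample sizes, Ward linkage, and the use of the adjusted Rand index are all assumptions, not part of the original question.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the population: three well-separated 2D blobs.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
population = np.vstack(
    [c + rng.normal(scale=1.0, size=(1000, 2)) for c in centers]
)

# Two samples of 400 rows each that share 200 rows.
order = rng.permutation(len(population))
idx_a, idx_b = order[:400], order[200:600]
shared = order[200:400].tolist()


def cluster(idx):
    """Cluster the rows at `idx` and return {row index: cluster label}."""
    model = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = model.fit_predict(population[idx])
    return dict(zip(idx.tolist(), labels.tolist()))


labels_a = cluster(idx_a)
labels_b = cluster(idx_b)

# The adjusted Rand index ignores how the clusters are numbered,
# so it can compare two independent clustering runs.
ari = adjusted_rand_score(
    [labels_a[i] for i in shared],
    [labels_b[i] for i in shared],
)
print(f"Agreement on shared rows (ARI): {ari:.3f}")
```

An index close to 1 suggests the two samples produce the same grouping; a low value is a warning that the result depends on the sample or the linkage.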
Generalizing a clustering result to new data usually contradicts the very assumptions made for the clustering. This is not classification, but exploratory data analysis.
However, if you have found good clustering results, and you have verified them to be desirable, then you can train a classifier on the cluster labels to predict the cluster label of new data.
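A minimal sketch of that last idea, assuming scikit-learn is available: cluster the sample, then fit a classifier on the resulting cluster labels and use it to label every row of the population. The synthetic data, the sample size, and the choice of a k-nearest-neighbours classifier are placeholders for illustration; any classifier that suits your data could be used instead.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the population: three well-separated 2D blobs.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
population = np.vstack(
    [c + rng.normal(scale=1.0, size=(1000, 2)) for c in centers]
)

# Draw a sample and run agglomerative clustering on the sample only.
sample_idx = rng.choice(len(population), size=300, replace=False)
sample = population[sample_idx]
sample_labels = AgglomerativeClustering(
    n_clusters=3, linkage="ward"
).fit_predict(sample)

# Treat the cluster labels as class labels and train a classifier on them.
clf = KNeighborsClassifier(n_neighbors=5).fit(sample, sample_labels)

# Assign every row of the population to one of the sample's clusters.
population_labels = clf.predict(population)
print(np.bincount(population_labels))
```

This only carries the sample's clusters over to the rest of the data. It does not show that clustering the full population would give the same clusters.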
Related
I am trying to perform clustering on the Market-1501 dataset. The approach that I am using is as follows:
I train a Person-ReID model (using this repository: Reid-Strong-Baseline).
Use a version of depth first search for clustering data (not part of the training set) into individual classes.
Although the Rank-1, Rank-5 metrics of the ReID model are very good, the overall effect of clustering is rather disappointing. I am also struggling to find relevant literature that could help me.
Does anyone have any pointers on where I could at least find relevant literature (i.e., Person-ReID followed by clustering)?
Thanks in advance.
I am building a k-means model and have multiple variables to feed into it. Because of this, I am using PCA to transform the data to two dimensions. When I display the PCA biplot, I don't understand what the points have in common that causes them to be grouped into a specific cluster. I am using a customer segmentation dataset. For example, I want to be able to say that a specific cluster exists because its customers have a low income but spend a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place if it harms your analysis... k-means is as good as PCA at coping with several dozen variables.
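The first suggestion, comparing per-cluster means on the original variables, could look roughly like this. The feature names and the synthetic customer data are hypothetical, and the scaling step before PCA is an assumption about the pipeline rather than something stated in the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical customer features, used for illustration only.
feature_names = ["income", "spending", "age"]
low_income_big_spenders = rng.normal([20.0, 80.0, 30.0], 3.0, size=(100, 3))
high_income_savers = rng.normal([90.0, 20.0, 50.0], 3.0, size=(100, 3))
X = np.vstack([low_income_big_spenders, high_income_savers])

# Cluster in the 2D PCA space, as described in the question.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

# Interpret the clusters on the ORIGINAL variables, not on the components.
cluster_means = {
    int(k): X[labels == k].mean(axis=0) for k in np.unique(labels)
}
for k, means in cluster_means.items():
    summary = {
        name: round(float(value), 1)
        for name, value in zip(feature_names, means)
    }
    print(f"cluster {k}: {summary}")
```

Reading the means side by side shows which original attributes distinguish the clusters, for example one cluster with low income and high spending.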
I have a set of 2000 points which are basically x,y coordinates of pass origins from association football. I want to run a k-means clustering algorithm on it to just classify it to get which 10 passes are the most common (k=10). However, I don't want to predict any points for future values. I simply want to work with the existing data. Do I still need to split it into testing-training sets? I assume they're only done when we want to train the model on a particular set to calculate for future values (?)
I'm new to clustering (and Python as a whole) so any help would be appreciated.
No, in clustering (i.e., unsupervised learning) you do not need to split the data.
I disagree with the answer. Clustering has accuracy as a metric. If you do not split the data into train and test sets, then you will most likely overfit the model. See these similar questions: 1, 2, 3. Please note that splitting data into train/test sets is unrelated to whether the problem is supervised or unsupervised.
Can anyone help me find a dataset that has scores as attribute values and also has class labels (ground truth for cluster validation)? I want to find the probability of each data item and in turn use it for clustering.
The preferred attribute values are scores such as user survey scores (1 = bad, 2 = satisfactory, 3 = good, 4 = very good) for each of the attributes. I prefer score values (say 1, 2, 3, 4) as attribute values because it is easy to calculate the probability of each attribute value from them.
I found some datasets from UCI Repository but not all attribute values were score values.
Most (if not all) clustering algorithms are density based.
There is plenty of survey literature on clustering algorithms that you need to check. There are literally hundreds of density-based algorithms, including DBSCAN, OPTICS, DENCLUE, ...
However, I have the impression you are using the term "density based" differently from the literature. You seem to refer to probability, not density?
Do not expect clustering to give class labels. Classes are not clusters. Classes can be inseparable, or a single class may consist of multiple clusters. The famous iris data set, for example, intuitively consists of only 2 clusters (but 3 classes).
For evaluation and all that, check existing questions and answers, please.
Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Using PCA on the Iris dataset (with the data in the CSV ordered such that all of the first class are listed, then the second, then the third) yields the following plot:
It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:
Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real-world case), how would one identify the number of classes? Would a clustering algorithm such as K-Means be effective?
Would there be inaccuracies due to the discarding of lower-order principal components?
EDIT: To be clear, I am asking if a dataset can be clustered after running PCA, and if so, what the most accurate method would be.
Say we have a dataset of a large dimension, which we have reduced to a lower
dimension using PCA, would it be wise/accurate to then use a clustering
algorithm on said data? Assuming that we do not know how many clusters to
expect.
Your data might well separate in a low-variance dimension. I would not recommend running PCA prior to clustering.
Above, it is not clear how many clusters/classes are contained in the data
set. In this case (the more real world case), how would one identify the
number of classes, would a clustering algorithm such as K-Means be effective?
There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
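As a rough illustration, both algorithms are available in scikit-learn and neither takes the number of clusters as an argument. The synthetic data and the values of `eps`, `min_samples`, and `bandwidth` below are assumptions picked to suit this toy example; on real data they need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift

rng = np.random.default_rng(0)

# Synthetic data: three compact 2D blobs (illustration only).
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# DBSCAN: the number of clusters follows from the density parameters.
# Points labelled -1 are treated as noise and are not counted as a cluster.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan = len(set(db_labels.tolist()) - {-1})

# Mean Shift: the number of clusters follows from the kernel bandwidth.
ms_labels = MeanShift(bandwidth=2.0).fit_predict(X)
n_meanshift = len(set(ms_labels.tolist()))

print(f"DBSCAN found {n_dbscan} clusters")
print(f"Mean Shift found {n_meanshift} clusters")
```

Note that the number of clusters is not chosen directly here, but it still depends on the density or bandwidth parameters, so some prior knowledge about the scale of the data is needed.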
Try sorting the dataset after PCA, then plotting it.
The iris data set is much too simple to draw any valid conclusions about the behaviour of high-dimensional data and the benefits of PCA.
Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.