I'm using clustering to identify groups in a dataset. I then plan to use the groups I find as the grouping variable in a Bayesian logistic multilevel model.
My question is whether using some of the variables from the clustering step as predictors when fitting the model will create problems. Do I have to use different variables in the model than the ones I used to cluster the groups?
I plan on using a data set that contains 3 targets of interest. Ultimately I will be trying classification methods on a binary target and regression methods on two separate continuous targets.
Is it bad practice to do a different train/test split for each target variable?
Otherwise, I am not sure how to split the data in a way that will allow me to predict each target separately.
If they're effectively 3 different models trained and evaluated separately, then for the purpose of scientifically evaluating each model's performance it doesn't matter whether you use a different train/test split for each model, since no information leaks from model to model. But if you plan on comparing the results of the 3 models, or on combining all 3 scores into some aggregate metric, then you should use the same train/test split so that all 3 models work from the same training data. Otherwise the performance of each model will likely depend to some extent on the test data of the other models, and your combined score will therefore be, to some extent, a function of your test data.
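If you do want a single shared split, a minimal sketch (the variable names and shapes here are hypothetical, standing in for your actual feature matrix and three targets) is to split the row indices once and reuse them for every target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: a feature matrix X and three targets -- one binary
# (y_class) and two continuous (y_reg1, y_reg2) -- all aligned row-wise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y_class = rng.integers(0, 2, size=500)
y_reg1 = rng.normal(size=500)
y_reg2 = rng.normal(size=500)

# Split once on row indices so all three models see exactly the same rows.
train_idx, test_idx = train_test_split(
    np.arange(len(X)), test_size=0.2, random_state=42
)

X_train, X_test = X[train_idx], X[test_idx]
# Reuse the same index split for every target.
yc_train, yc_test = y_class[train_idx], y_class[test_idx]
y1_train, y1_test = y_reg1[train_idx], y_reg1[test_idx]
y2_train, y2_test = y_reg2[train_idx], y_reg2[test_idx]
```

Splitting indices rather than the arrays themselves makes it trivial to apply the identical split to any number of targets.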
I have a set of 2000 points which are x, y coordinates of pass origins from association football. I want to run a k-means clustering algorithm on them to find the 10 most common pass-origin locations (k = 10). However, I don't want to predict anything for future values; I simply want to work with the existing data. Do I still need to split it into training and test sets? I assume that's only done when we want to train a model on one set so it can predict future values(?)
I'm new to clustering (and Python as a whole), so any help would be appreciated.
No, in clustering (i.e. unsupervised learning) you do not need to split the data.
I disagree with this answer. Clustering has accuracy as a metric. If you do not split the data into train and test sets, then most likely you'll be overfitting the model. See these similar questions: 1, 2, 3. Please note that splitting data into train/test sets is unrelated to whether the problem is supervised or unsupervised.
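For what it's worth, a minimal sketch of the no-split approach from the answer, assuming the 2000 pass origins are in a hypothetical (2000, 2) NumPy array coords:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical array of 2000 pass origins as (x, y) pitch coordinates,
# scaled to assumed pitch dimensions of 105 x 68 metres.
rng = np.random.default_rng(0)
coords = rng.random((2000, 2)) * [105, 68]

# Fit on all points: the goal is to describe the existing passes,
# not to predict future ones, so no train/test split is made here.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(coords)

# The 10 cluster centres summarise the most common pass-origin zones,
# and labels_ assigns each pass to one of those zones.
print(km.cluster_centers_)
counts = np.bincount(km.labels_)  # number of passes per zone
print(counts)
```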
Maybe it is obvious, but I would like to be sure of what I am doing:
I understand that Group K-fold, as implemented in sklearn, is a variation of k-fold cross-validation which ensures that data belonging to the same group will not appear in both the train and test sets at the same time.
That is what I need as well. However, before I discovered the aforementioned implementation of group k-fold, while trying to compute a validation curve for a problem, I noticed the following parameter (the highlighted one):
validation_curve(estimator, X, y, param_name, param_range, groups=None, cv=None...)
According to the documentation, if I provide an array of size [n_samples] containing the group labels for the corresponding samples, then the train/test splitting will be done according to those labels.
And here comes the question: since such a convenient parameter is provided, why, according to my searches, does everyone in need of group k-fold validation first use sklearn.model_selection.GroupKFold?
Am I missing something here?
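For context, a small sketch of how the two are usually combined. As far as I can tell, the cross-validation helpers simply forward groups to the splitter given in cv, and the default splitter (a plain (Stratified)KFold) ignores it, which is presumably why examples pass a GroupKFold object explicitly. All data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, validation_curve

# Synthetic data: 100 samples, 5 features, a binary target, 10 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
groups = rng.integers(0, 10, size=100)

# groups is forwarded to the splitter passed via cv; a group-aware
# splitter such as GroupKFold is what actually keeps each group on
# one side of the split.
train_scores, valid_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    param_name="C",
    param_range=[0.01, 0.1, 1.0, 10.0],
    groups=groups,
    cv=GroupKFold(n_splits=5),
)
```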
I am currently a student developing a neural-network project to classify a dataset of images. Since these images are not labeled, I need an unsupervised learning method.
It has been suggested to me that I should use autoencoders. Is it possible to use an autoencoder to 'discover' important features and then feed the features learned in the hidden layer into a multilayer perceptron, for instance, so I can classify the images?
Thank you all for your help.
Classification is inherently a supervised problem. To do this, you would need to have labeled images that the classifier can learn to predict. Your problem sounds like clustering. Here, you'd assign images to discrete categories (clusters) based on some notion of similarity; images assigned to the same cluster are more similar to each other than those assigned to different clusters. Many clustering algorithms are available. If you wanted, you could perform clustering on the hidden layer representations of an autoencoder. You could think of this as clustering the images after mapping them nonlinearly into a feature space.
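As a rough sketch of that last idea, here is a toy version using scikit-learn's MLPRegressor as a stand-in autoencoder (a deep-learning framework would be the usual choice for real images; X here is a hypothetical array of flattened images):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

# Hypothetical data: 500 flattened 8x8 images (64 pixels each).
rng = np.random.default_rng(0)
X = rng.random((500, 64))

# A one-hidden-layer autoencoder: train the network to reproduce
# its own input through a narrow 16-unit bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=200, random_state=0)
ae.fit(X, X)

# Manually compute the hidden-layer representation (the learned code)
# by applying the first layer's weights, bias, and ReLU activation.
hidden = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Cluster the images in the learned feature space.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(hidden)
```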
Can anyone help me find a dataset that has scores as attribute values and class labels (ground truth for cluster validation)? I want to find the probability of each data item and in turn use it for clustering.
The preferred attribute values are scores like user-survey ratings (1 = bad, 2 = satisfactory, 3 = good, 4 = very good) for each attribute. I prefer score values (say 1, 2, 3, 4) as attribute values because it is easy to calculate the probability of each attribute value from them.
I found some datasets in the UCI Repository, but not all of their attribute values were scores.
Most (if not all) clustering algorithms are density based.
There is plenty of survey literature on clustering algorithms that you should check. There are literally hundreds of density-based algorithms, including DBSCAN, OPTICS, DENCLUE, ...
However, I have the impression you are using the term "density based" differently than the literature does. You seem to be referring to probability, not density?
Do not expect clustering to give you class labels. Classes are not clusters. Classes can be inseparable, or a single class may consist of multiple clusters. The famous iris data set, for example, intuitively consists of only 2 clusters (but 3 classes).
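A quick sketch illustrating the iris point, using scikit-learn's built-in copy of the data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# With k=2, k-means typically isolates setosa in one cluster and merges
# versicolor and virginica, which overlap heavily, into the other.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster in (0, 1):
    # Count how many samples of each true class land in this cluster.
    per_class = [int(((labels == cluster) & (y == c)).sum()) for c in (0, 1, 2)]
    print(f"cluster {cluster}: class counts = {per_class}")
```

Two clusters recover the setosa/non-setosa structure cleanly, but no clustering of these features will reliably reproduce all three class labels.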
For evaluation and all that, check existing questions and answers, please.