Different results after repeating TSNE after KMeans clustering - scikit-learn

I'm using sklearn.manifold.TSNE to project a dataset onto 2-dimensional space. I've separately clustered the same dataset using sklearn.cluster.KMeans. My code is the following:
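import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

clustering = KMeans(n_clusters=5, random_state=5)
clustering.fit(X)
tsne = TSNE(n_components=2)
result = tsne.fit_transform(X)
sc = plt.scatter(x=result[:, 0], y=result[:, 1],
                 s=10, c=clustering.labels_)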
What puzzles me is that, when I repeat the process, my data seem to get clustered in totally different ways each time, as you can see below:
I'm not an expert in clustering or dimensionality reduction techniques, so I guess it might be partly due to the stochastic nature of TSNE. Might it also be that I'm using too many features (132) to perform the clustering?

Did you try setting the random_state parameter in TSNE? It should probably fix it.
Functions that use randomness at some point generally have an input parameter to ensure that the same inputs generate the same outputs. This argument is generally called random_state or seed.
Hope this helps.
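To illustrate the suggestion above, here is a minimal sketch that runs TSNE twice with the same random_state and checks that the two embeddings match. The synthetic data from make_blobs, the perplexity value, the seed, and the helper name embed are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the asker's data (assumption: 60 samples, 10 features).
X, _ = make_blobs(n_samples=60, n_features=10, centers=3, random_state=0)

def embed(seed):
    # Hypothetical helper: perplexity must be smaller than the number of samples.
    return TSNE(n_components=2, perplexity=10, random_state=seed).fit_transform(X)

first = embed(42)
second = embed(42)

# Same seed and same input: the two embeddings should coincide.
same_seed_matches = np.allclose(first, second)
print(same_seed_matches)
```

Fixing the seed only makes the picture repeatable. The KMeans labels used for coloring are computed in the original feature space, so clusters may still look interleaved in the 2-D projection.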

Related

GridSearchCV - shuffle data for every single parameter combination

I am using GridSearchCV to determine model hyper-parameters:
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[(self.FE, FE_algorithm), (self.CA, Class_algorithm)])
param_grid = {**FE_grid, **CA_grid}
scorer = make_scorer(f1_score, average='macro')
search = GridSearchCV(pipe, param_grid,
                      cv=ShuffleSplit(test_size=0.20, n_splits=5, random_state=0),
                      n_jobs=-1, verbose=3, scoring=scorer)
search.fit(self.data_input, self.data_output)
However, I believe I am running into some problems with overfitting:
[Image: results]
I would like to shuffle the data for every single parameter combination. Is there any way to do this? Currently, with cross-validation, the same validation sets are evaluated for every parameter combination and fold, so overfitting is becoming an issue.
No, there isn't. The search splits the data once and creates a task for each combination of fold and parameter combination (source).
Shuffling per parameter combination is probably not desirable anyway: the selection might then just pick the "easiest" split instead of the "best" parameter. If you think you are overfitting to the validation folds, then consider using
fewer parameter options
more folds, or repeated splits*
a scoring callable that customizes evaluation
models that are more conservative
*my favorite among these, although the computation cost may be too high
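As a rough sketch of the "repeated splits" option, the example below passes RepeatedStratifiedKFold to GridSearchCV, so every parameter combination is scored on the same larger set of splits. The synthetic data, the LogisticRegression estimator, and the grid values are assumptions chosen to keep the example small; they stand in for the asker's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Made-up classification data standing in for the real inputs.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 3 times: each parameter combination sees the same 15 splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=cv,
    scoring="f1_macro",
)
search.fit(X, y)
print(search.n_splits_, search.best_params_)
```

More splits mean proportionally more fits, which is the computation cost mentioned above.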

Do I need a test-train split for K-means clustering even if I'm not looking to predict anything?

I have a set of 2000 points which are basically x,y coordinates of pass origins from association football. I want to run a k-means clustering algorithm on it to just classify it to get which 10 passes are the most common (k=10). However, I don't want to predict any points for future values. I simply want to work with the existing data. Do I still need to split it into testing-training sets? I assume they're only done when we want to train the model on a particular set to calculate for future values (?)
I'm new to clustering (and Python as a whole) so any help would be appreciated.
No, in clustering (i.e. unsupervised learning) you do not need to split the data.
I disagree with the answer. Clustering has accuracy as a metric. If you do not split the data into train and test sets, you will most likely be overfitting the model. See these similar questions: 1, 2, 3. Please note that splitting data into train/test sets is unrelated to whether the problem is supervised or unsupervised.
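For the purely descriptive use case in the question, here is a minimal sketch of what the first answer describes: fit KMeans on all 2000 points and rank the clusters by size. The coordinates are randomly generated, and the 105 x 68 pitch dimensions are an assumption; the real pass data would replace the passes array.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical pass origins on a 105 x 68 pitch (made-up data).
passes = rng.uniform(low=[0, 0], high=[105, 68], size=(2000, 2))

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(passes)

# Number of passes per cluster, then clusters ordered from most to least common.
counts = np.bincount(labels, minlength=10)
ranking = np.argsort(counts)[::-1]
for cluster in ranking:
    print(cluster, counts[cluster], kmeans.cluster_centers_[cluster].round(1))
```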

How to take the average of n random forest iterations?

Is there a parameter in sklearn that can be tweaked to run a random forest (or other estimator) multiple times to smooth out variation between runs? What's the simplest way to do this?
You can't simply smooth out the variation between runs manually. What you can do is perform hyperparameter tuning using GridSearchCV (or look at other similar methods at this link). You can also cross-validate on your dataset to improve your estimator's performance; have a look at the cross-validation methods in sklearn.
Also, please provide more information about your problem, such as the type of problem you are solving, the dataset, etc., so that we can help you better.
VotingClassifier with soft voting may be what you are looking for. In general, given several sets of predictions, you may take their geometric mean to smooth them out.
import pandas as pd
from scipy.stats import gmean

df = pd.DataFrame()
# predictions renamed to 1.csv, 2.csv, ... for convenience
# (assumes each CSV has only an 'id' and a 'proba' column)
for i in range(1, 4):
    data = pd.read_csv('{}.csv'.format(i), index_col='id')
    data = data.rename(columns={'proba': i})
    df = pd.concat([df, data], axis=1)
# geometric mean across the three prediction columns (named 1, 2, 3)
df['proba'] = gmean(df[[1, 2, 3]], axis=1)
output = pd.DataFrame(data={'id': df.index, 'proba': df.proba})
output.to_csv('submissions.csv', index=False)
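If the goal is simply to average several random forest runs, one possible sketch is to fit the same forest with different seeds and average the predicted probabilities. The synthetic data, the number of runs, and n_estimators=50 are assumptions for illustration; this uses an arithmetic mean rather than the geometric mean shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data; 300 samples with the default 25% test split gives 75 test rows.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same forest with different seeds and collect class probabilities.
probas = []
for seed in range(5):
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X_train, y_train)
    probas.append(forest.predict_proba(X_test))

# Arithmetic mean across runs; each row still sums to 1.
avg_proba = np.mean(probas, axis=0)
avg_pred = forest.classes_[np.argmax(avg_proba, axis=1)]
print(avg_proba.shape)
```

Averaging forests that differ only in their seed behaves much like one larger forest, so raising n_estimators is often the simpler way to reduce run-to-run variation.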

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training on it using the sklearn support vector regressor. I get 100% accuracy on the training set, but the results on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which can take a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping a third parameter, 'degree', set to 4 (note that degree is only used by the 'poly' kernel, so it has no effect on the two kernels tried here). You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)  # R^2 on the test set
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
# SVR (not SVC), since this is a regression problem
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your model with scikit-learn's K-Fold cross-validation on the training data. This gives you a fairer split of the data and a better model, though at a computational cost that is usually acceptable for a small dataset where accuracy is the priority.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it; doing so regularizes the estimate more.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins' answer (I am not allowed to add comments): the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
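The hints above can be combined into one sketch: scale inside a pipeline, tune kernel and C with cross-validation on the training set only, and touch the test set once at the end. The generated data, the target function, and the grid values are assumptions for illustration, not the asker's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Made-up data: 1000 points, 2 inputs, 1 noisy nonlinear output.
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# The test set is used once, only to report the final R^2 score.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

The step name svr in the grid keys comes from make_pipeline, which names each step after its lowercased class name.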

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two, they seem equivalent to me.
In order to use 'predict()' you must call 'fit()' first, so calling 'fit()' and then 'predict()' on the same data is the same as calling 'fit_predict()'. However, calling only 'fit()' is useful when you need the parameters of your fitted model, whereas 'fit_predict()' just returns the labels obtained by running the model on the data.
fit_predict is usually used for transductive estimators in unsupervised machine learning.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
This might be very late to add an answer here, but someone might benefit from it in the future.
The reason I can see for having predict in KMeans and only fit_predict in DBSCAN is this:
In KMeans you get centroids based on the number of clusters considered. So once you have fit the model to your data points using fit(), you can use predict() to assign a new data point to a specific cluster.
In DBSCAN you don't have centroids; clusters are formed based on the min_samples and eps (the maximum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the data points. This behavior explains why there is no predict() method for a single new data point. The difference between fit() and fit_predict() was already explained by another user.
Another spatial clustering implementation, hdbscan, gives us an option to predict using approximate_predict(). It's worth exploring.
Again, this is my understanding based on the source code I explored. Any experts are welcome to point out differences.
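The points made in these answers can be checked with a short sketch. It compares fit_predict(X) with fit(X).predict(X) for KMeans, assigns an unseen point to a cluster, and confirms that DBSCAN exposes no predict() method. The blob data and the new point are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Made-up, well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# fit_predict(X) versus fit(X).predict(X), using the same seed for both.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels_b = km.predict(X)
same_labels = np.array_equal(labels_a, labels_b)

# A fitted KMeans can assign an unseen point to its nearest centroid.
new_point = np.array([[0.0, 0.0]])
new_label = km.predict(new_point)

# DBSCAN has no centroids and no predict() method.
dbscan_has_predict = hasattr(DBSCAN(), "predict")
print(same_labels, new_label, dbscan_has_predict)
```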
