How to take the average of n random forest iterations? - scikit-learn

Is there a parameter in sklearn that can be tweaked to run a random forest (or other estimator) multiple times to smooth out variation between runs? What's the simplest way to do this?

You can't just simply smooth out the variations between the runs manually. What you can do is perform hyper parameter tuning using GridSearchCV ( or you can look at other similar methods as well at this link. Also you can also look at doing Cross-validation of your dataset for better performance of your estimator. You can have a look at the methods in Sklearn for cross-validation.
Also please provide more information for your problem, like the type of problem you are solving, dataset, etc. so that we can help you better.

VotingClassifier with soft voting may be what you are looking for. In general, given two sets of predictions, you may take the geometric mean of the prediction to smooth it out.
from scipy.stats.mstats import gmean
df = pd.DataFrame()
#prediction renamed in 1.csv,2.csv... for convenience
for i in range(1,4):
data = pd.read_csv('{}.csv'.format(i),index_col='id')
data = data.rename(columns={'proba':i})
df = pd.concat([df,data],axis=1)
df['proba'] = gmean(df.iloc[:,1:4],axis=1)
output = pd.DataFrame(data={'id':df.index,'proba':df.proba})
output.to_csv('submissions.csv',index=False)

Related

Different results after repeating TSNE after KMeans clustering

I'm using sklearn.manifold.TSNE to project onto 2-dimensional space a dataset that I've separately clustered using sklearn.clustering.KMeans. My code is the following:
clustering = KMeans(n_clusters=5, random_state=5)
clustering.fit(X)
tsne = TSNE(n_components=2)
result = tsne.fit_transform(X)
sc = plt.scatter(x=result[:,0], y=result[:,1],
s=10, c=clustering.labels_)
The perplexity that I have is, that by repeating the process more and more, it seems that my data get clustered in totally different ways as you can see below:
I'm not an expert on clustering nor dimensionality reduction techniques, so I guess that it might be partly due to the stochastic nature of TSNE. Might it also be that I'm using too many features to perform the clustering? (132)
Did you try to set random_state parameter in TSNE ? It should probably fix it.
Fonctions that use randomness at some point have generaly an input parameter to insure that same inputs generate same outputs. This argument is generaly called random_state or seed.
Hope this will help.

Why does LogisticRegression give the same result every time, even with different random state?

I am not an expert on logistic regression, but I thought when solving it using lgfgs it was doing optimization, finding local minima for the objective function. But every time I run it using scikit-learn, it is returning the same results, even when I feed it a different random state.
Below is code that reproduces my issue.
First set up the problem by generating data
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import datasets
# generate data
X, y = datasets.make_classification(n_samples=1000,
n_features=10,
n_redundant=4,
n_clusters_per_class=1,
random_state=42)
# Set up the test/training data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)
Second, train the model and inspect results
# Set up a different random state each time
rand_state = np.random.randint(1000)
print(rand_state)
model = LogisticRegression(max_iter=1000,
solver='lbfgs',
random_state=rand_state)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
conf_mat = metrics.confusion_matrix(y_test, y_pred)
print(y_pred[:20],"\n", conf_mat)
I get the same y_pred (and obviously confusion matrix) every time I run this even though I'm using the lbfgs solver with a different random state each run. I'm confused, as I thought this was a stochastic solver that was traveling down a gradient into a local minimum.
Maybe I'm not properly randomizing the initial state? I haven't been able to figure it out from the documentation.
Discussion of Related Question
There is a related question, which I didn't find during my research:
Does logistic regression always find global optimum, assuming that the optimisation converges?
The answer there is that the cost function is convex, so if the numerical solution is well-behaved, it will find a global minimum. That is, there aren't a bunch of local minima that your optimization algorithm will get stuck in: it will reach the same (global) minimum each time (perhaps depending on the solver you choose?).
However, in the comments someone pointed out, depending on what solvers you choose there are cases when you will not reach the same solution, that it depends on the random_state parameter. At the very least, I think this would be helpful to resolve.
First, let me put in the answer what got this closed as duplicate earlier: a logistic regression problem (without perfect separation) has a global optimum, and so there are no local optima to get stuck in with different random seeds. If the solver converges satisfactorily, it will do so on the global optimum. So the only time random_state can have any effect is when the solver fails to converge.
Now, the documentation for LogisticRegression's parameter random_state states:
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. [...]
So for your code, with solver='lbfgs', indeed there is no expected effect.
It's not too hard to make sag and saga fail to converge, and with different random_states to end at different solutions; to make it easier, set max_iter=1. liblinear apparently does not use the random_state unless solving the dual, so also setting dual=True admits different solutions. I found that thanks to this comment on a github issue (the rest of the issue may be worth reading for more background).

XGboost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern ? Remove "noisy" features who drive down your results, obtain a sparse model ? Backward selection is one way to do of course. That being said, not sure if you are aware of this but XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define for instance a threshold below which you do not keep the variables. However do not forget that :
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.

Sklearn overfitting

I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two, they seem equivalent to me.
In order to use the 'predict' you must use the 'fit' method first. So using 'fit()' and then 'predict()' is definitely the same as using 'fit_predict()'. However, one could benefit from using only 'fit()' in such cases where you need to know the initialization parameters of your models rather than if you use 'fit_predict()', where you will just be obtained the labeling results of running your model on the data.
fit_predict is usually used for unsupervised machine learning transductive estimator.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
This might be very late to add an answer here, It just that someone might get benefitted in future
The reason I could relate for having predict in kmeans and only fit_predict in dbscan is
In kmeans you get centroids based on the number of clusters considered. So once you trained your datapoints using fit(), you can use that to predict() a new single datapoint to assign to a specific cluster.
In dbscan you don't have centroids , based on the min_samples and eps (min distance between two points to be considered as neighbors) you define, clusters are formed . This algorithm returns cluster labels for all the datapoints. This behavior explains why there is no predict() method to predict a single datapoint. Difference between fit() and fit_predict() was already explained by other user -
In another spatial clustering algorithm hdbscan gives us an option to predict using approximate_predict(). Its worth to explore that.
Again its my understanding based on the source code I explored. Any experts can highlight any difference.

Resources