I want to understand what the following call does:
train_data, test_data = train_test_split(data,
                                         test_size=TEST_SIZE,
                                         stratify=data[TARGET_NAME],
                                         random_state=RANDOM_STATE)
Until now I have only seen train_test_split unpacked into x_train, x_test, y_train, y_test.
I don't understand how x_train and y_train correspond to the code above.
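One way to read the call: train_test_split accepts any number of arrays and returns a train/test pair for each one. Passing a single DataFrame that still contains the target column therefore returns two DataFrames, and the features and target can be separated afterwards. The sketch below assumes a small made-up DataFrame and placeholder values for TARGET_NAME, TEST_SIZE and RANDOM_STATE, none of which come from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder constants and toy data, assumed for illustration only.
TARGET_NAME = "label"
TEST_SIZE = 0.25
RANDOM_STATE = 42
data = pd.DataFrame({
    "f1": range(20),
    "f2": [v * 0.5 for v in range(20)],
    "label": [0, 1] * 10,
})

# One input (the whole DataFrame) gives one train/test pair of DataFrames.
train_data, test_data = train_test_split(
    data,
    test_size=TEST_SIZE,
    stratify=data[TARGET_NAME],
    random_state=RANDOM_STATE,
)

# Separate features and target after the split.
x_train = train_data.drop(columns=[TARGET_NAME])
y_train = train_data[TARGET_NAME]
x_test = test_data.drop(columns=[TARGET_NAME])
y_test = test_data[TARGET_NAME]

# The more familiar four-way form: two inputs (X and y) give two pairs.
X = data.drop(columns=[TARGET_NAME])
y = data[TARGET_NAME]
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
)
```

With the same random_state and stratification labels, both forms should select the same rows, so the choice is mostly a matter of whether you want to keep the target attached to the features a little longer.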
Related
I followed this scikit-learn guide to find feature importances for a classification problem. Here's the code from the link:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
The problem is that this isn't quite what I want. I'd like to discover feature importance per class.
One idea is to turn the data into a binary classification problem for each class and train a decision tree per class.
Is that a good approach? What are common ideas to deal with this problem?
Thanks!
Yes, one-vs-all classification is a common way of dealing with that issue. You could take that approach. While I don't think there is a principled way of obtaining class-specific feature importance for random forests, you could use the SHAP package to get Shapley values empirically.
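The one-vs-all idea can be sketched as follows. This is only an illustration, assuming the same iris data and ExtraTreesClassifier as in the question: each class gets its own binary target ("this class" vs. "everything else") and its own model, whose feature_importances_ are then read off. The dictionary name per_class_importances is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# One binary model per class: the current class is 1, all other classes are 0.
per_class_importances = {}
for cls in np.unique(y):
    binary_target = (y == cls).astype(int)
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    clf.fit(X, binary_target)
    per_class_importances[int(cls)] = clf.feature_importances_

for cls, importances in per_class_importances.items():
    print(cls, np.round(importances, 3))
```

These are impurity-based importances of each binary model, so they describe how useful a feature is for separating one class from the rest, which is not the same thing as a Shapley value.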
I am trying to understand the meaning of the GridSearchCV attributes. Can someone help me understand what "split0_test_score" represents mathematically?
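As a rough illustration of what that key holds: in cv_results_, "split0_test_score" is an array with one entry per parameter candidate, each being the score obtained on the held-out part of the first cross-validation split, and "mean_test_score" is the average over all splits. The sketch below assumes iris data, an SVC and a three-value grid for C, all chosen only for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Three candidates for C, three cross-validation splits.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
search.fit(X, y)
results = search.cv_results_

# One score per candidate, measured on the test part of split 0.
print(results["split0_test_score"])

# Averaging the per-split scores reproduces mean_test_score.
split_scores = np.vstack([results[f"split{i}_test_score"] for i in range(3)])
manual_mean = split_scores.mean(axis=0)
print(manual_mean, results["mean_test_score"])
```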
I want to use MLPRegressor in scikit-learn, and my inputs are on different scales.
Here is my code:
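from sklearn.neural_network import MLPRegressor

# committee (defined elsewhere) is the number of units in the single hidden layer
smlp = MLPRegressor(hidden_layer_sizes=(committee,),
                    activation='relu',
                    solver='adam',
                    learning_rate='adaptive',
                    max_iter=3000,
                    learning_rate_init=0.01,
                    alpha=0.01)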
It is better to standardize the data in order to improve the convergence.
from sklearn.preprocessing import StandardScaler
Regarding the output values: you might want to standardize them too, since that can also help convergence. However, it will be harder to interpret the results afterwards.
Nevertheless, if you are aiming to work with neural networks, it might be worth looking into the Keras library, which offers more up-to-date functionality, GPU training, etc.
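The standardization advice can be sketched like this. It is a minimal example under stated assumptions: the data is synthetic, the hidden layer size of 16 is arbitrary, and wrapping the model in TransformedTargetRegressor is one possible way to scale the target while still getting predictions back in the original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two inputs on very different scales (assumed for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

# Inputs are standardized inside the pipeline; the target is standardized by
# TransformedTargetRegressor, which also inverts the scaling at predict time.
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0),
    ),
    transformer=StandardScaler(),
)
model.fit(X, y)
pred = model.predict(X[:5])  # predictions are in the original units of y
print(pred)
```

Keeping the scalers inside the estimator means they are fit only on the training data whenever the model is cross-validated.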
When training my model, I'm getting very different results when I use something like sklearn.model_selection.train_test_split(X, y, stratify=y, train_size=0.9) vs. sklearn.model_selection.StratifiedKFold(n_splits=10) and was wondering if there was a difference between how they stratified their data. I'm almost certain I implemented everything according to the docs, but strangely enough, the latter gives much worse testing accuracy than the first.
When stratify is not None, train_test_split uses StratifiedShuffleSplit internally, not StratifiedKFold. So yes, there is a big difference.
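A small sketch of the difference, using made-up balanced labels: StratifiedKFold partitions the data, so every sample lands in exactly one test fold, and by default it does not shuffle. StratifiedShuffleSplit draws each test set as a fresh random stratified sample, so test sets from different splits can overlap.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Toy data: 100 samples, two balanced classes (assumed for illustration).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# StratifiedKFold: 10 disjoint test folds that together cover every sample.
skf = StratifiedKFold(n_splits=10)
kfold_tests = [test for _, test in skf.split(X, y)]
kfold_covered = set(np.concatenate(kfold_tests).tolist())

# StratifiedShuffleSplit: 10 independent random 90/10 stratified splits.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
shuffle_tests = [test for _, test in sss.split(X, y)]
shuffle_covered = set(np.concatenate(shuffle_tests).tolist())

print(len(kfold_covered))    # every sample is tested exactly once
print(len(shuffle_covered))  # may be smaller, since test sets can overlap
```

If the rows are ordered in some meaningful way, the unshuffled folds of StratifiedKFold can behave quite differently from a single random split; passing shuffle=True is one thing to try when comparing the two.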
I'm trying to build a model in Python 3.5 and am following an example that can be found here.
I have imported all the required libraries from sklearn.
However I'm getting the following error.
Code:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # For K-fold cross-validation (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, loan,predictor_var,outcome_var)
When I run the above code I get the following error:
NameError: name 'classification_model' is not defined
I'm not sure how to resolve this, as I tried importing sklearn and all of its submodules.
P.S. I'm new to Python, so I'm still trying to figure out the basic steps.
Depending on the exact details, this may not be what you want, but I have never had a problem with:
import sklearn.linear_model as sk
logreg = sk.LogisticRegressionCV()
logreg.fit(loan[predictor_var], loan[outcome_var])  # fit on the DataFrame columns, not on the column names
This means you have to explicitly separate your training and test set, but having fit to a training set (the process in the final line of my code), you can then use the methods detailed in the documentation [1].
For example, the .score method tells you what score (the fraction of predictions that are correct) you get on unseen data.
[1] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
It appears this code came from this tutorial.
The issue is exactly as the error describes. classification_model is currently undefined. You need to create this function yourself before you can call it. Check out this part of that tutorial so you can see how it's defined. Good luck!
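To illustrate what "create this function yourself" means, here is a hypothetical sketch of such a helper. It is not the tutorial's definition: the body, the return values and the toy loan DataFrame are all assumptions made for this example. The point is only that the name must be defined with def before it is called.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def classification_model(model, data, predictors, outcome):
    """Hypothetical helper: fit the model and report simple accuracy figures."""
    model.fit(data[predictors], data[outcome])
    train_accuracy = model.score(data[predictors], data[outcome])
    # 5-fold cross-validated accuracy as a rough out-of-sample estimate.
    cv_scores = cross_val_score(model, data[predictors], data[outcome], cv=5)
    print("Training accuracy: {:.3f}".format(train_accuracy))
    print("Cross-validation accuracy: {:.3f}".format(cv_scores.mean()))
    return train_accuracy, cv_scores.mean()


# Toy stand-in for the loan DataFrame (made up for illustration).
loan = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0] * 3,
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N"] * 3,
})

outcome_var = "Loan_Status"
predictor_var = ["Credit_History"]
train_acc, cv_acc = classification_model(
    LogisticRegression(), loan, predictor_var, outcome_var
)
```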
from sklearn.metrics import classification_report