Data Modelling - SVM - python-3.x

I am currently doing data modelling and am getting an error I couldn't find a solution to, so I am hoping to get some help from this platform. Thanks in advance.
My code:
import numpy as np
from sklearn import cross_validation
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from sklearn import svm

X = np.array(observables)        # X holds the features
y = np.array(df['diagnosis'])    # y holds the labels
X_train, y_train, X_test, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = svm.SVC()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
However I get this error:
ValueError: bad input shape (114, 8)

It seems like you mixed up the order of the return values of train_test_split. Use
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
instead of
X_train, y_train, X_test, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
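As a side note, the sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20. On current versions the equivalent code would be the following sketch, assuming observables and df are defined as in your code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm

X = np.array(observables)       # features
y = np.array(df['diagnosis'])   # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = svm.SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))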

Related

I am using LazyClassifier for my dataset but it returns an empty frame

import numpy as np
from sklearn.model_selection import train_test_split
import lazypredict
from lazypredict.Supervised import LazyClassifier

y = np.array(skin_new_df['diagnostic'])
X = np.array(skin_new_df.drop(['diagnostic'], axis=1))
print(X.shape)
print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)
I run this code and get an empty frame as output:
(2298, 25)
(2298,)
100%|██████████| 29/29 [00:08<00:00, 3.61it/s]
Accuracy Balanced Accuracy ROC AUC F1 Score Time Taken
Model
I want to get every model's accuracy.
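One way to debug this: LazyClassifier catches per-model exceptions during fit, and with ignore_warnings=True it stays silent, so an empty frame usually means every model failed to train. A diagnostic re-run sketch, assuming (as in recent lazypredict versions) that ignore_warnings=False prints each model's error:

# surface the per-model errors instead of swallowing them
clf = LazyClassifier(verbose=1, ignore_warnings=False, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)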

How to extract the actual values from the shap summary plot

How can I extract the numerical values behind the shap summary plot so the data can be viewed in a dataframe? Here is an MWE:
from sklearn.datasets import make_classification
from shap import Explainer, waterfall_plot, Explanation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate noisy data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=9,
                           n_redundant=0, n_repeated=0, n_classes=10,
                           n_clusters_per_class=1, class_sep=9, flip_y=0.2,
                           random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
explainer = Explainer(model)
sv = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_train, plot_type="bar")
I tried
np.abs(shap_values.values).mean(axis=0)
but I get a shape of (50, 10). How do I get just the aggregated value for each feature, to then sort for feature importance?
You've done this:
from sklearn.datasets import make_classification
from shap import Explainer, waterfall_plot, Explanation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shap import summary_plot

# Generate noisy data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=9,
                           n_redundant=0, n_repeated=0, n_classes=10,
                           n_clusters_per_class=1, class_sep=9, flip_y=0.2,
                           random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
explainer = Explainer(model)
sv = explainer.shap_values(X_test)
summary_plot(sv, X_train, plot_type="bar")
Note, you have features 3, 29, 34 and so on at the top.
If you do:
np.abs(sv).shape
(10, 250, 50)
You'll find out you've got 10 classes, 250 datapoints, and 50 features.
If you aggregate, you'll get everything you need:
aggs = np.abs(sv).mean(1)
aggs.shape
(10, 50)
You can draw it:
import pandas as pd
sv_df = pd.DataFrame(aggs.T)   # one row per feature, one column per class
sv_df.plot(kind="barh", stacked=True)
And if it still doesn't look familiar, you can rearrange and filter:
sv_df.loc[sv_df.sum(1).sort_values(ascending=True).index[-10:]].plot(kind="barh", stacked=True)
Conclusion:
sv_df holds the aggregated SHAP values, as in the summary plot, arranged with one feature per row and one class per column.
Does it help?
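If you just want a single importance number per feature to sort by, you can additionally collapse the class axis. A minimal sketch building on sv from above (the name overall is my own):

import numpy as np
import pandas as pd

# mean |SHAP| over datapoints, then summed over classes -> one value per feature
overall = pd.Series(np.abs(sv).mean(1).sum(0), name="mean_abs_shap")
print(overall.sort_values(ascending=False).head(10))  # top-10 features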

LogisticRegression classifier

I need to use a LogisticRegression classifier. I have a dataset in which each column has length 2000. This is all my code:
from statistics import mode
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
# Importing the datasets
###Social_Network_Ads
datasets = pd.read_csv('C:/Users/n3.csv',header=None)
X = datasets.iloc[:, 0:5].values
Y = datasets.iloc[:, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
# instantiate the model (using the default parameters)
model = LogisticRegression()
# fit the model with data
model.fit(X_Train, Y_Train)
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
train_acc = model.score(X_Train, Y_Train)
print("The Accuracy for Training Set is {}".format(train_acc*100))
But I got this error:
TypeError: Cannot clone object '<function mode at 0x000000FD6579B9D0>'
(type <class 'function'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
How do I solve this?
Change this line
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
to
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)
You have a simple typo: you want to pass your estimator to the function, but instead you passed mode, which is imported from statistics. That's why the error tells you that it cannot clone an object of type function: it expects an estimator, and you passed a function.
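Once the typo is fixed, you can also score those out-of-fold predictions directly. A minimal follow-up sketch (my addition, not part of the original answer) using accuracy_score:

from sklearn.metrics import accuracy_score

predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)
cv_acc = accuracy_score(Y_Train, predicted)  # accuracy of the cross-validated predictions
print("The cross-validated accuracy is {}".format(cv_acc * 100))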

creating baseline regression model with average and min values in python

I want to compare the results of my regression analysis (with encoded categorical variables) against two baseline models, where the baseline predictions are the average or minimum values of the groups. I've chosen R-squared and MAE for the comparison. Below is a simplified example of my code for illustration. It works in that it gives me output which I think achieves my goal, but is this the correct and/or best way to do it?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
df = pd.DataFrame([['a1','c1',10],
['a1','c2',15],
['a1','c3',20],
['a1','c1',15],
['a2','c2',20],
['a2','c3',15],
['a2','c1',20],
['a2','c2',15],
['a3','c3',20],
['a3','c3',15],
['a3','c3',15],
['a3','c3',20]], columns=['aid','cid','T'])
df_dummies = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
df_dummies
X = df_dummies
y = df_dummies['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group average as prediction
y_pred = df.groupby('aid').agg({'T': ['mean']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group min as prediction
y_pred = df.groupby('aid').agg({'T': ['min']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
First of all, I would not reuse the name y_pred every time, to avoid confusion.
In general,
y_pred = df.groupby('aid').agg({'T': ['mean']})
will give you the mean of 'T' within each 'aid' group, and
y_pred = df.groupby('aid').agg({'T': ['min']})
will give you the minimum.
There is a class that does exactly what you want: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html
DummyRegressor builds baseline regressors and supports several strategies.
In your case it should work like this (note that min_value has to be defined; the constant strategy predicts that fixed value for every sample):
from sklearn.dummy import DummyRegressor

df_dummies = pd.get_dummies(df, columns=['aid','cid'], prefix_sep='', prefix='')
X = df_dummies
y = df['T']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
min_value = y_train.min()  # constant prediction for the "min" baseline
dummy_min = DummyRegressor(strategy='constant', constant=min_value)
dummy_min.fit(X_train, y_train)
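For the average baseline there is strategy='mean', which predicts the overall training mean (note this is a single constant, not your per-group means). A sketch of how you could then score a baseline with the same metrics as your model:

from sklearn import metrics

dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_train, y_train)
y_pred_base = dummy_mean.predict(X_test)  # same constant for every test row
print('R-squared:', metrics.r2_score(y_test, y_pred_base))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_base))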

Error encountered: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
file = './BBC.csv'
df = read_csv(file)
array = df.values
X = array[:, 0:11]
Y = array[:, 11]
test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, X_test)
print("Accuracy: %.3f%%") % (result*100.0)
dataset: https://www.dropbox.com/s/ar1c9yuv5x774cv/BBC.csv?dl=0
I have encountered this error:
Classification metrics can't handle a mix of multiclass-multioutput and binary targets
If I'm not wrong, RandomForest should be able to handle both classes (classification) and means (regression). Am I wrong?
Edit: I checked your dataset. For the classification task, the problem lies in this line of your code:
result = model.score(X_test, X_test)
The parameters here should be X_test and Y_test:
result = model.score(X_test, Y_test)
-----kind of off-topic-----
If you want to use RandomForest for regression, you should call RandomForestRegressor instead.
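For completeness, a minimal regression sketch with the same variables (my addition; note that score on a regressor returns R-squared rather than accuracy):

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()
reg.fit(X_train, Y_train)
print("R-squared: %.3f" % reg.score(X_test, Y_test))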
