to Find out accuracy of a model in Python using decision tree - python-3.x

I am very new in Python, So I was trying to run the below code and I want to find accuracy but this program is not displaying accuracy nor any error.
I have tried again and again but its still not displaying. I use jupyter notebook. I am using two datasets(csv) in this program.
def DecisionTree():
from sklearn import tree
clf3 = tree.DecisionTreeClassifier() # empty model of the decision tree
clf3 = clf3.fit(X,y)
from sklearn.metrics import accuracy_score
y_pred=clf3.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred,normalize=False))

You can simply use:
print(clf3.score(X_test, y_test))

Related

How to split dataset and cross-validate in Surprise?

I wrote the following code below which works:
from surprise.model_selection import cross_validate
cross_validate(algo,dataset,measures=['RMSE', 'MAE'],cv=5, verbose=False, n_jobs=-1)
However when I do this: (notice the trainset is passed here in cross_validate instead of whole dataset)
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(dataset, test_size=test_size)
cross_validate(algo, trainset, measures=['RMSE', 'MAE'],cv=5, verbose=False, n_jobs=-1)
It gives the following error:
AttributeError: 'Trainset' object has no attribute 'raw_ratings'
I looked it up and
Surprise documentation says that Trainset objects are not the same as dataset objects, which makes sense.
However, the documentation does not say how to convert the trainset to dataset.
My question is:
1. Is it possible to convert Surprise Trainset to surprise Dataset?
2. If not, what is the correct way to train-test split the whole dataset and cross-validate?
I believe trainset is not for cross-validation. Dataset is for cross-validation. if you print your dataset print(dataset), it will give:
<surprise.dataset.DatasetAutoFolds object at 0x7fe7fc06cd50>
which has already been configured for auto cross-validation.
The trainset which you got from train_test_split is not for cross-validation. You have to run the .fit method.
algo.fit(trainset)
If you want to use cross-validation on your trainset, I believe you have to do a train_test_split (through sklearn?) on your raw data, before you use the reader to convert them.

Is it possible to use a custom-defined decision tree classifier in Scikit-learn?

I have a predefined decision tree, which I built from knowledge-based splits, that I want to use to make predictions. I could try to implement a decision tree classifier from scratch, but then I would not be able to use build in Scikit functions like predict. Is there a way to convert my tree in pmml and import this pmml to make my prediction with scikit-learn? Or do I need to do something completely different?
My first attempt was to use “fake training data” to force the algorithm to build the tree the way I like it, this would end up in a lot of work because I need to create different trees depending on the user input.
You can create your own decision tree classifier using Sklearn API. Please read this documentation following the predictor class types. As explained in this section, you can build an estimator following the template:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
class TemplateClassifier(BaseEstimator, ClassifierMixin):
def __init__(self, demo_param='demo'):
self.demo_param = demo_param
def fit(self, X, y):
# Two paths. Just return the object, or implement here your decision rules
return self
def predict(self, X):
# Check is fit had been called
check_is_fitted(self)
# Input validation
X = check_array(X)
# Change this to your decision tree "rules"
closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
return self.y_[closest]

Single prediction using a model pre-trained with scaled features

I trained a SVM scikit-learn model with scaled features and persist it to be used later. In another file I loaded the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values and I am getting weird outcomes and I cannot do the predictions. Despite of this, the prediction with a large test set generated by StratifiedShuffleSplit is working fine and I am getting a 97% of accuracy.
The problem is with the single predictions using a persisted SVM model trained with scaled features. Some idea of what am I doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why instead of training and saving your SVM, you should train and save your scaler with your SVM together. In the machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler',StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you will load it and apply it to new data, it will do the scaling as desired before the predictions.

Data Underfitting Or Not?

Is the line of regression underfitting and if yes what can I do for accurate results? I have not been able to identify such things like if the line of regression is overfitting or underfitting or accurate so suggestions regarding those will also be appreciated. The File "Advertising.csv":-https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))
To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split
# This will split the data into a train set and a test set, leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1), data['sales'].values.reshape(-1,1), test_size=0.2)
# Then fit your model ...
# e.g. reg.fit(X,Y)
# Finally evaluate how well it does on the training and test data.
print("Test score " + str(reg.score(X_test, Y_test)))
print("Train score " + str(reg.score(X_test, Y_test)))
Instead of training and testing on same data.
Split your data set into 2,3 sets (train,validation,test)
You may only need to split it in 2 (train,test) use sklearn library function train_test_split
Train your model on training data. Then test on testing data and see if you get good result.
If model's training accuracy is very high but testing is very low then you may say it have overfit. Or if model don't even get high accuracy on train then it is underfitting.
Hope it will you. :)

sklearn: vectorizing in cross validation for text classification

I have a question about using cross validation in text classification in sklearn. It is problematic to vectorize all data before cross validation, because the classifier would have "seen" the vocabulary occurred in the test data. Weka has filtered classifier to solve this problem. What is the sklearn equivalent for this function? I mean for each fold, the feature set would be different because the training data are different.
The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])
clf is now a composite estimator that does feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of string) documents and their labels y, calling
>>> cross_val_score(clf, documents, y)
will do feature extraction in each fold separately so that each of the SVMs knows only the vocabulary of its (k-1) folds training set.

Resources