How to convert scalar array to 2d array? - python-3.x

I am new to machine learning and facing some issues in converting scalar array to 2d array.
I am trying to implement polynomial regression in spyder. Here is my code, Please help!
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
# Predicting a new result with Polynomial Regression
lin_reg_2.predict(poly_reg.fit_transform(6.5))
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.

You get this issue in Jupyter only.
To resolve in jupyter make the value into np array using below code.
lin_reg.predict(np.array(6.5).reshape(1,-1))
lin_reg_2.predict(poly_reg.fit_transform(np.array(6.5).reshape(1,-1)))
For spyder it work same as you expected:
lin_reg.predict(6.5)
lin_reg_2.predict(poly_reg.fit_transform(6.5))

The issue with your code is linreg.predict(6.5).
If you read the error statement it says that the model requires a 2-d array , however 6.5 is scalar.
Why? If you see your X data is having 2-d so anything that you want to predict with your model should also have two 2d shape.
This can be achieved either by using .reshape(-1,1) which creates a column vector (feature vector) or .reshape(1,-1) If you have single sample.
Things to remember in order to predict I need to prepare my data in the same way as my original training data.
If you need any more info let me know.

You have to give the input as 2D array, Hence try this!
lin_reg.predict([6.5])
lin_reg_2.predict(poly_reg.fit_transform([6.5]))

Related

A function to insert data in dataset using python

I create a program that predict digits from in a dataset. I want when it predict data their should be two cases if it predict right then data should added automatically in dataset otherwise it takes right answer throw user and insert to dataset.
code
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()
I'm not sure I'm following your question, but if you want to distinguish between good and wrong predictions and take different ways, you should specific do that.
predictions = clf.predict(xtest)
good_predictions = xtest[pd.Series(predictions == actual_label)]
bad_predictions = xtest[pd.Series(predictions != actual_label)]
So, in good_predictions will be all the rows in xtest that where predicted right.

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets. One for training my classifier and the other one is for testing. Both the datasets are text files with two columns separated by a ",". FIrst column (numbers) is for the independent variable (group) and the second column is for the dependent variable.
Training data set
(just few lines for example. there are no empty lines between each row):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data setup
(as you can see, i have picked data from the training data to be sure that the score surely can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
header=None,
names=['numbers','group'],
sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x,testing_x, training_y, testing_y = train_test_split(x, y,
random_state=0,
test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data=pandas.read_table(testing_final_path,
sep=',',
header=None,
names=['numbers','group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the above data samples dont show how many classes are there in it. You need to describe more about it.
Secondly, you are calling le.fit_transform again on test data which will forget all the training samples mappings from strings to numbers. The LabelEncoder le will start encoding the test data again from scratch, which will not be equal to how it mapped training data. So the input to GaussianNB is now incorrect and hence incorrect results.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry I overlooked the fact that you had two columns in your data. LabelEncoder only works on a single column of data. For making it work on multiple pandas columns at once, look at the answers of following question:
Label encoding across multiple columns in scikit-learn
If you are using the latest version of scikit (0.20) or can update to it, then you would not need any such hacks and directly use the OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
training_file_data = enc.transform(training_file_data)

Using Keras/sklearn with CalibratedClassifierCV from sklearn.calibration

Is it possible to use Keras model objects with CalibratedClassifierCV from sklearn.calibration? Or is there another way to performa isotonic regression in sklearn/other python packages without having to pass it a model object.
I tried using the sklearn wrapper for Keras, but it didn't work. Here is the doc for the CalibratedClassifierCV class.
You can train an isotonic regression a posteriori, after prediction. Let 'file1' be a csv containing your predictions pred and real observed events obs on a subset of data. Ideally, this subset has never been used before (not even in Keras training). Let file2 contain the predictions you want to calibrate (Keras predictions for the test set).
import pandas as pd
from sklearn.isotonic import IsotonicRegression
never_seen=pd.read_csv('file1')
uncalibrated=pd.read_csv('file2')
ir = IsotonicRegression( out_of_bounds = 'clip' )
ir.fit( never_seen.pred,never_seen.obs )
p_calibrated = ir.transform( uncalibrated.pred )

Plot polynomial regression in Python with Scikit-Learn

I am writing a python code for investigating the over-fiting using the function sin(2.pi.x) in range of [0,1]. I first generate N data points by adding some random noise using Gaussian distribution with mu=0 and sigma=1. I fit the model using M-th polynomial. Here is my code
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# generate N random points
N=30
X= np.random.rand(N,1)
y= np.sin(np.pi*2*X)+ np.random.randn(N,1)
M=2
poly_features=PolynomialFeatures(degree=M, include_bias=False)
X_poly=poly_features.fit_transform(X) # contain original X and its new features
model=LinearRegression()
model.fit(X_poly,y) # Fit the model
# Plot
X_plot=np.linspace(0,1,100).reshape(-1,1)
X_plot_poly=poly_features.fit_transform(X_plot)
plt.plot(X,y,"b.")
plt.plot(X_plot_poly,model.predict(X_plot_poly),'-r')
plt.show()
Picture of polynomial regression
I don't know why I have M=2 lines of m-th polynomial line? I think it should be 1 line regardless of M. Could you help me figure out this problem.
Your data after polynomial feature transformation is of shape (n_samples,2).
So pyplot is plotting the predicted variable with both columns.
Change the plot code to
plt.plot(X_plot_poly[:,i],model.predict(X_plot_poly),'-r')
where i your column number

How to visualize an sklearn GradientBoostingClassifier?

I've trained a gradient boost classifier, and I would like to visualize it using the graphviz_exporter tool shown here.
When I try it I get:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'tree_'
this is because the graphviz_exporter is meant for decision trees, but I guess there's still a way to visualize it, since the gradient boost classifier must have an underlying decision tree.
Does anybody know how to do that?
The attribute estimators contains the underlying decision trees. The following code displays one of the trees of a trained GradientBoostingClassifier. Notice that although the ensemble is a classifier as a whole, each individual tree computes floating point values.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_graphviz
import numpy as np
# Ficticuous data
np.random.seed(0)
X = np.random.normal(0,1,(1000, 3))
y = X[:,0]+X[:,1]*X[:,2] > 0
# Classifier
clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])
# Get the tree number 42
sub_tree_42 = clf.estimators_[42, 0]
# Visualization
# Install graphviz: https://www.graphviz.org/download/
from pydotplus import graph_from_dot_data
from IPython.display import Image
dot_data = export_graphviz(
sub_tree_42,
out_file=None, filled=True, rounded=True,
special_characters=True,
proportion=False, impurity=False, # enable them if you want
)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Tree number 42:

Resources