ValueError: Expected 2D array, got 1D array instead

While practicing a simple linear regression model I got this error.
I think there is something wrong with my data set.
Here is my data set:
Here is independent variable X:
Here is dependent variable Y:
Here is X_train:
Here is Y_train:
This is the error message:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is my code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into training set and test set
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you

You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only one-dimensional. What the error message suggests should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses NumPy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]]. The -1 tells NumPy to infer the number of rows from the length of the array and the remaining dimensions, and the 1 is the number of columns, giving an n x 1 matrix where n is the input length.
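The transformation described above can be checked with a few lines of NumPy. This is only a small sketch; the variable names col and row are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])   # shape (3,): one-dimensional

# -1 lets NumPy infer the number of rows; 1 fixes the number of columns.
col = x.reshape(-1, 1)    # shape (3, 1): three samples, one feature

# The other suggestion from the error message: one sample, three features.
row = x.reshape(1, -1)    # shape (1, 3)

print(x.shape, col.shape, row.shape)   # (3,) (3, 1) (1, 3)
print(col.tolist())                    # [[1], [2], [3]]
```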
Questions about reshape have been answered before. This one, for example, should explain fully what reshape(-1, 1) means: What does -1 mean in numpy reshape? (Some of the other answers below also explain this very well.)

When doing linear regression problems, people often like to envision this graph:
As the input, we have X = [1,2,3,4,5].
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices. It's multiple features (e.g. number of rooms, location, etc.).
If you look at the documentation you will see this:
It tells us that the rows consist of the samples while the columns consist of the features.
However, consider what happens when we have only one feature as our input. Then we need an n x 1 dimensional input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means "choose a number of rows that works, based on the number of columns provided". See the image for how it changes the input.
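As a rough illustration of the rows-are-samples, columns-are-features layout, the sketch below builds a two-feature input and a one-feature input. The housing numbers are made up for this example and do not come from the question.

```python
import numpy as np

# Made-up values, purely illustrative: four houses.
rooms = np.array([2, 3, 3, 4])
area = np.array([55.0, 70.0, 82.0, 110.0])

# Two features: each row is one house, each column is one feature.
X_multi = np.column_stack([rooms, area])   # shape (4, 2)

# One feature: still a 2D array, just with a single column.
X_single = area.reshape(-1, 1)             # shape (4, 1)

print(X_multi.shape, X_single.shape)       # (4, 2) (4, 1)
```

Both arrays have the (n_samples, n_features) layout; the single-feature case only differs in having one column.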

If you look at the documentation of LinearRegression in scikit-learn:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has two dimensions, whereas your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
before fitting the model and predicting with it.

If x_test is a single value (one sample with one feature), use
y_pred = regressor.predict([[x_test]])
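A short sketch of when the double brackets help and when they do not. The training data is made up (points on the line y = 2x + 1), and the names single, wrapped_shape and batch are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on the line y = 2x + 1 (made up for illustration).
x_train = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_train = np.array([3.0, 5.0, 7.0, 9.0])

regressor = LinearRegression().fit(x_train, y_train)

# One sample with one feature: [[5.0]] has shape (1, 1), which predict accepts.
single = regressor.predict([[5.0]])

# A 1D array wrapped in two lists becomes 3D, which predict would reject.
x_test = np.array([5.0, 6.0])
wrapped_shape = np.array([[x_test]]).shape   # (1, 1, 2)

# For several samples, reshape into a column instead.
batch = regressor.predict(x_test.reshape(-1, 1))
```

So the nested-list form fits a scalar input, while an array of test values still needs reshape(-1, 1).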

I would suggest reshaping x at the beginning, before you split it into training and test sets:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick
x = x.reshape(-1,1)
# Splitting the dataset into training set and test set
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)

This is what I use (X_train, y_train, X_test and y_test are pandas objects here, hence .values):
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

This is the solution when x_test is a single value rather than an array:
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))

Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
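to the following (.values is needed only if x_train and x_test are pandas Series; if they are already NumPy arrays, as in the question, call reshape on them directly):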
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))

Related

Generate Random Forest feature importance plots from 3D arrays

After carrying out a librosa MFCC feature extraction on 1000 audio files, I end up with an X_test array of size 1000 x 40 x 174 (40 features, as I set n_mfcc=40). In order to pass this through the random forest classifier, I scaled and then flattened the array. My new X_test now has a size of 1000 x 6960. How do I go about correctly generating the feature importance histogram?
This is the code that I used for the feature importance plot, but I am not sure if this is the correct approach:
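import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# x_train has a shape of 1000 x 40 x 174; scaler and y_train are created elsewhere (not shown)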
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
X_train = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X = pd.DataFrame(X_train) # X_train here is already flattened to 1000 x 6960
feature_names = [f"feature {i}" for i in range(X.shape[1])]
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
# Set the size on the figure that owns ax; a separate plt.figure() call would open a second, empty figure
fig, ax = plt.subplots(figsize=(13, 10))
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
With this code, I get this plot:
Can you tell me if this is the correct approach? If this approach is correct, how can I generate a more "readable" plot for the Feature Importance? Thanks!
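One possible way to make the plot more readable, sketched under the assumption that flatten() kept NumPy's default row-major order, is to map each of the 6960 flattened features back to its (MFCC coefficient, frame) pair and sum the importances per coefficient. The importances array below is random stand-in data, not output from the question's classifier, and the names coef_idx, frame_idx and per_coefficient are hypothetical.

```python
import numpy as np

n_mfcc, n_frames = 40, 174   # dimensions given in the question

# Stand-in for clf.feature_importances_ (random values, normalised to sum to 1).
rng = np.random.default_rng(0)
importances = rng.random(n_mfcc * n_frames)
importances /= importances.sum()

# Row-major flattening means flat index i came from (i // n_frames, i % n_frames).
coef_idx, frame_idx = np.unravel_index(
    np.arange(importances.size), (n_mfcc, n_frames)
)

# Sum over frames: one importance per MFCC coefficient, 40 bars instead of 6960.
per_coefficient = importances.reshape(n_mfcc, n_frames).sum(axis=1)
```

Plotting per_coefficient as a bar chart would give 40 labelled bars; whether summing over frames is the right aggregation depends on what the importance is meant to show.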

How to use the statsmodels robust regression predict function and output a prediction interval?

I am trying to predict out-of-sample data with statsmodels robust linear regression. I am having difficulties doing so with the predict function. Below is the code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
x = np.random.rand(500)
y = np.random.rand(500)
x_train = x[0:200]
x_test = x[200:]
y_train = y[0:200]
y_test = y[200:]
X = sm.add_constant(x_train)
model = sm.RLM(y_train, X, M=sm.robust.norms.HuberT())
result = model.fit()
print(result.summary())
predict = result.predict(x_test)
When I try to use the predict function I get the following error:
ValueError: shapes (1,300) and (2,) not aligned: 300 (dim 1) != 2 (dim 0)
When I try to reshape the x_test variable with the following code:
X_test = x_test.reshape(-1,2)
predict = result.predict(X_test)
It works but the predict variable array has only 150 predictions and I want the number of predictions to be 300 to be able to compare them with the y_test variable.
How can I use the predict function to give me same length of output as the y_test variable?
How can I print a prediction interval for each out-of-sample data point?
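The shape error appears to come from the missing constant column: the model was fit on sm.add_constant(x_train), which has two columns, while x_test is one-dimensional. The NumPy sketch below only illustrates the shapes involved. Here params is a made-up stand-in for result.params, and np.column_stack mimics what sm.add_constant does for a 1D input. In the question's code the corresponding call would presumably be result.predict(sm.add_constant(x_test)). The sketch does not cover prediction intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = rng.random(300)

# Placeholder for result.params as [intercept, slope]; the values are made up.
params = np.array([0.5, 0.1])

# Same layout as the training design matrix: a column of ones, then x.
X_test = np.column_stack([np.ones_like(x_test), x_test])   # shape (300, 2)

# One prediction per test sample.
pred = X_test @ params                                     # shape (300,)

# reshape(-1, 2) instead pairs up consecutive x values, halving the row count.
paired = x_test.reshape(-1, 2)                             # shape (150, 2)

print(X_test.shape, pred.shape, paired.shape)   # (300, 2) (300,) (150, 2)
```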

Prediction with linear regression is very inaccurate

This is the CSV that I'm using: https://gist.github.com/netj/8836201. I'm trying to predict the variety, which is categorical data, with linear regression, but the prediction is very inaccurate. The actual labels are just combinations of 0 and 1, but the predictions are fractional values such as 0.43 or 1.17, and some are even negative. Where did I make the mistake, and what is the solution for this inaccuracy? This is an assignment my teacher gave me; he said we could predict categorical data with linear regression, not only logistic regression.
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]
y = array[:,4]
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder(categorical_features=[0])
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': X_test.flatten(), 'Predicted': y_pred.flatten()})
The output:
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.
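A minimal sketch of the classification approach, assuming scikit-learn's bundled copy of the iris data (load_iris) in place of the CSV from the question. The max_iter value is an arbitrary choice made here to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Integer class labels 0, 1, 2; no one-hot encoding of y is needed.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features only; the target stays untouched.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)          # class labels, not continuous values
accuracy = model.score(X_test, y_test)  # fraction of correct predictions
```

The predictions are class labels drawn from the training targets, so they can be compared directly with y_test.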

Scikitlearn LinearSVC Bad input shape

I'm trying to use LinearSVC on my data! My code below:
from sklearn import svm
clf2 = svm.LinearSVC()
clf2.fit(X_train, y_train)
Results in the following error:
ValueError: bad input shape (2190, 9)
I've used one-hot encoding on my y value before splitting into y_test and y_train, and believe this to be the issue. I've tried implementing similar fixes (sklearn (Bad Input Shape) ValueError) but still get errors when I try and re-shape.
After one-hot encoding, I have a target variable (y) that has 9 classes, and there are a total of 2190 samples I'm running. It seems I need to reduce these 9 classes to 1 class in order to fit the SVM.
Any suggestions would be greatly appreciated!
LinearSVC doesn't accept 2D values for y. As documented:
Parameters:
y : array-like, shape = [n_samples]
Target vector relative to X
So you don't need to convert y into a one-hot encoded matrix. Just supply the labels as they are, even if they are strings. They will be handled correctly internally.
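If only the one-hot matrix is still at hand, one way to get back to a 1D label vector is np.argmax along the class axis. This is a sketch with a tiny made-up matrix; it assumes each row contains exactly one 1.

```python
import numpy as np

# Three made-up samples, one-hot encoded over 9 classes.
y_onehot = np.zeros((3, 9), dtype=int)
y_onehot[0, 2] = 1
y_onehot[1, 0] = 1
y_onehot[2, 8] = 1

# The column index of the 1 in each row is the class label.
y_labels = np.argmax(y_onehot, axis=1)   # shape (3,)

print(y_labels)   # [2 0 8]
```

The resulting y_labels has shape (n_samples,), which is the form the quoted documentation asks for.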
According to the documentation, you may try sklearn.multiclass.OneVsRestClassifier as follows:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)
You need to reshape the arrays. Here is an example using random data, with a target variable that contains 5 classes:
import numpy as np
from sklearn import svm
# 100 samples and 10 features
x = np.random.rand(100, 10)
#5 classes
y = [1,2,3,4,5] * 20
x = np.asarray(x)
y = np.asarray(y)
print(x.shape)
print(y.shape)
clf2 = svm.LinearSVC()
clf2.fit(x, y)
Results:
(100, 10)
(100,)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)

Interpretation of coef_ attribute of LogisticRegression class

I have a (possibly silly) question regarding the coef_ attribute of sklearn.linear_model.LogisticRegression.
I fit a LogisticRegression model to the Iris dataset using only two features (petal width and length). To obtain the weights of each feature I use the coef_ attribute, and it returns a 3x2 array. I understand that I get 3 rows because of the 3 classes and the one-vs-rest rule.
However, I cannot understand why it includes only w_1 and w_2 (or theta_1 and theta_2, depending on which notation you use), the coefficients of features 1 and 2, but is missing w_0 (or theta_0), which is the intercept.
Code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()
X = iris.data[:, [2,3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
lr = LogisticRegression(C = 1000, random_state=0)
lr.fit(X_train_std, y_train)
lr.coef_
The coef_ attribute gives only the coefficients of the features in the decision function.
You can get the intercept by using:
lr.intercept_
You might notice the strange-looking trailing underscore at the end of coef_ and intercept_. Scikit-learn always stores anything that is derived from the training data in attributes that end with a trailing underscore. That is to separate them from the parameters that are set by the user.
lr.coef_
coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
lr.intercept_
intercept_ : ndarray of shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
Source: scikit-learn official documentation
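To see how coef_ and intercept_ fit together, the sketch below refits a model on the same two iris features and rebuilds the decision function by hand. It scales the full data set rather than a train/test split to keep the example short, and max_iter=1000 is an arbitrary choice made here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]   # petal length and petal width
y = iris.target
X_std = StandardScaler().fit_transform(X)

lr = LogisticRegression(C=1000, random_state=0, max_iter=1000).fit(X_std, y)

print(lr.coef_.shape)        # (3, 2): one row of feature weights per class
print(lr.intercept_.shape)   # (3,): one intercept (w_0) per class

# The decision function combines both: scores = X @ coef_.T + intercept_
scores = X_std @ lr.coef_.T + lr.intercept_
matches = np.allclose(scores, lr.decision_function(X_std))
```

So w_0 is not missing; it is stored separately in intercept_, one value per class.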
