Numpy Python: Exception: Data must be 1-dimensional - python-3.x

Getting exception Exception: Data must be 1-dimensional
using NumPy in Python 3.7
Same code is working for others but not in my case. Bellow is my code please help
Working_code_in_diff_system
Same_code_not_working_in_my_system
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('./Data/new-data.csv', index_col=False)
x_train, x_test, y_train, y_test = train_test_split(df['Hours'], df['Marks'], test_size=0.2, random_state=42)
sns.jointplot(x=df['Hours'], y=df['Marks'], data=df, kind='reg')
x_train = np.reshape(x_train, (-1,1))
x_test = np.reshape(x_test, (-1,1))
y_train = np.reshape(y_train, (-1,1))
y_test = np.reshape(y_test, (-1,1))
#
print('Train - Predictors shape', x_train.shape)
print('Test - Predictors shape', x_test.shape)
print('Train - Target shape', y_train.shape)
print('Test - Target shape', y_test.shape)
Expected output should be
Train - Predictors shape (80, 1)
Test - Predictors shape (20, 1)
Train - Target shape (80, 1)
Test - Target shape (20, 1)
As output getting exception Exception: Data must be 1-dimensional

I think you need to call np.reshape on the underlying numpy array rather than on the Pandas series - you can do this using .values:
x_train = np.reshape(x_train.values, (-1, 1))
Repeat the same idea for the next three lines.
Or, if you are on a recent version of Pandas >= 0.24, to_numpy is preferred:
x_train = np.reshape(x_train.to_numpy(), (-1, 1))

numpy.squeeze() removes all dimensions of size 1 from a NumPy array.
x_train = numpy.squeeze(x_train)
Converts a (80,1) array to (80,)

Related

Residual plot for MultiOutputRegressor with yellowbrick

I am dealing with a multi-output regression problem and applied "MultiOutputRegressor" accompanied by "XGBRegressor" algorithms on the corresponding data.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi *
np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=400, test_size=200, random_state=4)
regr_multi = MultiOutputRegressor(XGBRegressor())
regr_multi.fit(X_train, y_train)
y_pred = regr_multi.predict(X_test)
What I would like to visualize is the residual of model prediction using ResidualPlot from yellowbrick package.
When I use the following code
from yellowbrick.regressor import ResidualsPlot
vis = ResidualsPlot(regr_multi)
vis.fit(X_train, y_train)
vis.score(X_test, y_test)
vis.show()
I faced with an error mentioned The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors were provided.
I was wondering that MultiOutput Residual plot is supported by yellowbriks or it is just an error that can be solved easily?

List object not callable in SVM

I'm trying to run this SVM using stratified K fold in Python,however I keep on getting the error like below
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, zero_one_loss, confusion_matrix
import pandas as pd
import numpy as np
z = pd.read_csv('/home/User/datasets/gtzan.csv', header=0)
X = z.iloc[:, :-1]
y = z.iloc[:, -1:]
X = np.array(X)
y = np.array(y)
# Performing standard scaling
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Defining the SVM with 'rbf' kernel
svc = SVC(kernel='rbf', C=100, random_state=50)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, shuffle=True)
skf = StratifiedKFold(n_splits=10, shuffle=True)
accuracy_score = []
#skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
X_train, X_test = X_scaled[train_index], X_scaled[test_index]
y_train, y_test = y[train_index], y[test_index]
# Training the model
svc.fit(X_train, np.ravel(y_train))
# Prediction on test dataste
y_pred = svc.predict(X_test)
# Obtaining the accuracy scores of the model
score = accuracy_score(y_test, y_pred)
accuracy_score.append(score)
# Print the accuarcy of the svm model
print('accuracy score: %0.3f' % np.mean(accuracy_score))
however, it gives me an error like below
Traceback (most recent call last):
File "/home/User/Test_SVM.py", line 55, in <module>
score = accuracy_score(y_test, y_pred)
TypeError: 'list' object is not callable
What makes this score list uncallable and how do I fix this error?
accuracy_scoreis a list in my code and I was also calling the same list as a function, which is overriding the existing functionality of function accuarcy_score. Changed the list name to acc_score which solved the problem.

saving polynomial model , doesn't save polynomial degree

How can I deal with polynomial degree when I want to save a polynomial model, sicne this info is not being saved!
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
"a": np.random.uniform(0.0, 1.0, 1000),
"b": np.random.uniform(10.0, 14.0, 1000),
"c": np.random.uniform(100.0, 1000.0, 1000)})
def data():
X_train, X_val, y_train, y_val = train_test_split(df.iloc[:, :2].values,
df.iloc[:, 2].values,
test_size=0.2,
random_state=1340)
return X_train, X_val, y_train, y_val
X_train, X_val, y_train, y_val = data()
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
poly_reg_model = LinearRegression().fit(X_poly, y_train)
poly_model = joblib.dump(poly_reg_model, 'themodel')
y_pred = poly_reg_model.predict(poly_reg.fit_transform(X_val))
themodel = joblib.load('themodel')
Now, if I try to predict:
themodel.predict(X_val), I am receiving:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 6 is different from 2)
I have to do:
pol_feat = PolynomialFeatures(degree=2)
themodel.predict(pol_feat.fit_transform(X_val))
in order to work.
So, how can i store this info in order to be able to use the model for prediction?
You have to pickle trained PolynomialFeatures also:
# train and pickle
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
poly_reg_model = LinearRegression().fit(X_poly, y_train)
joblib.dump(poly_reg_model, 'themodel')
joblib.dump(poly_reg, 'poilynomia_features_model')
# load and predict
poilynomia_features_model = joblib.load('poilynomia_features_model')
themodel = joblib.load('themodel')
X_val_prep = poilynomia_features_model.transform(X_val)
predictions = themodel.predict(X_val_prep)
But better will wrap all the steps in the single pipeline:
pipeline = Pipeline(steps=[('poilynomia', PolynomialFeatures()),
('lr', LinearRegression())])
pipeline.fit(X_train, y_train)
pipeline.predict(X_val)

Unable to calculate Model performance for Decision Tree Regressor

Although my code run fine on repl and did giving me results but it miserably fails on the Katacoda testing environment.
I am attaching the repl file here for your review as well, which also contains the question which is commented just above the code I have written.
Kindly review and let me know what mistakes I am making here.
Repl Link
https://repl.it/repls/WarmRobustOolanguage
Also sharing code below
Commented is Question Instructions
#Import two modules sklearn.datasets, and #sklearn.model_selection.
#Import numpy and set random seed to 100.
#Load popular Boston dataset from sklearn.datasets module #and assign it to variable boston.
#Split boston.data into two sets names X_train and X_test. #Also, split boston.target into two sets Y_train and Y_test.
#Hint: Use train_test_split method from #sklearn.model_selection; set random_state to 30.
#Print the shape of X_train dataset.
#Print the shape of X_test dataset.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(100)
max_depth = range(2, 6)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
#Import required module from sklearn.tree.
#Build a Decision tree Regressor model from X_train set and #Y_train labels, with default parameters. Name the model as #dt_reg.
#Evaluate the model accuracy on training data set and print #it's score.
#Evaluate the model accuracy on testing data set and print it's score.
#Predict the housing price for first two samples of X_test #set and print them.(Hint : Use predict() function)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', cross_val_score(dt_reg, X_train,Y_train, cv=10 ))
print('Accuracy of Test Data :', cross_val_score(dt_reg, X_test,Y_test, cv=10 ))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
#Fit multiple Decision tree regressors on X_train data and #Y_train labels with max_depth parameter value changing from #2 to 5.
#Evaluate each model accuracy on testing data set.
#Hint: Make use of for loop
#Print the max_depth value of the model with highest accuracy.
dt_reg = DecisionTreeRegressor()
random_grid = {'max_depth': max_depth}
dt_random = RandomizedSearchCV(estimator = dt_reg, param_distributions = random_grid,
n_iter = 90, cv = 3, verbose=2, random_state=42, n_jobs = -1)
dt_random.fit(X_train, Y_train)
dt_random.best_params_
def evaluate(model, test_features, test_labels):
predictions = model.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * np.mean(errors / test_labels)
accuracy = 100 - mape
print('Model Performance')
print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
print('Accuracy = {:0.2f}%.'.format(accuracy))
return accuracy
best_random = dt_random.best_estimator_
random_accuracy = evaluate(best_random, X_test,Y_test)
print("Accuracy Scores of the Model ",random_accuracy)
best_parameters = (dt_random.best_params_['max_depth']);
print(best_parameters)
The question is asking for default values. Try to remove random_state=1
Current Line:
dt_reg = DecisionTreeRegressor(random_state=1)
Update Line:
dt_reg = DecisionTreeRegressor()
I think it should Work!!!
# ================================================================================
# Machine Learning Using Scikit-Learn | 3 | Decision Trees ================================================================================
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(100)
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
boston = datasets.load_boston()
# print(boston)
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(boston.data, boston.target, random_state=30)
# Print the shape of X_train dataset
print(X_train.shape)
# Print the shape of X_test dataset.
print(X_test.shape)
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg
dt_Regressor = DecisionTreeRegressor()
dt_reg = dt_Regressor.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Get the max depth
maxdepth = 2
maxscore = 0
for x in range(2, 6):
dt_Regressor = DecisionTreeRegressor(max_depth=x)
dt_reg = dt_Regressor.fit(X_train, Y_train)
score = dt_reg.score(X_test, Y_test)
if(maxscore < score):
maxdepth = x
maxscore = score
print(maxdepth)

Cross-validation for paragraph-vector model

I just came across an error when trying to apply a cross-validation for a paragraph vector model:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import D2VTransformer
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(size=10, min_count=1, iter=5, seed=1)
clf = LogisticRegression(random_state=0)
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_train, y_train)
print("Score:", score) # This works
cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=3)
print("Cross-Validation:", cval) # This doesn't work
KeyError: 0
I experimented by replacing X_train in cross_val_score with model.transform(X_train) or model.fit_transform(X_train). Also, I tried the same with raw input data (data.text), instead of pre-processed text. I suspect that something must be wrong with the format of X_train for the cross-validation, as compared to the .score function for Pipeline, which works just fine. I also noted that the cross_val_score worked with CountVectorizer().
Does anyone spot the mistake?
No, this has nothing to do with transformation from model. Its related to cross_val_score.
cross_val_score will split the supplied data according the the cv param. For this, it will do something like this:
for train, test in splitter.split(X_train, y_train):
new_X_train, new_y_train = X_train[train], y_train[train]
But your X_train is a pandas.Series object in which the index based selection does not work like this. See this:https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position
Change this line:
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
to:
# Access the internal numpy array
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).values
OR
# Convert series to list
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).tolist()

Resources