sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples: [2552, 1] / Linear Regression - scikit-learn

I need assistance reshaping my input to match my output.
I wanted to create a model that vectorizes the 'All information' column and classifies it so that the label 'Fall' can be divided into 0 and 1.
However, I keep getting the error ValueError: Found input variables with inconsistent numbers of samples: [2552, 1].
The shapes look fine, but I don't know how to fix it.
## Linear Regression
import pandas as pd
import numpy as np
from tqdm import tqdm
#instance->fit->predict
from sklearn.linear_model import LinearRegression
model=LinearRegression(fit_intercept=True)
data=pd.read_csv("Fall_test_0826.csv", encoding='cp949', header=0)
data.head(2)
X=data.drop(["fall"], axis=1)
y= data.fall
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 0)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect=TfidfVectorizer()
tfidf_vect.fit(X_train)  # build the word dictionary
X_train_tfidf_vect = tfidf_vect.fit_transform(X_train['All information']).toarray()
X_test_tfidf_vect = tfidf_vect.transform(X_test)
lr_clf=LinearRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
from sklearn.metrics import accuracy_score
print('Logistic Regression _ {0:.3f}'.format(accuracy_score(y_test, pred)))
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-85-bec6ead862c8> in <module>
----> 1 print('{0:.3f}'.format(accuracy_score(y_test, pred)))
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
185
186 # Compute accuracy for each possible representation
--> 187 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
188 check_consistent_length(y_true, y_pred, sample_weight)
189 if y_type.startswith('multilabel'):
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
79 y_pred : array or indicator matrix
80 """
---> 81 check_consistent_length(y_true, y_pred)
82 type_true = type_of_target(y_true)
83 type_pred = type_of_target(y_pred)
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
254 uniques = np.unique(lengths)
255 if len(uniques) > 1:
--> 256 raise ValueError("Found input variables with inconsistent numbers of"
257 " samples: %r" % [int(l) for l in lengths])
258
ValueError: Found input variables with inconsistent numbers of samples: [2552, 1]

I think you have to change the line in your code from
X_test_tfidf_vect = tfidf_vect.transform(X_test)
to
X_test_tfidf_vect = tfidf_vect.transform(X_test['All information'])
But your approach is wrong. You are going for Linear Regression but trying to use a classification metric (accuracy_score) (Reference).
Doing so should lead to the error ValueError: Classification metrics can't handle a mix of binary and continuous targets.
So this will not work, because your array pred will hold float values, for example 0.5, but accuracy_score needs class labels as integers, for example 0, 1, 2 or 3 etc.
You need to use regression metrics instead to evaluate your Linear Regression.
Have a look at the available regression metrics here.
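For example, a minimal sketch of such an evaluation (assuming y_test and pred from the code above) could look like this:
from sklearn.metrics import mean_squared_error, r2_score

# regression metrics can handle continuous predictions
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print('MSE: {0:.3f}, R^2: {1:.3f}'.format(mse, r2))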

Related

ValueError: continuous is not supported for RandomForestRegressor

After I had preprocessed the weld data with a Pipeline, I was able to get clean data as output. Next, I need to pass the cleaned data through the model for training. Both the data preprocessing and model training steps can be further encapsulated in a Pipeline as follows:
from sklearn.ensemble import RandomForestRegressor

completed_pl = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestRegressor())
    ]
)
# training
completed_pl.fit(X_train, y_train)
# accuracy
y_train_pred = completed_pl.predict(X_train)
print(f"Accuracy on train: {accuracy_score(list(y_train), list(y_train_pred)):.2f}")
y_pred = completed_pl.predict(X_test)
print(f"Accuracy on test: {accuracy_score(list(y_test), list(y_pred)):.2f}")
I have used the load_boston dataset from sklearn.
And the error:
ValueError Traceback (most recent call last)
<ipython-input-86-d0b1928cf1a7> in <module>
12 # accuracy
13 y_train_pred = completed_pl.predict(X_train)
---> 14 print(f"Accuracy on train: {accuracy_score(list(y_train), list(y_train_pred)):.2f}")
15
16 y_pred = completed_pl.predict(X_test)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
102 # No metrics support "multiclass-multioutput" format
103 if y_type not in ["binary", "multiclass", "multilabel-indicator"]:
--> 104 raise ValueError("{0} is not supported".format(y_type))
105
106 if y_type in ["binary", "multiclass"]:
ValueError: continuous is not supported
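As in the answer above, accuracy_score is a classification metric and cannot handle the continuous predictions of a RandomForestRegressor. A minimal sketch of an evaluation with regression metrics instead (assuming the fitted completed_pl from above) might look like this:
from sklearn.metrics import mean_squared_error, r2_score

# score the regressor with regression metrics rather than accuracy
y_pred = completed_pl.predict(X_test)
print(f"MSE on test: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R^2 on test: {r2_score(y_test, y_pred):.2f}")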

ValueError: Object arrays cannot be loaded when allow_pickle=False for IMDB data for Keras

First I am using Tensorflow 1.15 and Keras 2.2.4.
I ran the following code in Jupyter Notebook:
from keras.datasets import imdb
from keras import preprocessing
max_features = 10000
maxlen = 20
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
and it gave me this error:
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
17465344/17464789 [==============================] - 8s 0us/step
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-609d113f6ed2> in <module>
6
7 (x_train, y_train), (x_test, y_test) = imdb.load_data(
----> 8 num_words=max_features)
9
10 x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
~\.conda\envs\tensorflow_env\lib\site-packages\keras\datasets\imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
57 file_hash='599dadb1135973df5b59232a0e9a887c')
58 with np.load(path) as f:
---> 59 x_train, labels_train = f['x_train'], f['y_train']
60 x_test, labels_test = f['x_test'], f['y_test']
61
~\.conda\envs\tensorflow_env\lib\site-packages\numpy\lib\npyio.py in __getitem__(self, key)
260 return format.read_array(bytes,
261 allow_pickle=self.allow_pickle,
--> 262 pickle_kwargs=self.pickle_kwargs)
263 else:
264 return self.zip.read(key)
~\.conda\envs\tensorflow_env\lib\site-packages\numpy\lib\format.py in read_array(fp, allow_pickle, pickle_kwargs)
720 # The array contained Python objects. We need to unpickle the data.
721 if not allow_pickle:
--> 722 raise ValueError("Object arrays cannot be loaded when "
723 "allow_pickle=False")
724 if pickle_kwargs is None:
ValueError: Object arrays cannot be loaded when allow_pickle=False
What is wrong with it? I took this code from the book "Deep Learning with Python".
Thanks
When NumPy loads arrays that contain Python objects, it needs allow_pickle set to True. You should set that manually:
np.load("path/to/numpy_file.npy", allow_pickle=True)
I think imdb.load_data uses np.load internally.
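Since these versions of Keras do not expose an allow_pickle argument, one common workaround (a sketch; upgrading Keras/NumPy so their defaults match again is the cleaner fix) is to temporarily wrap np.load around the call:
import numpy as np
from keras.datasets import imdb

# temporarily restore the old default so imdb.load_data can unpickle the object arrays
np_load_old = np.load
np.load = lambda *args, **kwargs: np_load_old(*args, allow_pickle=True, **kwargs)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

np.load = np_load_old  # undo the patch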

Can I use probabilistic label when train model in logistic regression?

I use sklearn.linear_model.LogisticRegression and would like to use probabilistic labels when training the model.
But with the following code I get an error when I attempt to train a logistic regression model on data with probability labels.
Is there any way to use probability labels for training a logistic regression model?
import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([1966, 1967, 1968, 1969, 1970,
              1971, 1972, 1973, 1974, 1975,
              1976, 1977, 1978, 1979, 1980,
              1981, 1982, 1983, 1984]).reshape(-1, 1)
y = np.array([0.003, 0.016, 0.054, 0.139, 0.263,
              0.423, 0.611, 0.758, 0.859, 0.903,
              0.937, 0.954, 0.978, 0.978, 0.982,
              0.985, 0.989, 0.988, 0.992])
lr = LogisticRegression()
lr.fit(x, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-6f0a54f18841> in <module>()
13
14 lr = LogisticRegression()
---> 15 lr.fit(x, y) # => ValueError: Unknown label type: 'continuous'
/home/sudot/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
1172 X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
1173 order="C")
-> 1174 check_classification_targets(y)
1175 self.classes_ = np.unique(y)
1176 n_samples, n_features = X.shape
/home/sudot/anaconda3/lib/python3.6/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Logistic Regression is a binary classification model, so you can't pass non-categorical values as targets.
Just round the values of y before fitting:
y = y.round(0)  # add this line
lr = LogisticRegression()
lr.fit(x, y)
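If you want to keep the probability information instead of rounding it away, one standard workaround (a sketch, not part of the original answer) is to duplicate each sample with both class labels and encode the probabilities as sample weights:
import numpy as np
from sklearn.linear_model import LogisticRegression

# each row appears twice: once as class 0 (weight 1 - p) and once as class 1 (weight p)
x2 = np.vstack([x, x])
y2 = np.concatenate([np.zeros(len(y)), np.ones(len(y))])
w2 = np.concatenate([1 - y, y])

lr = LogisticRegression()
lr.fit(x2, y2, sample_weight=w2)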

Explaining LSTM keras with Eli5 library

I'm trying to use Eli5 for explaining an LSTM keras model for time series prediction. The keras model receives as input an array with shape (nsamples, timesteps, nfeatures).
This is my code:
def baseline_model():
    model = Sequential()
    model.add(LSTM(32, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='logcosh', optimizer='adam')
    return model
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn import PermutationImportance
my_model = KerasRegressor(build_fn=baseline_model, nb_epoch=30, batch_size=32, verbose=False)
history = my_model.fit(X_train, y_train)
So far, everything is ok. The problem is that the following line launches an error when I execute it:
# X_train has a shape equal to (nsamples, timesteps, nfeatures) and y_train has a shape (nsamples)
perm = PermutationImportance(my_model, random_state=1).fit(X_train, y_train)
Error:
ValueError Traceback (most recent call last)
in ()
2 d2_train_dataset = X_train.reshape((nsamples, timesteps * features))
3
----> 4 perm = PermutationImportance(my_model, random_state=1).fit(X_train, y_train)
5 #eli5.show_weights(perm, feature_names = X.columns.tolist())
~/anaconda3/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in fit(self, X, y, groups, **fit_params)
183 self.estimator_.fit(X, y, **fit_params)
184
--> 185 X = check_array(X)
186
187 if self.cv not in (None, "prefit"):
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
568 if not allow_nd and array.ndim >= 3:
569 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 570 % (array.ndim, estimator_name))
571 if force_all_finite:
572 _assert_all_finite(array,
ValueError: Found array with dim 3. Estimator expected <= 2.
What can I do to fix this error? How can I use eli5 with my LSTM Keras Model?
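A common workaround (a sketch, assuming timesteps and nfeatures are defined as in the comments above) is to hand eli5 flattened 2D data, since check_array rejects arrays with more than two dimensions, and let the model itself restore the 3D shape with a Reshape layer:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Reshape
from keras.wrappers.scikit_learn import KerasRegressor
from eli5.sklearn import PermutationImportance

def baseline_model_2d():
    model = Sequential()
    # restore the (timesteps, nfeatures) shape that the LSTM expects
    model.add(Reshape((timesteps, nfeatures), input_shape=(timesteps * nfeatures,)))
    model.add(LSTM(32))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='logcosh', optimizer='adam')
    return model

# flatten the 3D array so PermutationImportance only ever sees 2D data
X_train_2d = X_train.reshape((X_train.shape[0], timesteps * nfeatures))
my_model = KerasRegressor(build_fn=baseline_model_2d, epochs=30, batch_size=32, verbose=False)
my_model.fit(X_train_2d, y_train)
perm = PermutationImportance(my_model, random_state=1).fit(X_train_2d, y_train)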

Using GridSearchCV for NLP Missing Positional Argument Self

I am working on an NLP problem. I've been testing various models and the process has been working fine.
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier().fit(X_train_tfidf, y_train)
y_predicted_tfidf = classifier.predict(X_test_tfidf)
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_predicted_tfidf, pos_label=None,average='weighted')
print(precision)
>>> 0.79708294305
Now I am trying to employ grid search in order to fine-tune parameters, and I am running into an error.
from sklearn.model_selection import GridSearchCV
parameters = {'alpha': [0.00001, 0.0001, 0.001, 0.001, 0.01] }
gs_classifier = GridSearchCV(SGDClassifier, parameters, n_jobs=-1)
gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)
Which results in the following output:
TypeError Traceback (most recent call last)
<ipython-input-25-95b85f78662f> in <module>()
1 gs_classifier = GridSearchCV(SGDClassifier, parameters, n_jobs=-1)
----> 2 gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)
anaconda/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
943 train/test set.
944 """
--> 945 return self._fit(X, y, groups,
...
/anaconda/lib/python3.6/site-packages/sklearn/base.py in clone(estimator, safe)
65 % (repr(estimator), type(estimator)))
66 klass = estimator.__class__
---> 67 new_object_params = estimator.get_params(deep=False)
68 for name, param in six.iteritems(new_object_params):
69 new_object_params[name] = clone(param, safe=False)
TypeError: get_params() missing 1 required positional argument: 'self'
I've tried various combinations of parameters and all result in the same error. For this example I've kept it simple and am just using a range of alpha values.
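The traceback points at the likely cause: clone() calls estimator.get_params(), which fails with "missing 1 required positional argument: 'self'" because GridSearchCV was given the SGDClassifier class rather than an instance. A minimal sketch of the fix is to instantiate the estimator:
gs_classifier = GridSearchCV(SGDClassifier(), parameters, n_jobs=-1)
gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)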
