TypeError: len() of unsized object - scikit-learn

I am trying the random forest classifier from sklearn. When I try to print the classification report, it gives me an error.
This was the code:
randomforestmodel = RandomForestClassifier()
randomforestmodel.fit(train_vectors, data_train['label'])
predict_rfmodel = randomforestmodel.predict(test_vectors)
print("classification with randomforest")
print(metrics.classification_report(test_vectors, predict_rfmodel))
And this was the error:
classification with randomforest
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-f976cec884e4> in <module>()
1 print("classification with randomforest")
----> 2 print(metrics.classification_report(test_vectors, predict_rfmodel))
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict, zero_division)
2108 """
2109
-> 2110 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
2111
2112 if labels is None:
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
83 """
84 check_consistent_length(y_true, y_pred)
---> 85 type_true = type_of_target(y_true)
86 type_pred = type_of_target(y_pred)
87
/usr/local/lib/python3.7/dist-packages/sklearn/utils/multiclass.py in type_of_target(y)
308
309 # Invalid inputs
--> 310 if y.ndim > 2 or (y.dtype == object and len(y) and not isinstance(y.flat[0], str)):
311 return "unknown" # [[[1, 2]]] or [obj_1] and not ["label_1"]
312
TypeError: len() of unsized object

You're passing the test features (test_vectors) to classification_report instead of the true test labels.
As per the documentation, the first parameter should be:
y_true: 1d array-like, or label indicator array / sparse matrix.
Ground truth (correct) target values.
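As a minimal sketch of the corrected call, the example below trains a random forest on synthetic data and passes the true test labels as the first argument. The dataset and the names X_test, y_test and y_pred are assumptions for illustration; they are not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized text features and labels (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# First argument: true labels of the test set. Second: predicted labels.
report = classification_report(y_test, y_pred)
print(report)
```

In the question's code this would presumably mean passing the label column of the test set (something like data_test['label'], a hypothetical name) instead of test_vectors.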

Related

sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples: [2552, 1] /Linear Regression

I need assistance reshaping my input to match my output.
I want to create a model that vectorizes the 'All information' column and classifies it so that the label 'fall' is predicted as 0 or 1.
However, I keep getting the error [ValueError: Found input variables with inconsistent numbers of samples: [2552, 1]].
The shapes look fine, but I don't know how to fix it.
## Linear Regression
import pandas as pd
import numpy as np
from tqdm import tqdm
#instance->fit->predict
from sklearn.linear_model import LinearRegression
model=LinearRegression(fit_intercept=True)
data=pd.read_csv("Fall_test_0826.csv", encoding='cp949', header=0)
data.head(2)
X=data.drop(["fall"], axis=1)
y= data.fall
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 0)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect=TfidfVectorizer()
tfidf_vect.fit(X_train)  # build the vocabulary
X_train_tfidf_vect = tfidf_vect.fit_transform(X_train['All information']).toarray()
X_test_tfidf_vect = tfidf_vect.transform(X_test)
lr_clf=LinearRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
from sklearn.metrics import accuracy_score
print('Logisitic Regression _ {0:.3f}'.format(accuracy_score(y_test, pred)))
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-85-bec6ead862c8> in <module>
----> 1 print('{0:.3f}'.format(accuracy_score(y_test, pred)))
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
185
186 # Compute accuracy for each possible representation
--> 187 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
188 check_consistent_length(y_true, y_pred, sample_weight)
189 if y_type.startswith('multilabel'):
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
79 y_pred : array or indicator matrix
80 """
---> 81 check_consistent_length(y_true, y_pred)
82 type_true = type_of_target(y_true)
83 type_pred = type_of_target(y_pred)
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
254 uniques = np.unique(lengths)
255 if len(uniques) > 1:
--> 256 raise ValueError("Found input variables with inconsistent numbers of"
257 " samples: %r" % [int(l) for l in lengths])
258
ValueError: Found input variables with inconsistent numbers of samples: [2552, 1]
I think you need to change this line in your code from
X_test_tfidf_vect = tfidf_vect.transform(X_test)
to
X_test_tfidf_vect = tfidf_vect.transform(X_test['All information'])
Passing the whole DataFrame makes the vectorizer iterate over its column names rather than its rows, which would explain why one of the lengths in the error is 1.
But your approach is wrong. You are fitting a Linear Regression but trying to evaluate it with a classification metric (accuracy_score).
Once the sample counts match, doing so should lead to the error ValueError: Classification metrics can't handle a mix of binary and continuous targets
This will not work because your array pred holds float values (for example 0.5), whereas accuracy_score needs class labels (for example 0, 1, 2 or 3).
You need to use regression metrics instead to evaluate your Linear Regression.
Have a look at the regression metrics available in sklearn.metrics.
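A minimal sketch of the difference, on synthetic data (the data and variable names are assumptions for illustration): the regression output is continuous, so it is scored with regression metrics. Thresholding the output at 0.5 to recover class labels is shown only as one possible workaround and is not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Synthetic binary target driven by the first feature (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)  # continuous floats, not class labels

# Regression metrics accept continuous predictions directly.
mae = mean_absolute_error(y, pred)
mse = mean_squared_error(y, pred)

# Workaround: turn the floats into 0/1 labels before using accuracy_score.
pred_labels = (pred >= 0.5).astype(int)
acc = accuracy_score(y, pred_labels)

print(mae, mse, acc)
```

If the goal is really classification into 0 and 1, a classifier such as LogisticRegression is usually a more natural fit than thresholding a linear regression.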

how to do reshape in custom function in keras

I'm trying to do a reshape inside a custom function in TensorFlow Keras.
I'm trying to implement the following kind of loss function as a custom loss function:
# Since WRMSSE is calculated for each store, we have 3049 rows and 9180 time series
# Function to do quick rollups:
def rollup_nn(v):
    '''
    v - np.array of size (3049 rows, n day columns)
    v_rolledup - array of size (n, 9180)
    '''
    return roll_mat_csr*v #(v.T*roll_mat_csr.T).T
# Function to calculate WRMSSE:
key = 0
def wrmsse_nn(preds, y_true):
    '''
    preds - Predictions: pd.DataFrame of size (3049 rows, N day columns)
    y_true - True values: pd.DataFrame of size (3049 rows, N day columns)
    sequence_length - np.array of size (9180,)
    sales_weight - sales weights based on last 28 days: np.array (9180,)
    '''
    preds = preds[-(3049 * 28):]
    y_true = y_true.get_label()[-(3049 * 28):]
    preds = preds.reshape(28, 3049).T
    y_true = y_true.reshape(28, 3049).T
    return 'wrmsse', np.sum(np.sqrt(np.mean(np.square(rollup(preds-y_true)),axis=1)) * SW_store)/12,False
where I need to do the reshape inside the custom loss function.
I'm doing the reshape with the following code:
tf.reshape(preds,[28, 3049])
I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'get_shape'
The complete error message is:
Tensor("dense_23_target:0", shape=(?, ?), dtype=float32) Tensor("dense_23_1/BiasAdd:0", shape=(?, 1), dtype=float32)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-101-05dfd1dadcca> in <module>()
7 # model.add(Dense(units=16,activation='relu',kernel_initializer=initializer.he_normal(seed=0)))
8 model.add(Dense(units=1))
----> 9 model.compile(loss=wrmsse_nn,optimizer='adam')
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/checkpointable/base.py in _method_wrapper(self, *args, **kwargs)
440 self._setattr_tracking = False # pylint: disable=protected-access
441 try:
--> 442 method(self, *args, **kwargs)
443 finally:
444 self._setattr_tracking = previous_value # pylint: disable=protected-access
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in compile(self, optimizer, loss, metrics, loss_weights, sample_weight_mode, weighted_metrics, target_tensors, distribute, **kwargs)
447 else:
448 weighted_loss = training_utils.weighted_masked_objective(loss_fn)
--> 449 output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
450
451 if len(self.outputs) > 1:
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_utils.py in weighted(y_true, y_pred, weights, mask)
661 # Update dimensions of weights to match with values if possible.
662 score_array, _, weights = squeeze_or_expand_dimensions(
--> 663 score_array, None, weights)
664 try:
665 # Broadcast weights if possible.
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/utils/losses_utils.py in squeeze_or_expand_dimensions(y_pred, y_true, sample_weight)
66 return y_pred, y_true, sample_weight
67
---> 68 y_pred_shape = y_pred.get_shape()
69 y_pred_rank = y_pred_shape.ndims
70 if (y_pred_rank is not None) and (weights_rank is not None):
AttributeError: 'NoneType' object has no attribute 'get_shape'
How can I do this?

The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1

This is my first time working on text classification. I am working on binary text classification with CamemBert using the fast-bert library, which is mostly inspired by fastai.
When I run the code below:
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='camembert-base',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=8,
                          max_seq_length=512,
                          multi_gpu=multi_gpu,
                          multi_label=False,
                          model_type='camembert-base')
learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='camembert-base', #'/content/drive/My Drive/model/model_out'
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=OUTPUT_DIR,
    finetuned_wgts_path=None, #WGTS_PATH
    warmup_steps=300,
    multi_gpu=multi_gpu,
    is_fp16=True,
    multi_label=False,
    logging_steps=50)
learner.fit(epochs=10,
            lr=9e-5,
            validate=True,
            schedule_type="warmup_cosine",
            optimizer_type="adamw")
Everything works fine until training.
I get this error message when I try to train my model:
RuntimeError Traceback (most recent call last)
<ipython-input-13-9b5c6ad7c8f0> in <module>()
3 validate=True,
4 schedule_type="warmup_cosine",
----> 5 optimizer_type="adamw")
2 frames
/usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, return_results, schedule_type, optimizer_type)
421 # Evaluate the model against validation set after every epoch
422 if validate:
--> 423 results = self.validate()
424 for key, value in results.items():
425 self.logger.info(
/usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in validate(self, quiet, loss_only)
515 for metric in self.metrics:
516 validation_scores[metric["name"]] = metric["function"](
--> 517 all_logits, all_labels
518 )
519 results.update(validation_scores)
/usr/local/lib/python3.6/dist-packages/fast_bert/metrics.py in fbeta(y_pred, y_true, thresh, beta, eps, sigmoid)
56 y_pred = (y_pred > thresh).float()
57 y_true = y_true.float()
---> 58 TP = (y_pred * y_true).sum(dim=1)
59 prec = TP / (y_pred.sum(dim=1) + eps)
60 rec = TP / (y_true.sum(dim=1) + eps)
RuntimeError: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1
How can I fix this?
Thanks
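fbeta doesn't work for binary (single-label) classification here. The traceback shows it multiplying y_pred by y_true elementwise, which needs both tensors to have the same shape, but the predictions have 2 columns while the labels appear to be a flat vector of 39 values. Using only accuracy solved this.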
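The shape mismatch behind this error can be reproduced outside fast-bert. The sketch below uses NumPy rather than PyTorch, and the shapes (39 samples, 2 classes) are inferred from the error message, so treat it as an illustration rather than fast-bert's own code. It also shows how an accuracy-style metric sidesteps the problem by taking the argmax over the class dimension first.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(39, 2))      # one score per class for each sample
labels = rng.integers(0, 2, size=39)   # one class index per sample

# Elementwise product of (39, 2) and (39,) cannot be broadcast,
# which mirrors the failure in the multi-label style fbeta.
mismatch = None
try:
    logits * labels
except ValueError as exc:
    mismatch = str(exc)

# Accuracy reduces the class dimension first, so shapes line up.
predicted = logits.argmax(axis=1)
accuracy = float((predicted == labels).mean())

print(mismatch)
print(accuracy)
```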

How to use SHAP with a linear SVC model from sklearn using Pipeline?

I am doing text classification using a linear SVC model from sklearn. Now I want to visualize which words/tokens have the highest impact on the classification decision by using SHAP (https://github.com/slundberg/shap).
Right now this does not work: I am getting an error that seems to originate from the vectorizer step in the pipeline I have defined. What is wrong here?
Is my general approach to using SHAP in this case correct?
x_Train, x_Test, y_Train, y_Test = train_test_split(df_all['PDFText'], df_all['class'], test_size = 0.2, random_state = 1234)
pipeline = Pipeline([
    (
        'tfidv',
        TfidfVectorizer(
            ngram_range=(1,3),
            analyzer='word',
            strip_accents = ascii,
            use_idf = True,
            sublinear_tf=True,
            max_features=6000,
            min_df=2,
            max_df=1.0
        )
    ),
    (
        'lin_svc',
        svm.SVC(
            C=1.0,
            probability=True,
            kernel='linear'
        )
    )
])
pipeline.fit(x_Train, y_Train)
shap.initjs()
explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
shap_values = explainer.shap_values(x_Test, nsamples=100)
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], x_Test.iloc[0,:])
This is the error message I get:
Provided model function fails when applied to the provided data set.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-81-4bca63616b3b> in <module>
3
4 # use Kernel SHAP to explain test set predictions
----> 5 explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
6 shap_values = explainer.shap_values(x_Test, nsamples=100)
7
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\explainers\kernel.py in __init__(self, model, data, link, **kwargs)
95 self.keep_index_ordered = kwargs.get("keep_index_ordered", False)
96 self.data = convert_to_data(data, keep_index=self.keep_index)
---> 97 model_null = match_model_to_data(self.model, self.data)
98
99 # enforce our current input type limitations
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\shap\common.py in match_model_to_data(model, data)
80 out_val = model.f(data.convert_to_df())
81 else:
---> 82 out_val = model.f(data.data)
83 except:
84 print("Provided model function fails when applied to the provided data set.")
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\pipeline.py in predict_proba(self, X)
379 for name, transform in self.steps[:-1]:
380 if transform is not None:
--> 381 Xt = transform.transform(Xt)
382 return self.steps[-1][-1].predict_proba(Xt)
383
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents, copy)
1631 check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
1632
-> 1633 X = super(TfidfVectorizer, self).transform(raw_documents)
1634 return self._tfidf.transform(X, copy=False)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
c:\users\s.p\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
254
255 if self.lowercase:
--> 256 return lambda x: strip_accents(x.lower())
257 else:
258 return strip_accents
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
KernelExplainer expects to receive a classification model (its prediction function) as the first argument, not the whole Pipeline including the vectorizer. Please check how Pipeline is meant to be used with SHAP.
In your case, you can use the Pipeline as follows. The vectorizer has already been fitted by pipeline.fit, so transform is enough:
x_Train = pipeline.named_steps['tfidv'].transform(x_Train)
explainer = shap.KernelExplainer(pipeline.named_steps['lin_svc'].predict_proba, x_Train)
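The traceback ends in 'numpy.ndarray' object has no attribute 'lower', which suggests the explainer hands numeric arrays to the model function while the vectorizer step expects raw strings. The sketch below, using only scikit-learn and a toy corpus (both the corpus and the variable names are assumptions), shows the split the answer relies on: vectorize once, then call the classifier step directly on the numeric matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the PDF texts (assumption).
texts = [
    "invoice payment due", "payment reminder invoice", "invoice amount overdue",
    "late payment fee", "invoice total payment", "payment received invoice",
    "meeting agenda notes", "project meeting schedule", "agenda for team meeting",
    "meeting notes summary", "team schedule agenda", "project notes meeting",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidv", TfidfVectorizer()),
    ("lin_svc", SVC(kernel="linear", probability=True, random_state=0)),
])
pipeline.fit(texts, labels)

# Separate the fitted steps: text -> numeric features, features -> probabilities.
vectorizer = pipeline.named_steps["tfidv"]
classifier = pipeline.named_steps["lin_svc"]

X_numeric = vectorizer.transform(texts).toarray()  # dense numeric matrix
proba = classifier.predict_proba(X_numeric)        # works on numeric input

print(proba.shape)
```

A function like classifier.predict_proba together with a numeric matrix like X_numeric is the kind of pair the answer passes to shap.KernelExplainer; the SHAP call itself is left out of this sketch.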

DNN Linear Regression. MAE measurement error

I am trying to implement MAE as a performance measurement for my DNN regression model. I am using a DNN to predict the number of comments a Facebook post will get. As I understand it, if it is a classification problem, we use accuracy. If it is a regression problem, we use either RMSE or MAE. My code is the following:
with tf.name_scope("eval"):
    correct = tf.metrics.mean_absolute_error(labels = y, predictions = logits)
    mae = tf.reduce_mean(tf.cast(correct, tf.int64))
    mae_summary = tf.summary.scalar('mae', accuracy)
For some reason, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-396-313ddf858626> in <module>()
1 with tf.name_scope("eval"):
----> 2 correct = tf.metrics.mean_absolute_error(labels = y, predictions = logits)
3 mae = tf.reduce_mean(tf.cast(correct, tf.int64))
4 mae_summary = tf.summary.scalar('mae', accuracy)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in mean_absolute_error(labels, predictions, weights, metrics_collections, updates_collections, name)
736 predictions, labels, weights = _remove_squeezable_dimensions(
737 predictions=predictions, labels=labels, weights=weights)
--> 738 absolute_errors = math_ops.abs(predictions - labels)
739 return mean(absolute_errors, weights, metrics_collections,
740 updates_collections, name or 'mean_absolute_error')
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py in binary_op_wrapper(x, y)
883 if not isinstance(y, sparse_tensor.SparseTensor):
884 try:
--> 885 y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
886 except TypeError:
887 # If the RHS is not a tensor, it might be a tensor aware object
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
834 name=name,
835 preferred_dtype=preferred_dtype,
--> 836 as_ref=False)
837
838
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx)
924
925 if ret is None:
--> 926 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
927
928 if ret is NotImplemented:
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in _TensorTensorConversionFunction(t, dtype, name, as_ref)
772 raise ValueError(
773 "Tensor conversion requested dtype %s for Tensor with dtype %s: %r" %
--> 774 (dtype.name, t.dtype.name, str(t)))
775 return t
776
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int64: 'Tensor("eval_9/remove_squeezable_dimensions/cond_1/Merge:0", dtype=int64)'
This line in your code:
correct = tf.metrics.mean_absolute_error(labels = y, predictions = logits)
makes TensorFlow first subtract the labels from the predictions, as seen in the traceback:
absolute_errors = math_ops.abs(predictions - labels)
In order to do the subtraction, the two tensors need to have the same datatype. Presumably your predictions (logits) are float32, and according to the error message your labels are int64. You have to do either an explicit conversion with tf.to_float (or tf.cast(y, tf.float32), which replaces the deprecated tf.to_float) or the implicit one you suggest in your comment: define the placeholder as float32 to start with, and trust TensorFlow to do the conversion when the feed dictionary is processed.
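A minimal sketch of the explicit conversion, assuming TensorFlow 2.x with eager execution (the question itself uses the 1.x graph API, so the surrounding code would differ). The tensors and values are made up for illustration, and the MAE is computed by hand rather than with tf.metrics.

```python
import tensorflow as tf

# Integer labels and float predictions, as in the error message (values are made up).
labels = tf.constant([3, 0, 7, 2], dtype=tf.int64)
predictions = tf.constant([2.5, 0.5, 6.0, 2.0], dtype=tf.float32)

# Explicit cast so both operands of the subtraction are float32.
labels_f = tf.cast(labels, tf.float32)

# Mean absolute error computed directly from the definition.
mae = tf.reduce_mean(tf.abs(predictions - labels_f))
print(float(mae))
```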

Resources