ValueError: continuous is not supported for RandomForestRegressor - python-3.x

After preprocessing the weld data with a Pipeline, I was able to get clean data as output. Next, I need to pass the cleaned data through the model for training. Both the data-preprocessing and model-training steps can be encapsulated in a single Pipeline as follows:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score

completed_pl = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestRegressor())
    ]
)
# training
completed_pl.fit(X_train, y_train)
# accuracy
y_train_pred = completed_pl.predict(X_train)
print(f"Accuracy on train: {accuracy_score(list(y_train), list(y_train_pred)):.2f}")
y_pred = completed_pl.predict(X_test)
print(f"Accuracy on test: {accuracy_score(list(y_test), list(y_pred)):.2f}")
I have used the load_boston dataset from sklearn.
And the error:
ValueError Traceback (most recent call last)
<ipython-input-86-d0b1928cf1a7> in <module>
12 # accuracy
13 y_train_pred = completed_pl.predict(X_train)
---> 14 print(f"Accuracy on train: {accuracy_score(list(y_train), list(y_train_pred)):.2f}")
15
16 y_pred = completed_pl.predict(X_test)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
102 # No metrics support "multiclass-multioutput" format
103 if y_type not in ["binary", "multiclass", "multilabel-indicator"]:
--> 104 raise ValueError("{0} is not supported".format(y_type))
105
106 if y_type in ["binary", "multiclass"]:
ValueError: continuous is not supported
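The cause is visible in the traceback: accuracy_score is a classification metric, and RandomForestRegressor predicts continuous values, which sklearn's _check_targets rejects. A minimal sketch of a fix, reusing the fitted pipeline from the question: evaluate with regression metrics instead.
from sklearn.metrics import mean_squared_error, r2_score

# regression metrics accept continuous predictions
y_train_pred = completed_pl.predict(X_train)
print(f"R^2 on train: {r2_score(y_train, y_train_pred):.2f}")

y_pred = completed_pl.predict(X_test)
print(f"R^2 on test: {r2_score(y_test, y_pred):.2f}")
print(f"MSE on test: {mean_squared_error(y_test, y_pred):.2f}")
If the task really is classification, swap RandomForestRegressor for RandomForestClassifier and keep accuracy_score.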

Related

Huggingface trainer.train() throws "IndexError: Target -1 is out of bounds" during Slovak sentences sentiment analysis using SlovakBert

My goal is to train a classifier able to do sentiment analysis in the Slovak language, using the loaded SlovakBert model and the HuggingFace library.
Code is executed on Google Colaboratory.
My dataset is read from one CSV file for testing:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
and one CSV file for training:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv
The data has two columns: a column of sentences and a second column of labels indicating the sentiment of the sentence. Labels have the values -1, 0, or 1.
After execution of trainer.train(), there is an error:
Num examples = 89
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 36
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-11-44bfec8c5f70> in <module>()
40 )
41 #Then fine-tune your model by calling train():
---> 42 trainer.train()
43
44 trainer.evaluate()
7 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
2994 if size_average is not None or reduce is not None:
2995 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2996 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
2997
2998
IndexError: Target -1 is out of bounds.
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq

import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
tokenizer = AutoTokenizer.from_pretrained(model_name)  # needed by tokenize_function and the data collator

# Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test}, on_bad_lines='skip', column_names=["text", "label"], delimiter=",")

# Prepare dataset
def tokenize_function(examples):
    return tokenizer(examples['text'])

tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Metrics
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Create a Trainer with the model, training arguments (checkpoints go to output_dir), datasets, and evaluation function
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

# Then fine-tune the model by calling train():
trainer.train()
trainer.evaluate()
What is the reason for this error, and how can it be solved?
Edit: I tried to change the line:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
to:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3, id2label={"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1})
but the same error is thrown.
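The likely cause: PyTorch's cross-entropy loss expects class indices in the range [0, num_labels), so the label -1 is out of bounds. A minimal sketch of a fix, assuming the dataset loaded in the question: remap the labels to 0, 1, 2 before training.
def remap_labels(example):
    # shift -1/0/1 to 0/1/2 so every label is a valid class index
    example["label"] = example["label"] + 1
    return example

dataset = dataset.map(remap_labels)
Note also that id2label maps integer ids to human-readable names, not the other way round, so a correct mapping would look like id2label={0: "negative", 1: "neutral", 2: "positive"}; it only affects display, not which label values the loss accepts.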

sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples: [2552, 1] /Linear Regression

I need assistance reshaping my input to match my output.
I wanted to create a model that vectorizes and classifies the 'All information' column so that the label 'Fall' can be divided into 0 and 1.
However, I keep getting the error ValueError: Found input variables with inconsistent numbers of samples: [2552, 1].
The shape looks fine, but I don't know how to fix it.
## Linear Regression
import pandas as pd
import numpy as np
from tqdm import tqdm

# instance -> fit -> predict
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

data = pd.read_csv("Fall_test_0826.csv", encoding='cp949', header=0)
data.head(2)

X = data.drop(["fall"], axis=1)
y = data.fall

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)  # build the vocabulary
X_train_tfidf_vect = tfidf_vect.fit_transform(X_train['All information']).toarray()
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr_clf = LinearRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

from sklearn.metrics import accuracy_score
print('Logistic Regression _ {0:.3f}'.format(accuracy_score(y_test, pred)))
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-85-bec6ead862c8> in <module>
----> 1 print('{0:.3f}'.format(accuracy_score(y_test, pred)))
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
185
186 # Compute accuracy for each possible representation
--> 187 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
188 check_consistent_length(y_true, y_pred, sample_weight)
189 if y_type.startswith('multilabel'):
~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
79 y_pred : array or indicator matrix
80 """
---> 81 check_consistent_length(y_true, y_pred)
82 type_true = type_of_target(y_true)
83 type_pred = type_of_target(y_pred)
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
254 uniques = np.unique(lengths)
255 if len(uniques) > 1:
--> 256 raise ValueError("Found input variables with inconsistent numbers of"
257 " samples: %r" % [int(l) for l in lengths])
258
ValueError: Found input variables with inconsistent numbers of samples: [2552, 1]
I think you have to change the line in your code from
X_test_tfidf_vect = tfidf_vect.transform(X_test)
to
X_test_tfidf_vect = tfidf_vect.transform(X_test['All information'])
But your approach is wrong: you are going for Linear Regression but trying to use a classification metric (accuracy_score).
Doing so should lead to the error ValueError: Classification metrics can't handle a mix of binary and continuous targets.
So this will not work, because your array pred will hold float values, for example 0.5, but for accuracy_score you need class labels as integers, for example 0, 1, 2 or 3.
You need to use regression metrics instead to evaluate your Linear Regression.
Have a look at the available regression metrics in the sklearn documentation.
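For example, a minimal sketch reusing y_test and pred from the question:
from sklearn.metrics import mean_squared_error, r2_score

# regression metrics compare continuous predictions to continuous targets
print('MSE: {0:.3f}'.format(mean_squared_error(y_test, pred)))
print('R^2: {0:.3f}'.format(r2_score(y_test, pred)))
If you actually want the 0/1 labels predicted directly, use a classifier such as LogisticRegression instead of LinearRegression; then accuracy_score applies.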

LabelEncoder instance is not fitted yet

I have code for predicting unseen data in a sentence classification task.
The code is:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder

maxlen = 1152

### PREDICT NEW UNSEEN DATA ###
tokenizer = Tokenizer()
label_enc = LabelEncoder()

X_test = ['this is boring', 'wow i like this you did a great job']
X_test = tokenizer.texts_to_sequences(X_test)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

a = (model.predict(X_test)>0.5).astype(int).ravel()
print(a)
reverse_pred = label_enc.inverse_transform(a.ravel())
print(reverse_pred)
But I am getting this error
[1 1]
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-33-7e12dbe8aec1> in <module>()
39 print(a)
40
---> 41 reverse_pred = label_enc.inverse_transform(a.ravel())
42 print(reverse_pred)
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
965
966 if not attrs:
--> 967 raise NotFittedError(msg % {'name': type(estimator).__name__})
968
969
NotFittedError: This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
I have used a Sequential model, and model.fit is written as history = model.fit(...) in the training part. Why am I getting this error?
Following the sklearn documentation and what is reported here, you simply have to fit your encoder before calling inverse_transform:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# fit the encoder on the known labels first
y = ['positive','negative','positive','negative','positive','negative']
label_enc = LabelEncoder()
label_enc.fit(y)

# simulate binary model predictions, then map them back to label names
model_predictions = np.random.uniform(0, 1, 3)
model_predictions = (model_predictions>0.5).astype(int).ravel()
model_predictions = label_enc.inverse_transform(model_predictions)
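More importantly, at prediction time the encoder should be the one that was fitted on the training labels, not a freshly created instance. A sketch of one way to do that, assuming the hypothetical file name label_enc.joblib: persist the fitted encoder with joblib after training and load it back before inverse-transforming.
import joblib

# after training: save the fitted encoder (file name is just an example)
joblib.dump(label_enc, 'label_enc.joblib')

# at prediction time: load the already-fitted encoder instead of creating a new one
label_enc = joblib.load('label_enc.joblib')
reverse_pred = label_enc.inverse_transform(a.ravel())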

ValueError: Object arrays cannot be loaded when allow_pickle=False for IMDB data for Keras

First, I am using TensorFlow 1.15 and Keras 2.2.4.
I ran the following code in Jupyter Notebook:
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000
maxlen = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
and it gave me this error:
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
17465344/17464789 [==============================] - 8s 0us/step
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-609d113f6ed2> in <module>
6
7 (x_train, y_train), (x_test, y_test) = imdb.load_data(
----> 8 num_words=max_features)
9
10 x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
~\.conda\envs\tensorflow_env\lib\site-packages\keras\datasets\imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
57 file_hash='599dadb1135973df5b59232a0e9a887c')
58 with np.load(path) as f:
---> 59 x_train, labels_train = f['x_train'], f['y_train']
60 x_test, labels_test = f['x_test'], f['y_test']
61
~\.conda\envs\tensorflow_env\lib\site-packages\numpy\lib\npyio.py in __getitem__(self, key)
260 return format.read_array(bytes,
261 allow_pickle=self.allow_pickle,
--> 262 pickle_kwargs=self.pickle_kwargs)
263 else:
264 return self.zip.read(key)
~\.conda\envs\tensorflow_env\lib\site-packages\numpy\lib\format.py in read_array(fp, allow_pickle, pickle_kwargs)
720 # The array contained Python objects. We need to unpickle the data.
721 if not allow_pickle:
--> 722 raise ValueError("Object arrays cannot be loaded when "
723 "allow_pickle=False")
724 if pickle_kwargs is None:
ValueError: Object arrays cannot be loaded when allow_pickle=False
What is wrong with it? I took this code from the "Deep Learning with Python" book.
Thanks
When numpy loads arrays that contain Python objects, it needs allow_pickle set to True. You should set that manually:
np.load("path to numpy file.npy", allow_pickle=True)
I think imdb.load_data uses np.load internally.
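A minimal sketch of a common workaround, assuming nothing beyond the question's setup: temporarily make allow_pickle=True the default for np.load while the dataset loads, then restore the original function.
import functools
import numpy as np
from keras.datasets import imdb

# temporarily default np.load to allow_pickle=True
orig_np_load = np.load
np.load = functools.partial(orig_np_load, allow_pickle=True)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# restore the original np.load
np.load = orig_np_load
Alternatively, pinning numpy below 1.16.3 (the release that changed the allow_pickle default to False) avoids the error, but patching np.load around the one call is less invasive.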

Using GridSearchCV for NLP Missing Positional Argument Self

I am working on an NLP problem. I've been testing various models, and the process has been working fine.
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier().fit(X_train_tfidf, y_train)
y_predicted_tfidf = classifier.predict(X_test_tfidf)
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_predicted_tfidf, pos_label=None, average='weighted')
print(precision)
>>> 0.79708294305
Now I am trying to employ grid search to tune the parameters, and I am running into an error.
from sklearn.model_selection import GridSearchCV
parameters = {'alpha': [0.00001, 0.0001, 0.001, 0.001, 0.01] }
gs_classifier = GridSearchCV(SGDClassifier, parameters, n_jobs=-1)
gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)
Which results in the following output:
TypeError Traceback (most recent call last)
<ipython-input-25-95b85f78662f> in <module>()
1 gs_classifier = GridSearchCV(SGDClassifier, parameters, n_jobs=-1)
----> 2 gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)
anaconda/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
943 train/test set.
944 """
--> 945 return self._fit(X, y, groups,
...
/anaconda/lib/python3.6/site-packages/sklearn/base.py in clone(estimator, safe)
65 % (repr(estimator), type(estimator)))
66 klass = estimator.__class__
---> 67 new_object_params = estimator.get_params(deep=False)
68 for name, param in six.iteritems(new_object_params):
69 new_object_params[name] = clone(param, safe=False)
TypeError: get_params() missing 1 required positional argument: 'self'
I've tried various combinations of parameters and all result in the same error. For this example I've kept it simple and am just using a range of alpha values.
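The likely cause, noted here since the question carries no answer in this excerpt: GridSearchCV expects an estimator instance, not the class itself. When the class is passed, sklearn's clone later calls estimator.get_params(deep=False) on the class, which fails with the missing self error seen in the traceback. A minimal sketch, reusing X_train_tfidf and y_train from the question:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# alpha grid from the question (duplicate 0.001 dropped)
parameters = {'alpha': [0.00001, 0.0001, 0.001, 0.01]}

# note the parentheses: pass an instance, not the SGDClassifier class
gs_classifier = GridSearchCV(SGDClassifier(), parameters, n_jobs=-1)
gs_classifier = gs_classifier.fit(X_train_tfidf, y_train)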
