Set thresholds in PySpark multinomial logistic regression

I would like to perform a multinomial logistic regression but I can't set the threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector
test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
(0, DenseVector([3.1, -2.0, -2.9])),
(1, DenseVector([1.0, 0.8, 0.3])),
(1, DenseVector([4.2, 1.4, -1.7])),
(0, DenseVector([-1.9, 2.5, -2.3])),
(2, DenseVector([2.6, -0.2, 0.2])),
(1, DenseVector([0.3, -3.4, 1.8])),
(2, DenseVector([-1.0, -3.5, 4.7]))],
['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). Then I write:
from pyspark.ml import classification as cl
test_logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
)
Then I would like to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so that the value 0.5 should be deleted.
So, how to clear threshold since no clearThreshold() exists?
In order to achieve this I tried to clear threshold this way:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
.setThreshold(None)
)
This time the fit command works, and I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, curiously (and incomprehensibly to me) inverting the order of the parameters settings produces the first error I posted above:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThreshold(None)
.setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?

It is a messy situation indeed...
The short answer is:
setThresholds (plural) not clearing threshold (singular) seems to be a bug.
For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it).
If all you need is to keep the "thresholds" at the "default" value of 0.5, you don't have a problem: simply don't use any relevant argument or setThresholds statement.
If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (for binary classification, though, setThreshold(s) works OK).
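The manual post-processing mentioned in the last point could look like the minimal sketch below. It assumes a scaled rule (predict the class with the largest p/t), which is how the Spark documentation describes `thresholds`; the function name `predict_with_thresholds` is hypothetical, and the probability vector is the row shown in the multinomial output later in this answer. In practice you would apply such a function to the probability column, for example through a UDF.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t ratio.

    Assumption: this follows the scaled rule described for `thresholds`
    in the Spark docs; it is not Spark's own code.
    """
    if len(probabilities) != len(thresholds):
        raise ValueError("need one threshold per class")
    best_class, best_score = 0, float("-inf")
    for i, (p, t) in enumerate(zip(probabilities, thresholds)):
        if t == 0.0:
            # A zero threshold makes the class win whenever p is positive.
            score = float("inf") if p > 0.0 else 0.0
        else:
            score = p / t
        if score > best_score:
            best_class, best_score = i, score
    return best_class

probs = [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755]
plain = predict_with_thresholds(probs, [1.0, 1.0, 1.0])   # plain argmax
strict = predict_with_thresholds(probs, [0.2, 0.2, 0.8])  # stricter for class 2
print(plain, strict)  # 2 0
```

With equal thresholds the rule reduces to picking the most probable class; with a threshold of 0.8 for class 2, the probability 0.795 is no longer enough and class 0 is chosen instead.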
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()
blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0 even though the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence the row is not classified as 1.0.
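The rule at work in the table can be written out as a small sketch. It assumes that in the binary case thresholds=[1 - t, t] is equivalent to a single threshold t on the probability of class 1, which is what the `[1.0-t, t]` line in the getThresholds traceback in the question suggests; the helper name `predict_binary` is hypothetical.

```python
def predict_binary(prob_class_1, threshold=0.5):
    """Predict 1.0 only when the class-1 probability exceeds the threshold."""
    return 1.0 if prob_class_1 > threshold else 0.0

# Class-1 probabilities from the table above, rounded to four decimals.
class_1_probs = [0.7574, 0.6469, 0.4339, 0.1617]

with_default = [predict_binary(p) for p in class_1_probs]
with_0_7 = [predict_binary(p, threshold=0.7) for p in class_1_probs]
print(with_default)  # [1.0, 1.0, 0.0, 0.0]
print(with_0_7)      # [1.0, 0.0, 0.0, 0.0]
```

Only the second row changes between the two settings, which matches the prediction column shown above.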
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
.setThreshold(0.7)
.setThresholds([0.3, 0.7]) ) # works OK
blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
.setThresholds([0.3, 0.7])
.setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
As in the binary case above, where the elements of our thresholds (plural) sum up to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
.setFamily("multinomial")
.setThresholds([0, 0.2, 0.8])
.setThreshold(0.8) )
mlorModel= mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for a prediction in the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that although the probability for class 2.0 here (0.795) is below the threshold we have set (0.8), the row is still predicted as 2.0, in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them. Even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give results identical to the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
To summarize:
(1) In both the binary and multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
(2) In the binary case only, Spark allows you to go one step further: instead of naively selecting the highest-probability class as the prediction, you can apply a user-defined threshold; this setting might be useful e.g. in cases with imbalanced data.
(3) This threshold(s) setting actually has no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...


Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class

I am training a model using HuggingFace Trainer class. The following code does a decent job:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
dataset = load_dataset('glue', 'mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)
args = TrainingArguments(
"test-glue",
learning_rate=3e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
remove_unused_columns=True
)
trainer = Trainer(
model,
args,
train_dataset=encoded_dataset["train"],
tokenizer=tokenizer
)
trainer.train()
However, setting remove_unused_columns=False results in the following error:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
706
ValueError: too many dimensions 'str'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
720 )
721 raise ValueError(
--> 722 "Unable to create tensor, you should probably activate truncation and/or padding "
723 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
724 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Any suggestions are highly appreciated.
It fails because the value at line 705 is a list of str coming from the hypothesis column, and hypothesis is one of the ignored_columns in trainer.py, which are only dropped when remove_unused_columns is True.
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
See the below snippet from trainer.py for the remove_unused_columns flag:
def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
    if not self.args.remove_unused_columns:
        return dataset
    if self._signature_columns is None:
        # Inspect model forward signature to keep only the arguments it accepts.
        signature = inspect.signature(self.model.forward)
        self._signature_columns = list(signature.parameters.keys())
        # Labels may be named label or label_ids, the default data collator handles that.
        self._signature_columns += ["label", "label_ids"]
    columns = [k for k in self._signature_columns if k in dataset.column_names]
    ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
There could be a pull request on HuggingFace to provide a fallback option in case the flag is False. But in general, it looks like the flag implementation is not complete; for example, it can't be used with TensorFlow.
On the other hand, it doesn't hurt to keep it True, unless there is some special need.
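To make the column filtering concrete, here is a minimal sketch of the same logic outside the Trainer. The `forward` function is a hypothetical stand-in for a model's forward method, `split_columns` is an illustrative helper, and the column list is an assumption based on the MNLI dataset and tokenizer used in the question.

```python
import inspect

def forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a model's forward method."""
    return None

def split_columns(forward_fn, column_names):
    """Split dataset columns into those forward() accepts and the rest."""
    signature_columns = list(inspect.signature(forward_fn).parameters.keys())
    # The default data collator also understands these label names.
    signature_columns += ["label", "label_ids"]
    kept = [name for name in column_names if name in signature_columns]
    ignored = [name for name in column_names if name not in signature_columns]
    return kept, ignored

# Columns of the encoded dataset (assumed from the question's code).
columns = ["premise", "hypothesis", "label", "idx",
           "input_ids", "token_type_ids", "attention_mask"]
kept, ignored = split_columns(forward, columns)
print(kept)     # ['label', 'input_ids', 'token_type_ids', 'attention_mask']
print(ignored)  # ['premise', 'hypothesis', 'idx']
```

Under this reading, with remove_unused_columns=False the string columns premise and hypothesis stay in every batch. If the flag really has to be False, one possible workaround is to drop those columns yourself before training, for instance with the dataset's remove_columns method.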

ImageDataBunch.from_df positional indexers are out-of-bounds

Scratching my head on this issue. I don't know how to identify the positional indexers. Am I even passing them?
I'm attempting this for my first Kaggle competition. I can read the CSV into a dataframe and make the needed edits, and I'm trying to create the ImageDataBunch so that training a CNN can begin. This error pops up no matter which method I try. Any advice would be appreciated.
data = ImageDataBunch.from_df(path, df, ds_tfms=tfms, size=24)
data.classes
Backtrace
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-5588812820e8> in <module>
----> 1 data = ImageDataBunch.from_df(path, df, ds_tfms=tfms, size=24)
2 data.classes
/opt/conda/lib/python3.7/site-packages/fastai/vision/data.py in from_df(cls, path, df, folder, label_delim, valid_pct, seed, fn_col, label_col, suffix, **kwargs)
117 src = (ImageList.from_df(df, path=path, folder=folder, suffix=suffix, cols=fn_col)
118 .split_by_rand_pct(valid_pct, seed)
--> 119 .label_from_df(label_delim=label_delim, cols=label_col))
120 return cls.create_from_ll(src, **kwargs)
121
/opt/conda/lib/python3.7/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
477 assert isinstance(fv, Callable)
478 def _inner(*args, **kwargs):
--> 479 self.train = ft(*args, from_item_lists=True, **kwargs)
480 assert isinstance(self.train, LabelList)
481 kwargs['label_cls'] = self.train.y.__class__
/opt/conda/lib/python3.7/site-packages/fastai/data_block.py in label_from_df(self, cols, label_cls, **kwargs)
283 def label_from_df(self, cols:IntsOrStrs=1, label_cls:Callable=None, **kwargs):
284 "Label `self.items` from the values in `cols` in `self.inner_df`."
--> 285 labels = self.inner_df.iloc[:,df_names_to_idx(cols, self.inner_df)]
286 assert labels.isna().sum().sum() == 0, f"You have NaN values in column(s) {cols} of your dataframe, please fix it."
287 if is_listy(cols) and len(cols) > 1 and (label_cls is None or label_cls == MultiCategoryList):
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
2065 def _getitem_tuple(self, tup: Tuple):
2066
-> 2067 self._has_valid_tuple(tup)
2068 try:
2069 return self._getitem_lowerdim(tup)
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
701 raise IndexingError("Too many indexers")
702 try:
--> 703 self._validate_key(k, i)
704 except ValueError:
705 raise ValueError(
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
2007 # check that the key does not exceed the maximum size of the index
2008 if len(arr) and (arr.max() >= len_axis or arr.min() < -len_axis):
-> 2009 raise IndexError("positional indexers are out-of-bounds")
2010 else:
2011 raise ValueError(f"Can only index by location with a [{self._valid_types}]")
IndexError: positional indexers are out-of-bounds
I faced this error while creating a DataBunch when my dataframe/CSV did not have a class label explicitly defined.
I created a dummy column which stored 1's for all my rows in the dataframe and it seemed to work. Also please be sure to store your independent variable in the second column and the label(dummy variable in this case) in the first column.
I believe this error happens if there's just one column in the Pandas DataFrame.
Thanks.
Code:
df = pd.DataFrame(lines, columns=["dummy_value", "text"])
df.to_csv("./train.csv")
data_lm = TextLMDataBunch.from_csv(path, "train.csv", min_freq=1)
Note: This is my first attempt at answering a StackOverflow question. Hope it helped!
This error also appears when your dataset is not correctly split between training and validation.
In the case of dataframes, it assumes there is a column is_valid that indicates which rows are in validation set.
If all rows have True, then the training set is empty, so fastai cannot index into it to prepare the first example, thus raising this error.
Example:
data = pd.DataFrame({
'fname': [f'{x}.png' for x in range(10)],
'label': np.arange(10)%2,
'is_valid': True
})
blk = DataBlock((ImageBlock, CategoryBlock),
splitter=ColSplitter(),
get_x=ColReader('fname'),
get_y=ColReader('label'),
item_tfms=Resize(224, method=ResizeMethod.Squish),
)
blk.summary(data)
Results in the error.
Solution
The solution is to check that your data can be split correctly into train and valid sets. In the above example, it suffices to have one row that is not in validation set:
data.loc[0, 'is_valid'] = False
How to figure it out?
Work in a jupyter notebook. After the error, type %debug in a cell, and enter the post mortem debugging. Go to the frame of the setup function ( fastai/data/core.py(273) setup() ) by going up 5 frames.
This takes you to this line that is throwing the error.
You can then print(self.splits) and observe that the first one is empty.
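A quick check before building the DataBlock can catch this earlier than the debugger. The sketch below is an assumption-laden helper, not part of fastai: `split_sizes` and `check_splits` are hypothetical names, and they only count the is_valid flags.

```python
def split_sizes(is_valid_flags):
    """Count rows in the training and validation splits from is_valid flags."""
    flags = [bool(flag) for flag in is_valid_flags]
    n_valid = sum(flags)
    n_train = len(flags) - n_valid
    return n_train, n_valid

def check_splits(is_valid_flags):
    """Raise a readable error if either split would be empty."""
    n_train, n_valid = split_sizes(is_valid_flags)
    if n_train == 0:
        raise ValueError("training split is empty: every row has is_valid=True")
    if n_valid == 0:
        raise ValueError("validation split is empty: no row has is_valid=True")
    return n_train, n_valid

all_valid = [True] * 10          # the failing example above
fixed = [False] + [True] * 9     # after setting row 0 to is_valid=False
print(split_sizes(all_valid))  # (0, 10)
print(check_splits(fixed))     # (1, 9)
```

With a dataframe like the one in the example, the call would be check_splits(data['is_valid']).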

How to define ratio of summary with hugging face transformers pipeline?

I am using the following code to summarize an article with the huggingface-transformers pipeline:
from transformers import pipeline
summarizer = pipeline(task="summarization" )
summary = summarizer(text)
print(summary[0]['summary_text'])
How can I define a ratio between the summary and the original article? For example, 20% of the original article?
EDIT 1: I implemented the solution you suggested, but got the following error. This is the code I used:
summarizer(text, min_length = int(0.1 * len(text)), max_length = int(0.2 * len(text)))
print(summary[0]['summary_text'])
The error I got:
RuntimeError Traceback (most recent call last)
<ipython-input-9-bc11c5d8eb66> in <module>()
----> 1 summarizer(text, min_length = int(0.1 * len(text)), max_length = int(0.2 * len(text)))
2 print(summary[0]['summary_text'])
13 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1482 # remove once script supports set_grad_enabled
1483 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1485
1486
RuntimeError: index out of range: Tried to access index 1026 out of table with 1025 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
(Note that this answer is based on the documentation for version 2.6 of transformers)
It seems that as of yet the documentation on the pipeline feature is still very shallow, which is why we have to dig a bit deeper. When calling a Python object, it internally references its own __call__ property, which we can find here for the summarization pipeline.
Note that it allows us (similar to the underlying BartForConditionalGeneration model) to specify the min_length and max_length, which is why we can simply call it with something like
summarizer(text, min_length=int(0.1 * len(text)), max_length=int(0.2 * len(text)))
This would give you a summary of about 10-20% length of the original data, but of course you can change that to your liking. Note that the default value for BartForConditionalGeneration for max_length is 20 (as of now, min_length is undocumented, but defaults to 0), whereas the summarization pipeline has values min_length=21 and max_length=142.
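One caveat: min_length and max_length are counted in tokens, while len(text) counts characters, so a ratio based on len(text) typically asks for far more tokens than intended. The sketch below derives both bounds from a token count instead. The helper name `summary_length_bounds` and the optional `hard_cap` argument are hypothetical, and whether an oversized length is what triggers the index error in the question's edit is an assumption, not something verified here.

```python
def summary_length_bounds(n_input_tokens, min_ratio=0.1, max_ratio=0.2,
                          hard_cap=None):
    """Turn length ratios into integer min_length / max_length values."""
    if not 0 < min_ratio <= max_ratio:
        raise ValueError("expected 0 < min_ratio <= max_ratio")
    if hard_cap is not None and hard_cap < 2:
        raise ValueError("hard_cap must be at least 2")
    min_length = max(1, int(min_ratio * n_input_tokens))
    max_length = max(min_length + 1, int(max_ratio * n_input_tokens))
    if hard_cap is not None:
        # Keep both bounds below an upper limit chosen by the caller.
        max_length = min(max_length, hard_cap)
        min_length = min(min_length, max_length - 1)
    return min_length, max_length

print(summary_length_bounds(800))                 # (80, 160)
print(summary_length_bounds(5000, hard_cap=142))  # (141, 142)
```

The token count could come from the pipeline's own tokenizer, for example n_tokens = len(summarizer.tokenizer.encode(text)), and the two values are then passed as min_length and max_length.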

Don't understand error message (basic sklearn command)

I'm new to Python and programming in general, and I wanted to practice a little bit with linear regression in one variable.
I'm currently following the tutorial at this link:
https://www.youtube.com/watch?v=8jazNUpO3lQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=2
and I am doing exactly what he is doing.
However, I encountered an error when running the code shown below
(for simplicity, I put '--' in front of the lines that are output; I used Jupyter Notebook).
At the end I got a long traceback when trying to run 'reg.predict(3300)'.
I don't understand what went wrong.
Can someone help me out?
Cheers!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df = pd.read_csv("homeprices.csv")
df
--area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('price(US$)')
plt.scatter(df.area, df.price, color='red', marker = '+')
--<matplotlib.collections.PathCollection at 0x2e823ce66a0>
reg = linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
--LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
reg.predict(3300)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-ad5a8409ff75> in <module>
----> 1 reg.predict(3300)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
211 Returns predicted values.
212 """
--> 213 return self._decision_function(X)
214
215 _preprocess_data = staticmethod(_preprocess_data)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
194 check_is_fitted(self, "coef_")
195
--> 196 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
197 return safe_sparse_dot(X, self.coef_.T,
198 dense_output=True) + self.intercept_
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
543 "Reshape your data either using array.reshape(-1, 1) if "
544 "your data has a single feature or array.reshape(1, -1) "
--> 545 "if it contains a single sample.".format(array))
546 # If input is 1D raise error
547 if array.ndim == 1:
ValueError: Expected 2D array, got scalar array instead:
array=3300.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Try reg.predict([[3300]]). The API used to allow a scalar value, but now you need to pass a 2D array.
reg.fit(df[['area']],df.price)
I think that because fit above was given a 2D input for X (df[['area']]), reg.predict needs a 2D array for X, too. Hence:
reg.predict([[3300]])
The error message itself says "Expected 2D array, got scalar array instead", so change the call to:
reg.predict([[3300]])

ValueError: Number of priors must match number of classes

I want to run my Python 3 code on Ubuntu, and I also want to understand the problem so that I can handle it in the future.
It seems there is some problem with the imported library function.
## sample code
import numpy as np
x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
y = np.array([1,1,1,2,2,2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB(x, y)
clf = clf.fit(x,y) ###showing error on compiling
print(clf.predict([[-2,1]]))
## output shown
Traceback (most recent call last):
File "naive.py", line 7, in <module>
clf = clf.fit(x,y)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 371, in _partial_fit
raise ValueError('Number of priors must match number of'
ValueError: Number of priors must match number of classes.
## code of library function line 192
190 X, y = check_X_y(X, y)
191 return self._partial_fit(X, y, np.unique(y), _refit=True,
192 sample_weight=sample_weight)
## code of library function line 371
369 # Check that the provide prior match the number of classes
370 if len(priors) != n_classes:
371 raise ValueError('Number of priors must match number of'
372 ' classes.')
373 # Check that the sum is 1
As @Suvan Pandey mentioned, the code runs without error when you write clf = GaussianNB() instead of clf = GaussianNB(x, y).
If we look at the GaussianNB class then the __init__() can take these parameters:
def __init__(self, priors=None, var_smoothing=1e-9): # <-- these have a default value
    self.priors = priors
    self.var_smoothing = var_smoothing
The documentation about the two parameters:
priors – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothing – Portion of the largest variance of all features that is added to variances for calculation stability.
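Since x and y are both data arrays, they don't fit the parameters of __init__(...): passed positionally, x is taken as priors and y as var_smoothing. x has 6 rows while y contains only 2 classes, hence "Number of priors must match number of classes".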
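The binding of positional arguments can be checked without scikit-learn. The sketch below uses a hypothetical stand-in class, `GaussianNBStandIn`, that only copies the constructor signature quoted above; it does not train anything.

```python
import inspect

class GaussianNBStandIn:
    """Hypothetical stand-in that only copies the constructor signature."""
    def __init__(self, priors=None, var_smoothing=1e-9):
        self.priors = priors
        self.var_smoothing = var_smoothing

x = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y = [1, 1, 1, 2, 2, 2]

# Show which parameter each positional argument is bound to.
bound = inspect.signature(GaussianNBStandIn).bind(x, y)
print(list(bound.arguments))  # ['priors', 'var_smoothing']

clf = GaussianNBStandIn(x, y)
n_priors = len(clf.priors)   # the 6 rows of x are treated as 6 priors
n_classes = len(set(y))      # only 2 distinct labels
print(n_priors, n_classes)   # 6 2
```

Six "priors" against two classes is exactly the mismatch that the check at line 370 of the library code rejects.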
