Sklearn error: predict(x,y) takes 2 positional arguments but 3 were given - scikit-learn

I am building a multivariate regression analysis in sklearn and have gone through the documentation thoroughly. When I run the predict() function I get the error: predict() takes 2 positional arguments but 3 were given.
X is a data frame and y is a column; I have tried converting the data frame to an array/matrix but still get the error.
I have added a snippet showing the x and y arrays.
x_train = train.drop('y-variable', axis=1)
y_train = train['y-variable']
x_test = test.drop('y-variable', axis=1)
y_test = test['y-variable']
x = x_test.as_matrix()
y = y_test.as_matrix()
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
reg.coef_
reg.predict(x, y)

Use reg.predict(x). You don't need to provide the y values to predict. In fact, the purpose of training the machine learning model is to let it infer the values of y given the input parameters in x.
Also, the documentation of predict here explains that predict expects only x as a parameter.
The reason why you get the error:
predict() takes 2 positional arguments but 3 were given
is that, when you call reg.predict(x), Python implicitly translates it to reg.predict(self, x); that is why the error says predict() takes 2 positional arguments. The way you called it, reg.predict(x, y), is translated to reg.predict(self, x, y), so 3 positional arguments are passed instead of 2, which explains the whole error message.
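With the variables from the question, the corrected call would simply be:
# the fix: pass only the features; predict() returns the estimated y values
y_pred = reg.predict(x)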

When you predict on the test set, the model is not given its labels; you are testing how well it generalizes, and you then compare the predictions with the real labels. So when you want to predict, you use only your X variable(s).

I think you are getting confused between reg.predict() and reg.score(). The former makes predictions on data using the trained model: it takes only your features/independent variables X (plus the object itself, self, which is handled internally) and returns the corresponding predicted target/dependent variable Y, which you can later compare with the actual target values to evaluate the model's performance. If you wish to do the evaluation in a single step, use the reg.score() method, which takes both X and Y as inputs and computes the corresponding evaluation measure (R^2 or accuracy, depending on the problem at hand). Please refer to sklearn.linear_model.LinearRegression for more information.
Also, these methods are common for most of the supervised learning models in sklearn.
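A short sketch of the difference, reusing the fitted reg and the x/y arrays from the question:
from sklearn.metrics import r2_score

y_pred = reg.predict(x)        # predict: needs only the features
print(r2_score(y, y_pred))     # evaluate the predictions against the true y
print(reg.score(x, y))         # score: does both steps at once and returns the same R^2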

Related

XGboost model variable transformation

I am working on an XGBoost model with a few input variables. There is one variable X for which I am testing different ways of transformation.
option 1. I apply a group-by average of X, and use the deviation X - group_by_mean(X) as input
option 2. I apply a simple linear regression y = aX + b on the group-by of X, and use the fitted y as input
I run two models with otherwise identical input.
Result: I get better predictions from option 1 than from option 2 with the XGBoost model.
My question is, can anyone direct me to the potential theoretical reason why option 1 gives me better results as an input to the XGBoost model than option 2?
I suspect it is because in option 2 a simple linear regression introduces unobserved error, while in option 1 a simple average has zero unobserved error, since I use all known information. But I would appreciate more theoretical reasoning and backing if possible.
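For concreteness, a rough sketch of the two options as I read them (toy data; 'group', 'X' and 'target' are made-up column names):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"group": ["a", "a", "a", "b", "b", "b"],
                   "X": [1.0, 2.0, 3.0, 3.0, 5.0, 7.0],
                   "target": [0.5, 1.5, 2.0, 2.0, 4.0, 6.5]})

# option 1: deviation of X from its group-by mean
df["opt1"] = df["X"] - df.groupby("group")["X"].transform("mean")

# option 2 (one reading): fit y = aX + b within each group and use the fitted y
opt2 = pd.Series(index=df.index, dtype=float)
for _, g in df.groupby("group"):
    lr = LinearRegression().fit(g[["X"]], g["target"])
    opt2.loc[g.index] = lr.predict(g[["X"]])
df["opt2"] = opt2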

what y parameter means in knn sklearn fit function

I'm new to this field and I'm trying to use the kNN algorithm from sklearn.
I found that I need to train the model with the fit function, passing the first argument X as the training set and the second argument y as the target values. I am trying to understand what the target values parameter y means. Has anyone used this kNN sklearn algorithm and can explain it to me?
Here in the documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
I saw that they sometimes use only X as a parameter, but when I try to pass only X it says that y is also required.
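In short, y holds the known class label of each training sample; the classifier learns to predict it. A minimal sketch with made-up toy data:
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [8, 8], [9, 9]]   # training samples (features)
y = [0, 0, 1, 1]                        # target values: the known class of each sample

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)                           # a classifier needs both X and y
print(knn.predict([[8.5, 8.5]]))        # -> [1]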

Machine Learning Linear Regression - Sklearn

I'm new to the machine learning domain and in linear regression I have some doubts.
1: While practicing the sklearn linear regression model's prediction method I get the error below.
Code:
sklearn.linear_model.LinearRegression.predict(25)
Error:
"ValueError: Expected 2D array, got scalar array instead: array=25. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Do I need to pass a 2-D array? I checked the sklearn documentation page and haven't found anything about a version change.
Running my code on Kaggle:
https://www.kaggle.com/aman9d/bikesharingdemand-upx/
2: Is the index of the dataset going to affect the model's score (weights)?
First of all, you should post the code exactly as you use it:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
# use the predict method
linreg.predict(25)
What you posted in the question is not directly executable; the predict method is not static for the LinearRegression class.
When you fit a model, the first step is to recognize what kind of data the input will be; in your case it must look like X. That means that if you pass something whose shape differs from X's to the model, it will raise an error.
In your example X seems to be a pd.DataFrame() instance with only 1 column. It can be replaced by a 2-dimensional array of shape (number of examples, number of features), so if you try:
linreg.predict([[25]])
should work.
For example, if you were trying a regression with more than one feature (i.e. column), let's say temp and humidity, your input would look like this:
linreg.predict([[25, 56]])
I hope this will help you and always keep in mind which is the shape of your data.
Documentation: LinearRegression fit
X : array-like or sparse matrix, shape (n_samples, n_features)
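For reference, a minimal self-contained sketch (the numbers are invented, not from the Kaggle notebook):
import numpy as np
from sklearn.linear_model import LinearRegression

# single feature: reshape(-1, 1) turns the 1-D array into shape (n_samples, 1)
X = np.array([20, 22, 25, 30]).reshape(-1, 1)
y = np.array([100, 120, 150, 200])

linreg = LinearRegression()
linreg.fit(X, y)
print(linreg.predict([[25]]))   # a single sample must still be 2-D: shape (1, n_features)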

How to provide weighted eval set to XGBClassifier.fit()?

From the sklearn-style API of XGBClassifier, we can provide eval examples for early-stopping.
eval_set (list, optional) – A list of (X, y) pairs to use as a validation set for early-stopping
However, the format only mentions a pair of features and labels. So if the doc is accurate, there is no place to provide weights for these eval examples.
Am I missing anything?
If it's not achievable in the sklearn-style, is it supported in the original (i.e. non-sklearn) XGBClassifier API? A short example will be nice, since I never used that version of the API.
As of a few weeks ago, there is a new parameter for the fit method, sample_weight_eval_set, that allows you to do exactly this. It takes a list of weight variables, i.e. one per evaluation set. I don't think this feature has made it into a stable release yet, but it is available right now if you compile xgboost from source.
https://github.com/dmlc/xgboost/blob/b018ef104f0c24efaedfbc896986ad3ed1b66774/python-package/xgboost/sklearn.py#L235
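If your xgboost build includes it, usage would look roughly like this (toy data; argument placement, e.g. early_stopping_rounds, has moved around between xgboost versions):
import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X_train, y_train, w_train = rng.rand(100, 3), rng.randint(0, 2, 100), rng.rand(100)
X_val, y_val, w_val = rng.rand(30, 3), rng.randint(0, 2, 30), rng.rand(30)

model = XGBClassifier(n_estimators=50)
model.fit(
    X_train, y_train,
    sample_weight=w_train,              # weights for the training rows
    eval_set=[(X_val, y_val)],          # validation pair used for early stopping
    sample_weight_eval_set=[w_val],     # one weight array per (X, y) pair in eval_set
    early_stopping_rounds=5,            # newer versions expect this in the constructor
)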
EDIT - UPDATED per conversation in comments
Given that you have a target-variable representing real-valued gain/loss values which you would like to classify as "gain" or "loss", and you would like to make sure the validation-set of the classifier weighs the large-absolute-value gains/losses heaviest, here are two possible approaches:
Create a custom classifier which is just an XGBRegressor fed into a threshold, where the real-valued regression predictions are converted to 1/0 or "gain"/"loss" classifications (a rough sketch is given below). The .fit() method of this classifier would just call .fit() of the regressor, while its .predict() method would call .predict() of the regressor and then return the thresholded category predictions.
You mentioned you would like to try weighting the treatment of the records in your validation set, but there is no option for this in xgboost. The way to implement it would be a custom eval metric. However, you pointed out that eval_metric must return a score for a single label/prediction record at a time, so it could not accept all your row values and perform the weighting inside the eval metric. The solution you mentioned in your comment was to "create a callable which has a ref to all validation examples, pass the indices (instead of labels and scores) into eval_set, use the indices to fetch labels and scores from within the callable and return a metric for each validation example." This should also work.
I would tend to prefer option 1 as the more straightforward, but trying two different approaches and comparing results is generally a good idea if you have the time, so I am interested in how these turn out for you.
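A rough sketch of option 1 as I read it (a thin wrapper rather than a full sklearn-compatible estimator; the 0.0 threshold is an assumption):
from xgboost import XGBRegressor

class ThresholdedGainLossClassifier:
    # regress the real-valued gain/loss, then threshold into a 1/0 ("gain"/"loss") class
    def __init__(self, threshold=0.0, **xgb_params):
        self.threshold = threshold
        self.regressor = XGBRegressor(**xgb_params)

    def fit(self, X, y, **fit_params):
        self.regressor.fit(X, y, **fit_params)   # y is the real-valued gain/loss here
        return self

    def predict(self, X):
        return (self.regressor.predict(X) > self.threshold).astype(int)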

kmeans clustering transform method specifying centroids

In the scikit-learn kmeans source code, there is an optional argument y that can be specified (transform(X[, y])); however when I examined the source code for transform, it seems that nowhere does it deal with y in the case that it is specified. What is the purpose of this optional argument (it is not clear in the documentation either)?
As an addendum, I was wondering if there is any way to specify the centroids in the transform function if they have already been computed previously. (Or if there is any other function to do this in scikit-learn.)
Centroid specification
You could just overwrite kmeans_object.cluster_centers_ with your own centroids. But it might be better to use init with these centers and run a few iterations.
See the available attributes in the docs.
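For example (the centers below are made up):
import numpy as np
from sklearn.cluster import KMeans

my_centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # precomputed centroids
X = np.array([[0.1, 0.2], [0.0, -0.1], [4.9, 5.1], [5.2, 4.8]])

# start from the known centers and let KMeans refine them
km = KMeans(n_clusters=2, init=my_centers, n_init=1).fit(X)

# or overwrite the fitted centers directly, then transform
km.cluster_centers_ = my_centers
distances = km.transform(X)   # distance of each sample to each centroid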
To answer your first question about the seemingly pointless argument y: you are correct, in many cases Scikit-Learn allows users to pass a y argument that doesn't actually affect the outcome of the method.
As explained in their documentation:
y might be ignored in the case of unsupervised learning. However, to make it possible to use the estimator as part of a pipeline that can mix both supervised and unsupervised transformers, even unsupervised estimators need to accept a y=None keyword argument in the second position that is just ignored by the estimator. For the same reason, fit_predict, fit_transform, score and partial_fit methods need to accept a y argument in the second place if they are implemented.
So it's all to make the code easier to write. Imagine that you have a pipeline that looks like this:
step 0: some normalization
step 1: K-means to transform the data in another space
step 2: classification step
Step 1 obviously doesn't need y to work, but if you have to write the code that makes the pipeline apply all of these steps, it is easier to simply pass X, y into all transformers than to worry about whether each individual transformer takes a y or not.
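A minimal sketch of such a pipeline (synthetic data; the step names are made up):
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                   # step 0: normalization, ignores y
    ("kmeans", KMeans(n_clusters=3, n_init=10)),   # step 1: transform to distance-to-centroid space, ignores y
    ("clf", LogisticRegression()),                 # step 2: supervised classification, uses y
])

pipe.fit(X, y)   # y is passed along to every step; the unsupervised ones simply ignore it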
