Do Polars dataframes need to be formatted a certain way to be used with scikit-learn models? - scikit-learn

I've been working through a notebook on Kaggle that used Pandas, and I wanted to refactor it to Polars.
The dataframes I'm working with look like this:
X_train
Pclass (i64) | Sex (i16) | Age (i32) | Fare (i16) | Embarked (i16) | title (i16) | family_size (i64) | is_alone (i64) | age*class (i64)
3 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 0
... | ... | ... | ... | ... | ... | ... | ... | ...
shape: (891, 9)
y_train
Survived (i64)
0
1
...
shape: (891,1)
Kaggle has me creating a logistic regression with the following code:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
It's my understanding that Polars dataframes don't work with sklearn, so I modified X_train and y_train to be numpy ndarrays like so:
logreg.fit(X_train.to_numpy(),y_train.to_numpy())
but then I get the following error:
>>> logreg.fit(X_train.to_numpy(),y_train.to_numpy())
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression()
I was going to try saying y_train.transpose(), but there's no mention of needing to reshape the Pandas dataframe so I'm wondering if a transpose is really necessary, or if I'm doing something else wrong.
Edit:
I added the numpy.ravel() function like this:
logreg.fit(X_train.to_numpy(), y_train.to_numpy().ravel())
and it seems to have worked, but now when I call
y_pred = logreg.predict(X_test.to_numpy())
I get the following error:
>>>y_pred= logreg.predict(X_test.to_numpy())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 425, in predict
scores = self.decision_function(X)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 407, in decision_function
X = self._validate_data(X, accept_sparse="csr", reset=False)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 800, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
My X_test df is the same as X_train with the addition of a Survived feature.
Survived (i64) | Pclass (i64) | Sex (i16) | Age (i32) | Fare (i16) | Embarked (i16) | title (i16) | family_size (i64) | is_alone (i64) | age*class (i64)
null | 3 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
shape: (418, 10)
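A minimal sketch of the two fixes the question points toward, assuming the all-null Survived column is what becomes NaN once X_test is converted with to_numpy(). Plain NumPy arrays with made-up values stand in for the converted Polars frames. In Polars itself, the equivalent step would be dropping the column before conversion, for example X_test.drop("Survived").to_numpy().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for X_train.to_numpy() and y_train.to_numpy(); values are illustrative only.
X_train = np.array([[3, 0, 1], [1, 1, 2], [3, 1, 1],
                    [1, 0, 3], [2, 0, 2], [2, 1, 0]], dtype=float)
y_train = np.array([[0], [1], [1], [0], [0], [1]])  # shape (6, 1): a column vector

# Stand-in for X_test.to_numpy(): an extra leading "Survived" column that is all NaN.
X_test = np.array([[np.nan, 3, 0, 1],
                   [np.nan, 1, 1, 2]])

# 1) Flatten y to shape (n_samples,) so sklearn does not warn about a column vector.
y_flat = y_train.ravel()

# 2) Find columns containing NaN and drop them so test features match training features.
nan_columns = np.isnan(X_test).any(axis=0)
X_test_clean = X_test[:, ~nan_columns]

logreg = LogisticRegression()
logreg.fit(X_train, y_flat)
y_pred = logreg.predict(X_test_clean)
print(y_flat.shape, X_test_clean.shape, y_pred.shape)
```

No transpose is needed: ravel() only changes y from shape (n, 1) to (n,), which is the shape a pandas Series would also have provided.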

Related

SUOD Model gives ValueError: Input Contains Nan

I am running SUOD from pyod, which is an ensemble method, and received this error.
The models that I am running are IForest, COPOD and ECOD.
Running these models individually does not report any NaN values in the data. I have also verified that none of the columns contains NaN values. The data is one-hot encoded.
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s finished
Traceback (most recent call last):
File "ensemble.py", line 76, in <module>
clf.fit(x_train_scaled)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/models/suod.py", line 220, in fit
decision_score_mat, self.score_scalar_ = standardizer(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/utils/utility.py", line 152, in standardizer
X = check_array(X)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 919, in check_array
_assert_all_finite(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
raise ValueError(msg_err)
ValueError: Input contains NaN.
and this is my code
train_data.dropna(axis=0)
test1_data.dropna(axis=0)
test2_data.dropna(axis=0)
mm_scaler = MinMaxScaler()
x_train_scaled = mm_scaler.fit_transform(train_data)
x_test2_scaled = mm_scaler.transform(test2_data)
x_test1_scaled = mm_scaler.transform(test1_data)
detector_list = [COPOD(),
                 IForest(n_estimators=100, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 IForest(n_estimators=200, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 ECOD(contamination=0.001)]
clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)
clf.fit(x_train_scaled)
train_pred = clf.predict(x_train_scaled)
test_pred1 = clf.predict(x_test1_scaled)
test_pred2 = clf.predict(x_test2_scaled)
Things that I have tried:
SimpleImputer
Dropping NaN rows
Adding the mock patch
As the error output says, you need to handle NaN values. The dropna method returns a new dataframe and leaves the original unchanged. To modify the dataframe itself, set the inplace parameter to True; the operation is then done in place and returns None:
inplace : boolean, default False
So to modify it in place: data.dropna(axis=0, how='any', inplace=True)
Another possible way to handle NaN values (optional, and only if it suits your problem) is to replace them with the mean of each column: df = df.fillna(df.mean()). Use df.median() instead if you want the median.
A less common case is that your dataframe contains NaN values stored as strings such as "NaN". The functions that handle NaN values won't detect those, so first convert them with something like df.replace("NaN", numpy.nan) and then drop.
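The points above can be sketched on a small made-up dataframe. The column names and values here are illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd

train_data = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# dropna returns a new frame; without assignment the original is unchanged.
train_data.dropna(axis=0)
rows_before = len(train_data)

# Keep the returned frame (or pass inplace=True) to actually remove the rows.
cleaned = train_data.dropna(axis=0)
rows_after = len(cleaned)

# Alternative: fill the gaps with a per-column statistic instead of dropping rows.
filled_mean = train_data.fillna(train_data.mean())
filled_median = train_data.fillna(train_data.median())

# NaN stored as the string "NaN" must be converted before dropna can detect it.
text_nan = pd.DataFrame({"a": ["1", "NaN", "3"]})
converted = text_nan.replace("NaN", np.nan).dropna()

print(rows_before, rows_after, len(converted))
```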

Why I am getting matrices are not aligned error for DataFrame dot function?

I am trying to implement simple linear regression in Python using NumPy and pandas, but calling the dot function, which according to the documentation computes the matrix product, raises ValueError: matrices are not aligned. Here is the code snippet:
import numpy as np
import pandas as pd
#initializing the matrices for X, y and theta
#dataset = pd.read_csv("data1.csv")
dataset = pd.DataFrame([[6.1101,17.592],[5.5277,9.1302],[8.5186,13.662],[7.0032,11.854],[5.8598,6.8233],[8.3829,11.886],[7.4764,4.3483],[8.5781,12]])
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
X.insert(0, "x_zero", np.ones(X.size), True)
print(X)
print(f"\n{y}")
theta = pd.DataFrame([[0],[1]])
temp = pd.DataFrame([[1],[1]])
print(X.shape)
print(theta.shape)
print(X.dot(theta))
And this is the output for the same:
x_zero 0
0 1.0 6.1101
1 1.0 5.5277
2 1.0 8.5186
3 1.0 7.0032
4 1.0 5.8598
5 1.0 8.3829
6 1.0 7.4764
7 1.0 8.5781
0 17.5920
1 9.1302
2 13.6620
3 11.8540
4 6.8233
5 11.8860
6 4.3483
7 12.0000
Name: 1, dtype: float64
(8, 2)
(2, 1)
Traceback (most recent call last):
File "linear.py", line 16, in <module>
print(X.dot(theta))
File "/home/tejas/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1063, in dot
raise ValueError("matrices are not aligned")
ValueError: matrices are not aligned
As the shape output shows, the inner dimensions match (X has 2 columns and theta has 2 rows), so the dot function should return an 8*1 DataFrame. Why the error, then?
The misalignment here does not come from the shapes but from the pandas indexes: dot matches the column labels of X against the index labels of theta. X has the columns x_zero and 0, while theta has the index 0 and 1, so the labels do not line up. You have two options to fix this:
1. Set the index when creating theta:
theta = pd.DataFrame([[0],[1]], index=X.columns)
so that the labels being multiplied match.
2. Make the index irrelevant by converting the second dataframe to NumPy:
X.dot(theta.to_numpy())
This label matching is actually a useful feature of pandas; yours is just a specific case where it becomes counterproductive ;)
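Both options can be checked on a shortened copy of the data from the question. This is a small sketch using the first three rows only.

```python
import numpy as np
import pandas as pd

# Same layout as in the question: columns are labelled "x_zero" and 0.
X = pd.DataFrame({"x_zero": [1.0, 1.0, 1.0], 0: [6.1101, 5.5277, 8.5186]})
theta = pd.DataFrame([[0], [1]])  # default index: 0, 1

# Labels differ (["x_zero", 0] vs [0, 1]), so pandas refuses to multiply.
try:
    X.dot(theta)
    raised = False
except ValueError:
    raised = True

# Option 1: give theta the same labels as the columns of X.
theta_aligned = pd.DataFrame([[0], [1]], index=X.columns)
result_aligned = X.dot(theta_aligned)

# Option 2: pass a plain array so only positions matter.
result_array = X.dot(theta.to_numpy())

print(raised, result_aligned.shape, result_array.shape)
```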

ValueError: Number of priors must match number of classes

I want to run my Python 3 code on Ubuntu, and I also want to understand the problem so that I can handle it in the future.
There seems to be some problem with the imported library function.
## sample code
import numpy as np
x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
y = np.array([1,1,1,2,2,2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB(x, y)
clf = clf.fit(x,y)  # the error is raised here
print(clf.predict([[-2,1]]))
## output shown
Traceback (most recent call last):
File "naive.py", line 7, in <module>
clf = clf.fit(x,y)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 371, in _partial_fit
raise ValueError('Number of priors must match number of'
ValueError: Number of priors must match number of classes.
## code of library function line 192
190 X, y = check_X_y(X, y)
191 return self._partial_fit(X, y, np.unique(y), _refit=True,
192 sample_weight=sample_weight)
## code of library function line 371
369 # Check that the provide prior match the number of classes
370 if len(priors) != n_classes:
371 raise ValueError('Number of priors must match number of'
372 ' classes.')
373 # Check that the sum is 1
As @Suvan Pandey mentioned, the code won't raise an error if you write clf = GaussianNB() instead of clf = GaussianNB(x, y).
If we look at the GaussianNB class then the __init__() can take these parameters:
def __init__(self, priors=None, var_smoothing=1e-9):  # <-- these have a default value
    self.priors = priors
    self.var_smoothing = var_smoothing
The documentation about the two parameters:
priors – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothing – Portion of the largest variance of all features that is added to variances for calculation stability.
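Because x and y are passed positionally, x is taken as priors and y as var_smoothing, and neither array is a valid value for those parameters. Here x has 6 rows while the data has only 2 classes, so the length check on priors fails with the error shown.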
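A short sketch of the corrected usage with the data from the question. The training arrays go to fit(), and the constructor receives only hyperparameters. The priors=[0.5, 0.5] variant is an illustrative assumption that shows what a valid priors argument looks like for two classes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Construct with defaults; the data is passed to fit(), not to the constructor.
clf = GaussianNB()
clf.fit(x, y)
pred = clf.predict([[-2, -1], [2, 1]])

# If priors are wanted, pass one probability per class (two classes here).
clf_priors = GaussianNB(priors=[0.5, 0.5])
clf_priors.fit(x, y)

print(pred)
```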

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column like below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
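The shape difference can be seen on a tiny made-up column. The values and the total_amount_scaled column name are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"total_amount": [0.0, 90.0, 1260.0, 2520.0]})

series_shape = df["total_amount"].shape    # 1-D Series
frame_shape = df[["total_amount"]].shape   # 2-D: (n_samples, n_features)

# Double brackets keep the column two-dimensional, which the scaler accepts.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["total_amount"]])

# Equivalent input built from the Series, as the error message suggests.
scaled_alt = MinMaxScaler().fit_transform(
    df["total_amount"].to_numpy().reshape(-1, 1)
)

# Store the result back as a flat column.
df["total_amount_scaled"] = scaled.ravel()

print(series_shape, frame_shape, scaled.shape)
```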
Try it this way:
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

ValueError: Number of features of the model must match the input. Model n_features is 45 and input n_features is 2

I'm trying to plot a Random Forest visualization for a classification problem with Python 3.
First, I read a CSV file that contains all the necessary data. Read_CSV() is a function that runs correctly and returns three variables: features (a vector with all 45 feature names), data (the data without the label column: 148000 rows and 45 columns), and labels (the label column in integer format: 148000 rows, with 3 classes encoded as the integers 0, 1 and 2).
features,data,labels = Read_CSV()
X_train,X_test,Y_train,Y_test = train_test_split(data,labels,test_size=0.35,random_state=0)
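# Body of visualize_classifier(model, X, y, ...), the function named in the traceback
X = np.array(X).astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
y = np.array(y).astype(float)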
ax = ax or plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
clim=(y.min(), y.max()), zorder=3)
ax.axis('tight')
ax.axis('off')
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# fit the estimator
model.fit(X, y)
xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
np.linspace(*ylim, num=200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Create a color plot with the results
n_classes = len(np.unique(y))
contours = ax.contourf(xx, yy, Z, alpha=0.3,
levels=np.arange(n_classes + 1) - 0.5,
cmap=cmap, clim=(y.min(), y.max()),
zorder=1)
ax.set(xlim=xlim, ylim=ylim)
The code shown here is dedicated entirely to producing a plot like this:
(image not included)
When I run this code I obtain the following:
Traceback (most recent call last):
File "C:/Users/Carles/PycharmProjects/Article/main.py", line 441, in <module>
main()
File "C:/Users/Carles/PycharmProjects/Article/main.py", line 388, in main
visualize_classifier(RandomForestClassifier(),X_train, Y_train)
File "C:/Users/Carles/PycharmProjects/Article/main.py", line 353, in visualize_classifier
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
File "C:\Users\Carles\PycharmProjects\Article\venv\lib\site-packages\sklearn\ensemble\forest.py", line 538, in predict
proba = self.predict_proba(X)
File "C:\Users\Carles\PycharmProjects\Article\venv\lib\site-packages\sklearn\ensemble\forest.py", line 578, in predict_proba
X = self._validate_X_predict(X)
File "C:\Users\Carles\PycharmProjects\Article\venv\lib\site-packages\sklearn\ensemble\forest.py", line 357, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\Carles\PycharmProjects\Article\venv\lib\site-packages\sklearn\tree\tree.py", line 384, in _validate_X_predict
% (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 45 and input n_features is 2
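One way to read the error: the helper fits the model on all 45 columns but then predicts on a grid built from only the two plot axes. Below is a minimal sketch of one possible workaround, assuming it is acceptable to train the visualized model on just the two features being plotted. The data is synthetic and the choice of columns 0 and 1 is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 45))     # synthetic stand-in for the 45-feature data
y = rng.integers(0, 3, size=300)   # three classes: 0, 1, 2

# Fit on the same two columns that serve as the plot axes (hypothetical choice).
X2 = X[:, :2]
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X2, y)

# The grid has two columns, matching the two features the model was trained on.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), num=50),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), num=50))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid).reshape(xx.shape)

print(grid.shape, Z.shape)
```

A model trained this way only describes the two chosen features. Visualizing a classifier fitted on all 45 would need a different approach, such as projecting the data to two dimensions first.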

Resources