Data Cleaning Error in Classification KNN Algorithm Problem - scikit-learn

I believe the error is telling me I have null values in my data and I've tried fixing it but the error keeps appearing. I don't want to delete the null data because I consider it relevant to my analysis.
The columns of my data are in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold contain string data.
Code:
import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)
# Crea una instancia de LabelEncoder
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])
labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])
labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])
labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])
#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)
y_pred = knn.predict(X)
print(y_pred)
Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

You have to fit and transform the data with the SimpleImputer you created. From the documentation:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean') # Here the imputer is created
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]) # Here the imputer is fitted, i.e. learns the mean
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X)) # Here the imputer is applied, i.e. filling the mean
The crucial parts here are imp_mean.fit() and imp_mean.transform(X)
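If you fit and transform the same array, fit_transform is a convenient shorthand; a minimal one-line sketch reusing the names above:
X_imputed = imp_mean.fit_transform(X)  # fits on X and fills its NaNs in one call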
Additionally I'd use another technique to handle categorical data since LabelEncoder is not suitable here:
This transformer should be used to encode target values, i.e. y, and not the input X.
For alternatives see here: How to consider categorical variables in distance based algorithms like KNN or SVM?
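For illustration, one such alternative is to one-hot encode the string columns and mean-impute the numeric ones with a ColumnTransformer before KNN. A rough sketch; the split into categorical and numeric columns below is an assumption based on the column list in the question, not something taken from the linked answer:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# assumed split of the question's columns into string and numeric features
categorical_cols = ['Titulo', 'Autor', 'Género']
numeric_cols = ['Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas']

# impute missing categories with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                             ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# mean-impute the numeric columns
preprocess = ColumnTransformer([('cat', categorical_pipe, categorical_cols),
                                ('num', SimpleImputer(strategy='mean'), numeric_cols)])

knn_pipe = Pipeline([('prep', preprocess),
                     ('knn', KNeighborsClassifier(n_neighbors=3))])
knn_pipe.fit(dataset[categorical_cols + numeric_cols], dataset['Estado'])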

You need SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, we encode the target variable using LabelEncoder.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)

Related

(Python 3) Classification of dataset, using a user input Elo, to suggest opening move based on Chess dataset?

I'm just looking for a little bit of help. I'm struggling to work out whether what I'm doing is right or not, and even whether Naive Bayes is the right way to do this.
I want the user to be able to input their Elo, and the 'app' to suggest an opening move set, based on win rate at that Elo. For this I am using the following dataset: https://www.kaggle.com/datasnaek/chess
The important data out of this are the opening name (what I'm trying to suggest), the average rating (what the user can input), and the winner (we need to see if white wins).
This is my code so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from matplotlib.colors import ListedColormap
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Read in dataset
data = pd.read_csv(f"games.csv")
# set new column that is true/false depending on if white wins
data['white_wins'] = (data['winner'] == "white")
# Create new columns, average rating (based on white rating and black rating) and category (categorization of rating for Naive Bayes)
data['average_rating'] = data.apply(lambda row: (row['white_rating'] + row['black_rating']) / 2, axis=1)
data['category'] = data['average_rating'] // 100 + 1
# Drop unnecessary columns
data = data.drop(['turns', 'moves', 'victory_status', 'id', 'winner', 'rated', 'created_at', 'last_move_at', 'opening_ply', 'white_id', 'black_id', 'increment_code', 'opening_eco', 'white_rating', 'black_rating'], axis=1)
#Label Encoder Initialisation
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
opening_name_encoded=le.fit_transform(data.opening_name)
category_encoded=le.fit_transform(data.category)
label=le.fit_transform(data.white_wins)
#Package features together
features=zip(opening_name_encoded, category_encoded)
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
And I currently get an error.
Also, I'm not even convinced this is correct, because if I continue down this path, I am only going to be predicting whether white wins based on the opening move set and Elo. I'm really unsure where to take this to get it to the point I need.
Thanks for any help!
zip returns an iterator, so your code is not doing what you expect. My guess is that you intended features to be a list of 2-tuples. If that is the case, then adjust your code to features = list(zip(opening_name_encoded, category_encoded))
In [31]: zip([1, 2, 3], ['a', 'b', 'c'])
Out[31]: <zip at 0x25d61abfd80>
In [32]: list(zip([1, 2, 3], ['a', 'b', 'c']))
Out[32]: [(1, 'a'), (2, 'b'), (3, 'c')]
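Since GaussianNB.fit expects a 2-D array-like of shape (n_samples, n_features), another option is to stack the encoded columns into a NumPy array; a minimal sketch reusing the variable names from the question:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# stack the two encoded columns into an (n_samples, 2) feature matrix
features = np.column_stack((opening_name_encoded, category_encoded))
model = GaussianNB()
model.fit(features, label)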

Use lmfit Model - function has dataframe as argument

I want to use lmfit in order to fit my data.
The function I am using has only one argument, features. The content of features will be different (both columns and values), so I can't initialize parameters.
I tried to create a dataframe as here, but I can't use the guess method because this is for LorentzianModel and I just want to use Model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
from sklearn.linear_model import LinearRegression

df = {'a': [0, 0.2, 0.3], 'b': [14, 10, 9], 'target': [100, 200, 300]}
df = pd.DataFrame(df)
X = df[['a', 'b']]
y = df[['target']]
model = LinearRegression().fit(X, y)

features = pd.DataFrame({"a": np.array([0, 0.11, 0.36]),
                         "b": np.array([10, 14, 8])})

def eval_custom(features):
    res = model.predict(features)
    return res

x_val = features[["a"]].values

def calling_func(features, x_val):
    pred_custom = eval_custom(features)
    df = pd.DataFrame({'x': np.squeeze(x_val), 'y': np.squeeze(pred_custom)})
    themodel = lmfit.Model(eval_custom)
    params = themodel.guess(df['y'], x=df['x'])
    result = themodel.fit(df['y'], params, x=df['x'])
    result.plot_fit()

calling_func(features, x_val)
The model function needs to take independent variables and the individual model parameters as arguments. You're wrapping all of that into a single pandas Dataframe and then sending that. Don't do that.
If you need to create a dataframe from the current values of the model, do that inside your model function.
Also: a generic model function does not have a working guess function. Use model.make_params() and definitely, definitely (no exceptions, nope not ever) provide actual initial values for every parameter.
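As a generic illustration (a toy model, not the regression from the question), a model function with explicit parameters and make_params seeded with real initial values might look like this:
import numpy as np
import lmfit

# toy model function: the independent variable and each fit parameter
# are explicit arguments
def decay_model(x, amplitude=1.0, tau=1.0):
    return amplitude * np.exp(-x / tau)

model = lmfit.Model(decay_model)
params = model.make_params(amplitude=5.0, tau=2.0)   # actual initial values
x = np.linspace(0, 10, 50)
y = 5.0 * np.exp(-x / 2.0) + np.random.normal(0, 0.2, x.size)
result = model.fit(y, params, x=x)
print(result.fit_report())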

DataType of InputField is double although in the PMMLPipeline it is string

I am exporting a PMMLPipeline with a categorical string feature day_of_week as a PMML file. When I open the file in Java and list the InputFields I see that the data type of day_of_week field is double:
InputField{name=day_of_week, fieldName=day_of_week, displayName=null, dataType=double, opType=categorical}
Hence when I evaluate an input I get the error:
org.jpmml.evaluator.InvalidResultException: Field "day_of_week" cannot accept user input value "tuesday"
On the Python side the pipeline works with a string column:
data = pd.DataFrame(data=[{"age": 10, "day_of_week": "tuesday"}])
y = trained_model.predict(X=data)
Minimal example for creating the PMML file:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }
    data = pd.DataFrame(data_dict, columns=data_dict)
    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    categorical_features = ['day_of_week']
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])
    pipeline = PMMLPipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])
    X = data.drop(labels=['y'], axis=1)
    y = data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = pipeline.fit(X=X_train, y=y_train)
    sklearn2pmml(pipeline=pipeline, pmml='RandomForestRegressor2.pmml', with_repr=True)
EDIT:
sklearn2pmml creates a PMML file with a DataDictionary whose DataField "day_of_week" has dataType="double". I think it should be "String". Do I have to set the dataType somewhere to correct this?
<DataDictionary>
<DataField name="day_of_week" optype="categorical" dataType="double">
You can assist SkLearn2PMML by providing "feature type hints" using sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain decorators (see here for more details).
In the current case, you should prepend a CategoricalDomain step to the pipeline that deals with categorical features:
from sklearn2pmml.decoration import CategoricalDomain
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))
])
Thanks for your reply #user1808924.
The given solution works. Now, to add to his answer, I would like to note that CategoricalDomain works for a single feature only.
Problem:
So, when you use it in a pipeline like:
# pipeline creation
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('onehot', OrdinalEncoder())
])
# fit and transform of `df` with 3 features
categorical_transformer.fit_transform(df)
### >>> ERROR: Expected 1d array, got 2d array of shape (1000, 3)
This means you will need to use multiple CategoricalDomains there.
NOTE: We often use it in a ColumnTransformer as well. You need to know how many categorical features there are beforehand.
What can we do?
We will simply use the MultiDomain from the same library.
from sklearn2pmml.decoration import MultiDomain
categorical_transformer = Pipeline(steps=[
    ('domain', MultiDomain([CategoricalDomain(dtype = str) for _ in range(3)])),
    ('onehot', OrdinalEncoder())
])
Note that 3 is the number of categorical columns here. Hence, there will be one CategoricalDomain per categorical column.
Then performing the transformation will work.

TypeError: __init__() got an unexpected keyword argument 'categorical_features'

Spyder (Python 3.7)
I am facing the following error here. I have already updated all libraries from the Anaconda prompt, but I can't find a solution to the problem.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-4-05deb1f02719>", line 2, in <module>
onehotencoder = OneHotEncoder(categorical_features = [1])
TypeError: __init__() got an unexpected keyword argument 'categorical_features'
So based on your code, you'd have to:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Country column
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
# Male/Female
labelencoder_X = LabelEncoder()
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
Notice how the first LabelEncoder was removed; you do not need to apply both the label encoder and the one-hot encoder on the column anymore.
(I've kind of assumed your example came from the ML Udemy course, and that the first column was a list of countries, while the second one was a male/female binary choice.)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X=np.array(columnTransformer.fit_transform(X),dtype=np.str)
Since the latest builds of the sklearn library removed the categorical_features parameter from the OneHotEncoder class, it is advised to use the ColumnTransformer class for categorical datasets. Refer to sklearn's official documentation for further clarification.
According to the documentation this is the __init__ line:
class sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
As you can see, the init does not take the variable categorical_features.
You have a categories flag:
categories : 'auto' or a list of array-like, default='auto'
    Categories (unique values) per feature:
    'auto' : Determine categories automatically from the training data.
    list : categories[i] holds the categories expected in the ith column.
    The passed categories should not mix strings and numeric values within
    a single feature, and should be sorted in case of numeric values.
    The used categories can be found in the categories_ attribute.
Attributes:
categories_ : list of arrays
    The categories of each feature determined during fitting (in order of
    the features in X and corresponding with the output of transform).
    This includes the category specified in drop (if any).
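For example, passing the expected categories explicitly looks roughly like this (a toy sketch with made-up country values):
from sklearn.preprocessing import OneHotEncoder

# categories[i] lists the values expected in the i-th column
enc = OneHotEncoder(categories=[['France', 'Germany', 'Spain']])
X_country = [['France'], ['Spain'], ['Germany']]
print(enc.fit_transform(X_country).toarray())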
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x_1 = LabelEncoder()
X[: , 2] = label_encoder_x_1.fit_transform(X[:,2])
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",        # Just a name
         OneHotEncoder(), # The transformer class
         [1]              # The column(s) to be applied on.
         )
    ],
    remainder='passthrough'  # do not apply anything to the remaining columns
)
X = transformer.fit_transform(X.tolist())
X = X.astype('float64')
Working like a charm :)
Assuming this is the problem from the ML course on Udemy
complete code
I replaced label encoder 1 with a column transformer, as suggested by Antoine Jaussoin in the comment above.
Categorical Data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Geography", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
Your Gender column will have index 4 now
labelencoder_x_2=LabelEncoder()
X[:,4]=labelencoder_x_2.fit_transform(X[:,4])
To avoid the dummy variable trap:
X=X[:, 1:]
You need to call another class from sklearn which will eliminate one column to avoid the dummy variable trap.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer # Here is the one
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
#onehotencoder = OneHotEncoder(categorical_features = [1]) Not this one
# use this instead
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
X = X[:, 1:]
Happy Helping!!!
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Geography",OneHotEncoder(),[1])], remainder= 'passthrough')
X = ct.fit_transform(X)
labelencoder_X2 = LabelEncoder()
X[:, 4] = labelencoder_X2.fit_transform(X[:, 4])
X = X[: , 1:]
X = np.array(X, dtype=float)
Just adding an extra line to convert it from an array of objects.
Replace the following code
# onehotencoder = OneHotEncoder(categorical_features = [1])
# X = onehotencoder.fit_transform(X).toarray()
# X = X[:, 1:]
with the following chunk and your code should work:
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder = 'passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype = np.float64)
X = X[:, 1:]
Assuming you're learning Deep Learning from Udemy.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# remove categorical_features, it works 100% perfectly
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]
Here is just one more extension for OneHotEncoder, for when X has a lot of categorical columns. Instead use:
ct = ColumnTransformer([("encoder", OneHotEncoder(), list(categorical_features))], remainder = 'passthrough')
X = ct.fit_transform(X)
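Here categorical_features is assumed to be a list of the indices of the string columns, for example:
categorical_features = [1, 2, 4]  # hypothetical indices of the categorical columns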
Another solution, which also converts the X object from an array of objects to float64:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
one_hot_encode = OneHotEncoder(categorical_features=[0]) works in scikit-learn 0.20.3, and the parameter is removed in scikit-learn 0.24.2 (the versions I am checking).
Either downgrade the scikit-learn version, or use:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
"""2 classes- Known/unknown Face"""
ct = ColumnTransformer([("Faces", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
"""Country column"""
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)

How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?

I'm trying to make a Pipeline with GridSearchCV to filter data (with iForest) and perform a regression with StandardScaler + MLPRegressor.
I made a FunctionTransformer to include my iForest filter in the pipeline. I also defined a parameter grid for the iForest filter (using the kw_args method).
All seems OK, but when I run the fit, nothing happens ... No error message. Nothing.
Afterwards, when I want to make a prediction, I get the message: "This RandomizedSearchCV instance is not fitted yet"
from scipy.stats import randint as sp_randint, uniform as sp_rand  # assumed source of sp_randint / sp_rand
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
#Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    #iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)
    # The DF (dataframe in input) is filtered taking into account only the inlier observations
    data_filtered = DF[iforest.predict(DF) == 1]
    # Only a few variables are kept for the next step (regression by MLPRegressor)
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP','BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)
#Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)), ('scaler', StandardScaler()), ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]
#Gridsearch definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}], 'MLPR__hidden_layer_sizes':[(sp_randint.rvs(1, nb_features, 1),), (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))], 'MLPR__alpha':sp_rand.rvs(0, 1, 1)}
pipeline = Pipeline(pipeline_steps)
estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)
You can try to run each step manually to find the problem:
auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)
If each of the steps works fine, build a pipeline. Check the pipeline. If it works fine, try to use RandomizedSearchCV.
The func parameter of FunctionTransformer should be a callable that accepts the
same arguments as transform method (array-like X of shape
(n_samples, n_features) and kwargs for func) and returns a transformed X of
the same shape. Your function auto_filter doesn't fit these requirements.
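For contrast, a func that does satisfy these requirements simply maps X to a transformed X of the same shape; a minimal sketch (not the filtering logic from the question):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log_scale(X):
    # array-like in, array of the same shape out
    return np.log1p(X)

log_transformer = FunctionTransformer(log_scale)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
print(log_transformer.fit_transform(X_demo))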
Additionally, anomaly/outlier detection techniques from scikit-learn cannot be
used as intermediate steps in scikit-learn pipelines since a pipeline assembles
one or more transformers and an optional final estimator. IsolationForest or,
say, OneClassSVM is not a transformer: it implements fit and predict.
Thus, a possible solution is to cut off possible outliers separately and build
a pipeline composing of transformers and a regressor:
>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> detect = IsolationForest(contamination=0.1, behaviour='new')
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
... ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
... n_iter=2, cv=3, iid=True)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])
The problem is that you won't be able to optimize the hyperparameters of
IsolationForest. One way to handle it is to define hyperparameter space
for the forest, sample hyperparameters with ParameterSampler or
ParameterGrid, predict inliers and fit randomized search:
>>> from sklearn.model_selection import ParameterGrid
>>> forest_param_dict = dict(contamination=[0.1, 0.15, 0.2])
>>> forest_param_grid = ParameterGrid(forest_param_dict)
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search.fit(X[inliers_mask], y[inliers_mask])
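If you also want to remember which contamination value gave the best regression score, one way to extend that loop (a sketch reusing the names above) is:
>>> best_score, best_conta = -np.inf, None
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search.fit(X[inliers_mask], y[inliers_mask])
...     if search.best_score_ > best_score:
...         best_score, best_conta = search.best_score_, sample['contamination']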
