ExtraTreesClassifier with sparse training data? - scikit-learn

I am trying to use an ExtraTreesClassifier with sparse data, which the documentation says is supported, but I get a runtime TypeError asking for dense data. This is on scikit-learn 0.17.1. Quoting the documentation:
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
The code is quite simple:
import pandas as pd
from scipy.sparse import coo_matrix, csr_matrix, hstack
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
# Build the arrays with NumPy directly; recent SciPy releases no longer re-export array
features = np.array([[1, 0], [0, 1], [3, 4]])
sparse_features = csr_matrix(features)
labels = np.array([0, 1, 0])
classifier = ExtraTreesClassifier()
classifier.fit(sparse_features, labels)
And here is the exception: TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. The fit works fine when I pass in the dense features array instead.
Looks like the documentation is out of date or is there something wrong with the above code?
Any help will be greatly appreciated. Thank you.

Quoting the documentation:
Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
So I expect that passing a csc_matrix should help.
On my setup both versions work normally (csc and csr, sklearn 0.17.1), so I assume the problem could be an older version of scipy.
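One way to check this on your own installation is to fit the classifier on both sparse formats. This is only a minimal sketch, assuming a scikit-learn release whose tree ensembles accept sparse input; the variable names (predictions, to_sparse) are illustrative and not from the question.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.ensemble import ExtraTreesClassifier

# Same toy data as in the question
features = np.array([[1, 0], [0, 1], [3, 4]])
labels = np.array([0, 1, 0])

predictions = {}
for name, to_sparse in (("csr", csr_matrix), ("csc", csc_matrix)):
    classifier = ExtraTreesClassifier(n_estimators=10, random_state=0)
    # fit converts sparse input to CSC internally; predict converts it to CSR
    classifier.fit(to_sparse(features), labels)
    predictions[name] = classifier.predict(to_sparse(features))

print(predictions)
```

If either call raises the TypeError from the question, comparing the installed scikit-learn and scipy versions against a working setup is a reasonable next step.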

Related

Data Cleaning Error in Classification KNN Algorithm Problem

I believe the error is telling me I have null values in my data and I've tried fixing it but the error keeps appearing. I don't want to delete the null data because I consider it relevant to my analysis.
The columns of my data are in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', 'Estado'. The string columns are 'Titulo', 'Autor', 'Género', and 'Estado'.
Code:
import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)
# Create a LabelEncoder instance for each string column
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])
labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])
labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])
labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])
#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)
y_pred = knn.predict(X)
print(y_pred)
Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
You have to fit and transform the data with the SimpleImputer you created. From the documentation:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean') # Here the imputer is created
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]) # Here the imputer is fitted, i.e. learns the mean
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X)) # Here the imputer is applied, i.e. filling the mean
The crucial parts here are imp_mean.fit() and imp_mean.transform(X)
Additionally I'd use another technique to handle categorical data since LabelEncoder is not suitable here:
This transformer should be used to encode target values, i.e. y, and not the input X.
For alternatives see here: How to consider categorical variables in distance based algorithms like KNN or SVM?
You need SimpleImputer to impute the missing values in X. Fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, encode the target variable using LabelEncoder.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)
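The two answers above can be combined into one pipeline: impute the numeric columns and one-hot encode the categorical ones instead of label-encoding them. The sketch below is an illustration only. The small DataFrame is a made-up stand-in for the spreadsheet (only the column names 'Genero', 'Paginas' and 'Estado' echo the question), and scaling the numeric column is an extra assumption that usually helps distance-based models such as KNN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical feature, one numeric feature with gaps
df = pd.DataFrame({
    "Genero": ["novel", "essay", "essay", "novel", "essay", "novel"],
    "Paginas": [320.0, np.nan, 150.0, 410.0, np.nan, 280.0],
    "Estado": ["read", "unread", "unread", "read", "unread", "read"],
})
X = df[["Genero", "Paginas"]]
y = df["Estado"]

# Numeric columns: fill NaN with the column mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("numeric", numeric_steps, ["Paginas"]),
    # Categorical columns: one indicator column per category
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Genero"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the imputer inside the pipeline means the column means are learned only from the data passed to fit, which matters once you split into training and test sets.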

(Python 3) Classification of dataset, using a user input Elo, to suggest opening move based on Chess dataset?

I'm just looking for a little bit of help. I'm struggling to work out whether what I'm doing is right, and whether Naive Bayes is even the right way to do this.
I want the user to be able to input their Elo, and the 'app' to suggest an opening move set based on win rate at that Elo. For this I am using the following dataset: https://www.kaggle.com/datasnaek/chess
The important data out of this, are the opening name (what I'm trying to suggest), the average rating (what the user can input), and winner (we need to see if white wins).
This is my code so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from matplotlib.colors import ListedColormap
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Read in dataset
data = pd.read_csv("games.csv")
# set new column that is true/false depending on if white wins
data['white_wins'] = (data['winner'] == "white")
# Create new columns, average rating (based on white rating and black rating) and category (categorization of rating for Naive Bayes)
data['average_rating'] = data.apply(lambda row: (row['white_rating'] + row['black_rating']) / 2, axis=1)
data['category'] = data['average_rating'] // 100 + 1
# Drop unnecessary columns
data = data.drop(['turns', 'moves', 'victory_status', 'id', 'winner', 'rated', 'created_at', 'last_move_at', 'opening_ply', 'white_id', 'black_id', 'increment_code', 'opening_eco', 'white_rating', 'black_rating'], axis=1)
#Label Encoder Initialisation
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
opening_name_encoded=le.fit_transform(data.opening_name)
category_encoded=le.fit_transform(data.category)
label=le.fit_transform(data.white_wins)
#Package features together
features=zip(opening_name_encoded, category_encoded)
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
And I currently get an error.
Also, I'm not even convinced this is correct: if I continue down this path, I will only be predicting whether white wins based on the opening move set and Elo. I'm really unsure where to take this to get it to the point I need.
Thanks for any help!
zip returns an iterator, so your code is not doing what you expect. My guess is that you intended features to be a list of 2-tuples. If that is the case, then adjust your code to features = list(zip(opening_name_encoded, category_encoded)).
In [31]: zip([1, 2, 3], ['a', 'b', 'c'])
Out[31]: <zip at 0x25d61abfd80>
In [32]: list(zip([1, 2, 3], ['a', 'b', 'c']))
Out[32]: [(1, 'a'), (2, 'b'), (3, 'c')]
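As an alternative to list(zip(...)), the two encoded columns can be stacked into a 2D NumPy array, which is the shape scikit-learn estimators expect. The following is a sketch on made-up rows rather than the Kaggle file. It also keeps a dedicated encoder for the opening names (an assumption beyond the original answer), so the integer codes can be mapped back to names later.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for games.csv
opening_names = ["Sicilian Defense", "French Defense", "Sicilian Defense", "Italian Game"]
categories = [15, 16, 15, 14]          # rating bucket, as in the question
white_wins = [True, False, True, False]

# A separate encoder per column keeps each mapping recoverable
opening_encoder = LabelEncoder()
opening_encoded = opening_encoder.fit_transform(opening_names)

# Shape (n_samples, n_features): one row per game, one column per feature
features = np.column_stack([opening_encoded, categories])
labels = np.asarray(white_wins)

model = GaussianNB()
model.fit(features, labels)
predicted = model.predict(features)

# Map integer codes back to opening names
decoded = opening_encoder.inverse_transform(opening_encoded)
```

Reusing a single LabelEncoder for several columns, as the question does, still runs, but each fit_transform call overwrites the previous mapping.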

How to leave scikit-learn estimator result in dask distributed system?

You can find a minimal working example below (taken directly from the dask-ml page; the only change is to Client(), so that it works in a distributed system).
import numpy as np
from dask.distributed import Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
# Don't forget to start the dask scheduler and connect worker(s) to it.
client = Client('localhost:8786')
digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
But this returns the result to the local machine. This is not exactly my code: in my code I am using the scikit-learn TfidfVectorizer. After I call fit_transform(), it returns the fitted and transformed data (in sparse format) to my local machine. How can I leave the results inside the distributed system (the cluster of machines)?
PS: I just came across this: from dask_ml.wrappers import ParallelPostFit. Maybe this is the solution?
The answer was in front of my eyes and I couldn't see it for 3 days of searching. ParallelPostFit is the answer. The only problem is that it doesn't support fit_transform(), but fit() and transform() work, and transform() returns a lazily evaluated dask array (which is what I was looking for). Be careful about this warning:
Warning
ParallelPostFit does not parallelize the training step. The underlying estimator's .fit method is called normally.

How to convert scalar array to 2d array?

I am new to machine learning and am having trouble converting a scalar array to a 2D array.
I am trying to implement polynomial regression in Spyder. Here is my code. Please help!
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
# Predicting a new result with Polynomial Regression
lin_reg_2.predict(poly_reg.fit_transform(6.5))
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
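This error does not depend on the IDE (Jupyter or Spyder) but on the scikit-learn version: recent releases require predict to receive a 2D array.
To resolve it, turn the value into a 2D NumPy array:
lin_reg.predict(np.array(6.5).reshape(1,-1))
lin_reg_2.predict(poly_reg.fit_transform(np.array(6.5).reshape(1,-1)))
Older scikit-learn releases accepted the original scalar calls, which is probably why they worked in some setups:
lin_reg.predict(6.5)
lin_reg_2.predict(poly_reg.fit_transform(6.5))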
The issue with your code is lin_reg.predict(6.5).
As the error message says, the model requires a 2D array, but 6.5 is a scalar.
Why? Your X data is 2D, so anything you want to predict with the model must also have a 2D shape.
You can achieve this with .reshape(-1, 1), which creates a column vector (one feature, many samples), or with .reshape(1, -1) if you have a single sample.
The thing to remember: to predict, you need to prepare your data in the same way as the original training data.
If you need any more info, let me know.
You have to give the input as a 2D array (a list of rows), so try this:
lin_reg.predict([[6.5]])
lin_reg_2.predict(poly_reg.fit_transform([[6.5]]))
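To make the shape requirement concrete, here is a self-contained sketch that does not need Position_Salaries.csv. The data is synthetic (y is exactly the square of the level, an assumption chosen so the expected prediction is easy to verify), and the new value is transformed with the already fitted PolynomialFeatures instead of refitting it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the dataset: levels 1..10, target = level squared
X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # shape (10, 1): 2D
y = X.ravel() ** 2

# Expand to [1, x, x^2] and fit a linear model on the expanded features
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression().fit(X_poly, y)

# A single sample with a single feature must have shape (1, 1)
new_level = np.array([[6.5]])
prediction = lin_reg_2.predict(poly_reg.transform(new_level))
print(new_level.shape, prediction)
```

np.array(6.5).reshape(1, -1) produces the same (1, 1) array as np.array([[6.5]]).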

MXNet - Dot Product of Sparse Matrices

I'm in the process of building a content recommendation model using MXNet. Although the data is only ~10K rows, out-of-memory errors are thrown with both CPU and GPU contexts in MXNet. The current code is below.
```
import mxnet as mx
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df = pd.read_csv("df_text.csv")
tf = TfidfVectorizer(analyzer = "word",
ngram_range = (1,3),
min_df = 2,
stop_words="english")
tfidf_matrix = tf.fit_transform(df["text_column"])
mx_tfidf = mx.nd.array(tfidf_matrix, ctx=mx.gpu())
# Out of memory error occurs here.
cosine_similarities = mx.ndarray.dot(mx_tfidf, mx_tfidf.T)
```
I'm aware that the dot product is a sparse matrix multiplied by a dense matrix, which may be part of the issue. That said, would the dot product have to be calculated across multiple GPUs in order to prevent out-of-memory issues?
In MXNet (and AFAIK all other platforms) there is no magical "perform dot across GPUs" solution. One option is to use sparse matrices in MXNet (see this tutorial).
Another option is to implement your own multi-GPU dot product by slicing your input array into multiple matrices and performing part of the dot product on each GPU.
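The slicing idea can be illustrated without MXNet. The sketch below swaps in SciPy sparse matrices on the CPU (not the MXNet API), multiplies one block of rows at a time, and keeps only the top-k most similar rows per item so the full n x n similarity matrix is never held in memory. The function name blockwise_top_k and its parameters are hypothetical, and rows are assumed to be L2-normalized (as TfidfVectorizer output is by default), so the dot product equals cosine similarity.

```python
import numpy as np
from scipy import sparse


def blockwise_top_k(matrix, k=1, block_rows=2):
    """Return, for each row, the indices of its k most similar other rows."""
    matrix = sparse.csr_matrix(matrix, dtype=np.float64)
    n_rows = matrix.shape[0]
    transposed = matrix.T.tocsr()
    top = np.empty((n_rows, k), dtype=np.int64)
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        # Only a (block_rows x n_rows) slab of similarities is dense at once
        block = (matrix[start:stop] @ transposed).toarray()
        # Mask each row's similarity with itself
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Sort descending by similarity and keep the first k column indices
        top[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return top


# Three unit-length rows; row 1 lies between row 0 and row 2
rows = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
neighbours = blockwise_top_k(rows, k=1, block_rows=2)
print(neighbours.ravel())
```

The same loop structure would apply to a multi-GPU version: each block of rows is an independent piece of work that could be sent to a different device.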
