How to leave scikit-learn estimator results in a dask distributed system? - python-3.x

You can find a minimal working example below (taken directly from the dask-ml page; the only change is to Client() so that it runs on a distributed system):
import numpy as np
from dask.distributed import Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Don't forget to start the dask scheduler and connect worker(s) to it.
client = Client('localhost:8786')

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
But this returns the result to the local machine. This is not exactly my code; in my code
I am using scikit-learn's TfidfVectorizer. After I call fit_transform(), it returns the fitted and transformed data (in sparse format) to my local machine. How can I leave the results inside the distributed system (the cluster of machines)?
PS: I just came across "from dask_ml.wrappers import ParallelPostFit". Maybe this is the solution?

The answer was in front of my eyes and I couldn't see it for three days of searching: ParallelPostFit is the answer. The only problem is that it doesn't support fit_transform(), but fit() and transform() work, and transform() returns a lazily evaluated dask array (which is exactly what I was looking for). Be careful about this warning:
Warning
ParallelPostFit does not parallelize the training step. The underlying
estimator’s .fit method is called normally.
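
For anyone landing here later, a minimal sketch of the pattern that worked for me (untested as written; the file name, column name, and the idea of feeding transform() a dask collection are illustrative, so check the details against your dask-ml version):

import dask.dataframe as dd
from dask_ml.wrappers import ParallelPostFit
from sklearn.feature_extraction.text import TfidfVectorizer

# fit() runs locally on in-memory data, as the warning above says
vectorizer = ParallelPostFit(TfidfVectorizer())
vectorizer.fit(["some local", "training texts"])  # small in-memory sample

# transform() on a dask collection is lazy and runs on the workers
texts = dd.read_csv('documents.csv')['text']  # hypothetical file and column
tfidf = vectorizer.transform(texts)

# persist() keeps the result on the cluster instead of pulling it back
tfidf = tfidf.persist()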

Related

RandomizedSearchCV is too slow when working on a pipeline

I'm testing an approach that runs feature selection together with hyperparameter search.
I'm running the SequentialFeatureSelector feature-selection algorithm (from mlxtend) inside RandomizedSearchCV with an xgboost model.
I run the following code:
from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd

def main():
    df = pd.read_csv("input.csv")
    x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
    y = df[['y']]
    model = XGBClassifier(n_jobs=-1)
    sfs = SequentialFeatureSelector(model, k_features="best", forward=True, floating=False,
                                    scoring="accuracy", cv=2, n_jobs=-1)
    params = {'xgboost__max_depth': [2, 4], 'sfs__k_features': [1, 4]}
    pipe = Pipeline([('sfs', sfs), ('xgboost', model)])
    randomized = RandomizedSearchCV(estimator=pipe, param_distributions=params, n_iter=2, cv=2,
                                    random_state=40, scoring='accuracy', refit=True, n_jobs=-1)
    res = randomized.fit(x.values, y.values)

if __name__ == '__main__':
    main()
The file input.csv has only 39 rows of data (not including the header):
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
As you can see, the amount of data is very small, and there are only a few parameters to optimize.
I checked the number of CPUs:
lscpu
and I got:
CPU(s): 12
so 12 threads can be created and run in parallel.
I checked this post:
RandomSearchCV super slow - troubleshooting performance enhancement
But I already use n_jobs=-1.
So why does it run so slowly? (more than 15 minutes!)
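
One thing worth checking: n_jobs=-1 is set at three nested levels here (the XGBClassifier, the SequentialFeatureSelector, and the RandomizedSearchCV), and nested parallelism like that is a classic source of core oversubscription. A sketch of the usual mitigation, keeping parallelism only at the outermost level (a hypothesis to test, not a verified fix for this exact case):

from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

# Inner estimators run single-threaded; only the outer search fans out.
model = XGBClassifier(n_jobs=1)
sfs = SequentialFeatureSelector(model, k_features="best", forward=True, floating=False,
                                scoring="accuracy", cv=2, n_jobs=1)
pipe = Pipeline([('sfs', sfs), ('xgboost', model)])
randomized = RandomizedSearchCV(estimator=pipe,
                                param_distributions={'xgboost__max_depth': [2, 4],
                                                     'sfs__k_features': [1, 4]},
                                n_iter=2, cv=2, random_state=40,
                                scoring='accuracy', refit=True,
                                n_jobs=-1)  # parallelism lives here only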

Recover feature names from pipeline [duplicate]

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer API in sklearn returns a numpy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this part of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that share the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep the API generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names out of a ColumnTransformer. I am not sure whether it captures every possible kind of ColumnTransformer, but at least it solves your problem.
From the documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer

train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})

numeric_columns = ['age']
cat_columns = ['Gender', 'income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(transformers, remainder='passthrough')

transformed_data = combined_pipe.fit_transform(train.drop('y', axis=1), train['y'])

def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handles all vectorizers
            return [f'vec_{f}' for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators and pipelines inside the ColumnTransformer;
    # remainder='passthrough' is resolved via the private ct._feature_names_in,
    # which is only available when the transformer was fitted on a DataFrame
    output_features = []
    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])
    return output_features

pd.DataFrame(transformed_data, columns=get_ct_feature_names(combined_pipe))
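
As a follow-up: on newer scikit-learn releases (1.0 and later, so check your version), ColumnTransformer has a built-in get_feature_names_out() that replaces this wrapper, provided every step in the pipeline implements it:

pd.DataFrame(transformed_data, columns=combined_pipe.get_feature_names_out())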

Why is the start_queue_runners() function used with the shuffle_batch() function?

I am new to TensorFlow and trying to understand the shuffle_batch() function. When I use shuffle_batch() with the following code, it does not print anything.
import tensorflow as tf

sess = tf.Session()
random = tf.random_normal([5], mean=0.0, stddev=1.0)
shu = tf.train.shuffle_batch([random], 20, 100, 10)
print(sess.run(shu))
But after adding start_queue_runners() it gives me the expected output. So what is the relationship between start_queue_runners() and shuffle_batch()?
import tensorflow as tf

sess = tf.Session()
random = tf.random_normal([5], mean=0.0, stddev=1.0)
shu = tf.train.shuffle_batch([random], 20, 100, 10)
threads = tf.train.start_queue_runners(sess=sess)
print(sess.run(shu))
To answer the question directly: shuffle_batch() builds a RandomShuffleQueue plus a QueueRunner that fills it, and nothing enqueues examples until start_queue_runners() launches those threads, which is why sess.run(shu) blocks forever without it.
That said, the whole queue pipeline has been replaced by tf.data; you should have a look at the tf.data guide instead.
It is much simpler to use the dataset API.
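
A rough tf.data equivalent of the snippet above (written for TF 1.x graph mode to match the question; the data shape, buffer size, and batch size mirror the shuffle_batch arguments):

import numpy as np
import tensorflow as tf

data = np.random.randn(100, 5).astype(np.float32)

# shuffle() replaces the RandomShuffleQueue; batch() replaces the batching queue
dataset = (tf.data.Dataset.from_tensor_slices(data)
           .shuffle(buffer_size=100)
           .batch(20))
batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(batch))  # no queue runners needed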

MXNet - Dot Product of Sparse Matrices

I'm in the process of building a content recommendation model using MXNet. Despite the data being only ~10K rows, out-of-memory errors are thrown with both CPU and GPU contexts in MXNet. The current code is below.
```
import mxnet as mx
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

df = pd.read_csv("df_text.csv")

tf = TfidfVectorizer(analyzer="word",
                     ngram_range=(1, 3),
                     min_df=2,
                     stop_words="english")
tfidf_matrix = tf.fit_transform(df["text_column"])

mx_tfidf = mx.nd.array(tfidf_matrix, ctx=mx.gpu())

# Out of memory error occurs here.
cosine_similarities = mx.ndarray.dot(mx_tfidf, mx_tfidf.T)
```
I'm aware that the dot product here is a sparse matrix multiplied by a dense matrix, which may be part of the issue. That said, would the dot product have to be calculated across multiple GPUs in order to prevent out-of-memory issues?
In MXNet (and, AFAIK, all other platforms) there is no magical "perform dot across GPUs" solution. One option is to use sparse matrices in MXNet (see this tutorial).
Another option is to implement your own multi-GPU dot product by slicing your input array into multiple matrices and performing parts of the dot product on each GPU.
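
A rough sketch of the first option combined with blockwise slicing (untested, and the block size is illustrative; note that mx.nd.array(tfidf_matrix, ...) in the question densifies the scipy matrix up front, which is likely where the memory goes):

import mxnet as mx

# keep the TF-IDF matrix sparse on the MXNet side (CSRNDArray)
csr = mx.nd.sparse.array(tfidf_matrix, ctx=mx.cpu())

# densify the transposed operand once; each similarity block stays small
dense_t = csr.tostype('default').T

blocks = []
block = 1000
for start in range(0, csr.shape[0], block):
    rows = csr[start:start + block]          # sparse slice of rows
    blocks.append(mx.nd.dot(rows, dense_t))  # dot(csr, dense) -> dense block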

ExtraTreesClassifier with sparse training data?

I am trying to use an ExtraTreesClassifier with sparse data, as per the documentation. However, I get a runtime TypeError asking for dense data. This is on scikit-learn 0.17.1; below I quote from the documentation:
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
The code is quite simple:
import pandas as pd
from scipy.sparse import coo_matrix, csr_matrix, hstack
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
from scipy import *
features = array([[1, 0], [0, 1], [3, 4]])
sparse_features = csr_matrix(features)
labels = array([0, 1, 0])
classifier = ExtraTreesClassifier()
classifier.fit(sparse_features, labels)
And here is the exception: TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. This works fine when passing in features directly.
Is the documentation out of date, or is there something wrong with the above code?
Any help will be greatly appreciated. Thank you.
Quoting the documentation:
Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
So I expect that passing a csc_matrix should help.
On my setup both versions work normally (csc and csr, sklearn 0.17.1), so I assume the problem lies with an older version of scipy.
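
A minimal sketch of the suggested change, reusing the toy data from the question:

from scipy.sparse import csc_matrix

sparse_features = csc_matrix(features)  # csc instead of csr
classifier.fit(sparse_features, labels)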
