RandomizedSearchCV is too slow when working on a pipeline - python-3.x
I'm testing an approach that combines feature selection with hyperparameter tuning: the feature selection algorithm SequentialFeatureSelector is run inside a RandomizedSearchCV hyperparameter search, with an XGBoost model.
I run the following code:
from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd
def main():
    df = pd.read_csv("input.csv")
    x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
    y = df[['y']]
    model = XGBClassifier(n_jobs=-1)
    sfs = SequentialFeatureSelector(model, k_features="best", forward=True, floating=False,
                                    scoring="accuracy", cv=2, n_jobs=-1)
    params = {'xgboost__max_depth': [2, 4], 'sfs__k_features': [1, 4]}
    pipe = Pipeline([('sfs', sfs), ('xgboost', model)])
    randomized = RandomizedSearchCV(estimator=pipe, param_distributions=params, n_iter=2, cv=2,
                                    random_state=40, scoring='accuracy', refit=True, n_jobs=-1)
    res = randomized.fit(x.values, y.values)

if __name__ == '__main__':
    main()
The file input.csv has only 39 rows of data (not including the header):
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
As you can see, the amount of data is very small, and there are only a few parameters to optimize.
I checked the number of CPUs:
lscpu
and I got:
CPU(s): 12
so 12 threads can be created and run in parallel.
I checked this post:
RandomSearchCV super slow - troubleshooting performance enhancement
But I already use n_jobs=-1.
So why does it run so slowly? (It takes more than 15 minutes!)
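For scale, here is a back-of-the-envelope count of how many XGBoost trainings this nested setup can trigger (a sketch, not measured output, assuming the forward search grows subsets up to k_features=4 out of the 8 features):

# Rough fit count; the exact number depends on which k_features values get sampled.
n_iter, search_cv, sfs_cv, n_features, k = 2, 2, 2, 8, 4
subsets_per_sfs = sum(n_features - step for step in range(k))   # 8+7+6+5 = 26 candidate subsets
xgb_fits_per_pipeline_fit = subsets_per_sfs * sfs_cv + 1        # SFS inner CV fits + the final xgboost step
pipeline_fits = n_iter * search_cv + 1                          # 4 search CV fits + 1 refit
print(pipeline_fits * xgb_fits_per_pipeline_fit)                # around 265 XGBoost trainings

On top of that, n_jobs=-1 is set on RandomizedSearchCV, on SequentialFeatureSelector and on XGBClassifier at the same time, so three nested levels of parallelism can end up competing for the same 12 threads.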
Related
GPU runs out of memory when training an ML model
I am trying to train an ML model using dask. I am training on my local machine with 1 GPU. My GPU has 24 GiB of memory.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb

np.random.seed(42)

def get_columns(filename):
    return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns

def get_data(filename, target):
    import dask_cudf
    X = dask_cudf.read_csv(filename)
    # X = dd.read_csv(filename, assume_missing=True)
    y = X[[target]]
    X = X.iloc[:, :NUM_FEATURES]
    return X, y

def main(client: Client) -> None:
    X, y = get_data(FILENAME, TARGET)
    model = xgb.dask.DaskXGBRegressor(
        tree_method="gpu_hist",
        objective="reg:squarederror",
        seed=42,
        max_depth=5,
        eta=0.01,
        n_estimators=10)
    model.client = client
    model.fit(X, y, eval_set=[(X, y)])
    print("Saving the model..")
    model.get_booster().save_model("xgboost.model")
    print("Doing model importance..")
    columns = get_columns(FILENAME)
    pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")

if __name__ == "__main__":
    os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"
    with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
    # with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            main(client)

The error is as follows.

MemoryError: std::bad_alloc: out_of_memory: RMM failure at: /workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded

Basically my GPU runs out of memory when I call model.fit. It works when I use a csv with 64100 rows and fails when I use a csv with 128198 rows (2x rows). These aren't large files, so I assume I am doing something wrong.

I have tried fiddling around with:

LocalCUDACluster: device_memory_limit and rmm_pool_size
dask_cudf.read_csv: chunksize

Nothing has worked. I have been stuck on this all day, so any help would be much appreciated.
You cannot train an XGBoost model whose memory footprint grows larger than the remaining GPU memory. You can scale out with dask_xgboost, but you need to ensure that the total GPU memory across workers is sufficient. Here is a great blog post on this by Coiled: https://coiled.io/blog/dask-xgboost-python-example/
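For reference, a hedged sketch of what scaling the same training out across several GPUs with dask_cuda could look like; the two-GPU CUDA_VISIBLE_DEVICES value, the train.csv file and its target column are placeholders, not details taken from the question above:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import xgboost as xgb

if __name__ == "__main__":
    # One dask-cuda worker per listed GPU, so the training data is partitioned
    # across devices instead of having to fit next to a single card's RMM pool.
    with LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1") as cluster:
        with Client(cluster) as client:
            X = dask_cudf.read_csv("train.csv")   # placeholder file name
            y = X["target"]                       # placeholder target column
            X = X.drop(columns=["target"])
            model = xgb.dask.DaskXGBRegressor(tree_method="gpu_hist", n_estimators=10)
            model.client = client
            model.fit(X, y)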
kernel dies when computing DBSCAN in scikit-learn after dimensionality reduction
I have some data after using ColumnTransformer(), like

>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
    with 3553758 stored elements in Compressed Sparse Row format>

I transform the data using TruncatedSVD(), which seems to work:

>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])

Now I want to apply the transformed data to DBSCAN:

>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)

but my kernel crashes.

I also tried converting it back to a DataFrame and applying it to DBSCAN:

>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)

But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.

EDIT: If I use just part of my 197431x3 array, it works up to X_trans_svd[0:170000] and starts crashing at X_trans_svd[0:180000]. Furthermore, the size of the array is

>>> X_trans_svd.nbytes
4738344

EDIT2: Sorry for not doing this earlier. Here's an example to reproduce. I tried two machines with 16 and 64 GB RAM. Data is here: original data

import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN

s = np.loadtxt('data.txt', dtype='float')

elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)
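One way to see where the memory goes (a hedged sketch on synthetic stand-in data, not the real array): scikit-learn's DBSCAN materializes the eps-neighborhood of every point before clustering, so its footprint scales with the total number of neighbor pairs rather than with X_trans_svd.nbytes.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the 3-component SVD output (an assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(20000, 3))

# Count the neighbours within eps=0.5 of each point. DBSCAN keeps all of these
# neighbourhoods in memory at once, so a large average here translates into a
# very large footprint on the full 197431-row array.
nn = NearestNeighbors(radius=0.5).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)
counts = np.array([len(idx) for idx in neighborhoods])
print(counts.mean(), counts.max())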
How to leave scikit-learn estimator results in a dask distributed system?
You can find a minimal working example below (directly taken from the dask-ml page; the only change is to Client() to make it work in a distributed system).

import numpy as np
from dask.distributed import Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Don't forget to start the dask scheduler and connect worker(s) to it.
client = Client('localhost:8786')

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)

But this returns the result to the local machine. This is not exactly my code. In my code I am using the scikit-learn tf-idf vectorizer. After I use fit_transform(), it returns the fitted and transformed data (in sparse format) to my local machine. How can I leave the results inside the distributed system (a cluster of machines)?

PS: I just encountered this:

from dask_ml.wrappers import ParallelPostFit

Maybe this is the solution?
The answer was in front of my eyes and I couldn't see it for 3 days of searching. ParallelPostFit is the answer. The only problem is that it doesn't support fit_transform(), but fit() and transform() work, and it returns a lazily evaluated dask array (which is what I was looking for). Be careful about this warning:

Warning: ParallelPostFit does not parallelize the training step. The underlying estimator's .fit method is called normally.
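A minimal sketch of that pattern, assuming a running scheduler at localhost:8786 as in the question; StandardScaler and the random dask array are placeholders standing in for the tf-idf step:

import dask.array as da
from dask.distributed import Client
from dask_ml.wrappers import ParallelPostFit
from sklearn.preprocessing import StandardScaler

client = Client('localhost:8786')

# Synthetic stand-in for the real feature matrix (an assumption).
X = da.random.random((100_000, 20), chunks=(10_000, 20))

# fit() is not parallelized (see the warning above), so fit on a small sample.
wrapped = ParallelPostFit(estimator=StandardScaler())
wrapped.fit(X[:10_000].compute())

# transform() is applied block-wise and stays lazy: nothing is pulled back to
# the local machine until .compute() is called, and .persist() keeps the
# materialized blocks on the cluster workers.
X_scaled = wrapped.transform(X)
X_scaled = X_scaled.persist()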
How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?
I'm trying to make a Pipeline with GridSearchCV to filter data (with iForest) and perform a regression with StandardScaler + MLPRegressor. I made a FunctionTransformer to include my iForest filter in the pipeline. I also define a parameter grid for the iForest filter (using the kw_args method). All seems OK, but when I run the fit, nothing happens... No error message. Nothing. Afterwards, when I want to make a prediction, I get the message: "This RandomizedSearchCV instance is not fitted yet"

from sklearn.preprocessing import FunctionTransformer

# Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    # iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)
    # The DF (dataframe in input) is filtered taking into account only the inlier observations
    data_filtered = DF[iforest.predict(DF) == 1]
    # Only a few variables are kept for the next step (regression by MLPRegressor)
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP',
                                'BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)

# Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)),
                  ('scaler', StandardScaler()),
                  ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True,
                                        n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]

# Grid search definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}],
              'MLPR__hidden_layer_sizes': [(sp_randint.rvs(1, nb_features, 1),),
                                           (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))],
              'MLPR__alpha': sp_rand.rvs(0, 1, 1)}

pipeline = Pipeline(pipeline_steps)
estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)
You can try to run the steps manually, one by one, to find the problem:

auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True,
                    n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)

If each of the steps works fine, build a pipeline. Check the pipeline. If it works fine, try to use RandomizedSearchCV.
The func parameter of FunctionTransformer should be a callable that accepts the same arguments as the transform method (an array-like X of shape (n_samples, n_features) plus kwargs for func) and returns a transformed X of the same shape. Your function auto_filter doesn't fit these requirements.

Additionally, anomaly/outlier detection techniques from scikit-learn cannot be used as intermediate steps in scikit-learn pipelines, since a pipeline assembles one or more transformers and an optional final estimator. IsolationForest or, say, OneClassSVM is not a transformer: it implements fit and predict.

Thus, a possible solution is to cut off possible outliers separately and build a pipeline composed of transformers and a regressor:

>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> detect = IsolationForest(contamination=0.1, behaviour='new')
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
...                  ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
...                             n_iter=2, cv=3, iid=True)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])

The problem is that you won't be able to optimize the hyperparameters of IsolationForest this way. One way to handle it is to define a hyperparameter space for the forest, sample hyperparameters with ParameterSampler or ParameterGrid, predict the inliers and fit the randomized search:

>>> from sklearn.model_selection import ParameterGrid
>>> forest_param_dict = dict(contamination=[0.1, 0.15, 0.2])
>>> forest_param_grid = ParameterGrid(forest_param_dict)
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search.fit(X[inliers_mask], y[inliers_mask])
Running Python code consumes the GPU. Why?
This is my Python code for a model prediction.

import csv
import numpy as np
np.random.seed(1)
from keras.models import load_model
import tensorflow as tf
import pandas as pd
import time
from timeit import default_timer as timer  # timer() is used below; adding the import so the snippet runs

output_location = 'Desktop/result/'

# load model
global graph
graph = tf.get_default_graph()
model = load_model("newmodel.h5")

def Myfun():
    ecg = pd.read_csv('/Downloads/model.csv')
    X = ecg.iloc[:, 1:42].values
    y = ecg.iloc[:, 42].values
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    y1 = encoder.fit_transform(y)
    Y = pd.get_dummies(y1).values
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
    t1 = timer()
    with graph.as_default():
        prediction = model.predict(X_test[0:1])
    diff = timer() - t1
    class_labels_predicted = np.argmax(prediction)
    filename1 = str(i) + "output.txt"
    newfile = output_location + filename1
    with open(str(newfile), 'w', encoding='utf-8') as file:
        file.write(" takes %f seconds time. predictedclass is %s \n" % (diff, class_labels_predicted))
    return class_labels_predicted

for i in range(1, 100):
    Myfun()

My system GPU has 2 GB of memory. While running this code, nvidia-smi -l 2 shows it consumes 1.8 GB of GPU memory, and 100 files are produced as a result. Soon after the task completes, GPU utilisation drops back to 500 MB. I have the GPU versions of TensorFlow and Keras installed on my system.

My question is: why does this code run on the GPU? Does the complete code use the GPU, or is that only because of importing libraries such as keras-gpu and tensorflow-gpu?
As I can see from your code, you are using Keras and TensorFlow. From the Keras F.A.Q.:

If you are running on the TensorFlow or CNTK backends, your code will automatically run on GPU if any available GPU is detected.
You can force Keras to run on the CPU only:

import os
# These must be set before TensorFlow/Keras is imported for the first time.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""