GPU runs out of memory when training an ML model - python-3.x
I am trying to train an ML model using Dask. I am training on my local machine with 1 GPU, which has 24 GiB of memory.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb

np.random.seed(42)


def get_columns(filename):
    return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns


def get_data(filename, target):
    import dask_cudf
    X = dask_cudf.read_csv(filename)
    # X = dd.read_csv(filename, assume_missing=True)
    y = X[[target]]
    X = X.iloc[:, :NUM_FEATURES]
    return X, y


def main(client: Client) -> None:
    X, y = get_data(FILENAME, TARGET)
    model = xgb.dask.DaskXGBRegressor(
        tree_method="gpu_hist",
        objective="reg:squarederror",
        seed=42,
        max_depth=5,
        eta=0.01,
        n_estimators=10)
    model.client = client
    model.fit(X, y, eval_set=[(X, y)])

    print("Saving the model..")
    model.get_booster().save_model("xgboost.model")

    print("Doing model importance..")
    columns = get_columns(FILENAME)
    pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")


if __name__ == "__main__":
    os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"
    with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
    # with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            main(client)
The error is as follows.
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded
Basically, my GPU runs out of memory when I call model.fit. It works with a CSV of 64,100 rows and fails with a CSV of 128,198 rows (2x the rows). These aren't large files, so I assume I am doing something wrong.
I have tried fiddling around with
LocalCUDACluster: device_memory_limit and rmm_pool_size
dask_cudf.read_csv: chunksize
Nothing has worked.
I have been stuck on this all day so any help would be much appreciated.
You cannot train an XGBoost model if the model and the data it needs on the device grow larger than the remaining GPU memory. You can scale out with dask-xgboost, but you need to ensure that the total GPU memory across your workers is sufficient.
Here is a great blog post on this by Coiled: https://coiled.io/blog/dask-xgboost-python-example/
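If the data itself fits on the GPU but fit still blows past the RMM pool, one thing that often helps is to build a quantile DMatrix and use the functional xgb.dask.train API instead of passing the raw frames again through eval_set, which can end up materializing the data a second time. This is only a sketch of my own, not the original poster's code, and it assumes an XGBoost version that still provides DaskDeviceQuantileDMatrix (newer releases expose it as DaskQuantileDMatrix):

import xgboost as xgb

def train_lower_memory(client, X, y):
    # DaskDeviceQuantileDMatrix builds the quantized gpu_hist index directly
    # from the input, avoiding a full extra copy of the data on the device.
    dtrain = xgb.dask.DaskDeviceQuantileDMatrix(client, X, y)
    params = {
        "tree_method": "gpu_hist",
        "objective": "reg:squarederror",
        "max_depth": 5,
        "eta": 0.01,
        "seed": 42,
    }
    output = xgb.dask.train(
        client,
        params,
        dtrain,
        num_boost_round=10,
        evals=[(dtrain, "train")],  # reuses dtrain rather than rebuilding a DMatrix
    )
    return output["booster"]  # output["history"] holds the per-round train metrics

Whether this is enough depends on the XGBoost version and on how much of the 24 GiB the RMM pool has already reserved, so treat it as a starting point rather than a guaranteed fix.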
Related
RandomSearchCV is too slow when working on a pipeline
I'm testing an approach that runs feature selection together with hyperparameter search: the feature selection algorithm SequentialFeatureSelector (from mlxtend) wrapped in RandomizedSearchCV around an XGBoost model. I run the following code:

from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd

def main():
    df = pd.read_csv("input.csv")
    x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
    y = df[['y']]
    model = XGBClassifier(n_jobs=-1)
    sfs = SequentialFeatureSelector(model, k_features="best", forward=True, floating=False,
                                    scoring="accuracy", cv=2, n_jobs=-1)
    params = {'xgboost__max_depth': [2, 4], 'sfs__k_features': [1, 4]}
    pipe = Pipeline([('sfs', sfs), ('xgboost', model)])
    randomized = RandomizedSearchCV(estimator=pipe, param_distributions=params, n_iter=2, cv=2,
                                    random_state=40, scoring='accuracy', refit=True, n_jobs=-1)
    res = randomized.fit(x.values, y.values)

if __name__ == '__main__':
    main()

The file input.csv has only 39 rows of data (not including the header):

f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1

As you can see, the amount of data is very small, and there are only a few parameters to optimize. I checked the number of CPUs with lscpu and got CPU(s): 12, so 12 threads can be created and run in parallel. I also checked this post: RandomSearchCV super slow - troubleshooting performance enhancement, but I already use n_jobs=-1. So why does it run so slowly? (More than 15 minutes!)
How to train an image similarity model on 20 million images (total size 10 GB)?
My system is configured with 16 GB RAM. I have tried to train an image similarity model on 20 million images (total size 10 GB) using VGG19 and KNN nearest neighbors. When trying to read the images I get a MemoryError. I have even tried to train the model on 200,000 images (total size 770 MB), but the issue is the same. How can I read millions of images to train ML models?

System: Ubuntu 18.04.2 LTS, Core™ i7, Intel® HD Graphics 5500 (Broadwell GT2), 64-bit, 16 GB RAM

import os
import skimage.io
import tensorflow as tf
from skimage.transform import resize
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from sklearn import manifold
import pickle

skimage.io.use_plugin('matplotlib')

dirPath = 'train_data'
args = [os.path.join(dirPath, filename) for filename in os.listdir(dirPath)]
imgs_train = [skimage.io.imread(arg, as_gray=False) for arg in args]

shape_img = (130, 130, 3)
model = tf.keras.applications.VGG19(weights='imagenet', include_top=False, input_shape=shape_img)
model.summary()

shape_img_resize = tuple([int(x) for x in model.input.shape[1:]])
input_shape_model = tuple([int(x) for x in model.input.shape[1:]])
output_shape_model = tuple([int(x) for x in model.output.shape[1:]])
n_epochs = None

def resize_img(img, shape_resized):
    img_resized = resize(img, shape_resized, anti_aliasing=True, preserve_range=True)
    assert img_resized.shape == shape_resized
    return img_resized

def normalize_img(img):
    return img / 255.

def transform_img(img, shape_resize):
    img_transformed = resize_img(img, shape_resize)
    img_transformed = normalize_img(img_transformed)
    return img_transformed

def apply_transformer(imgs, shape_resize):
    imgs_transform = [transform_img(img, shape_resize) for img in imgs]
    return imgs_transform

imgs_train_transformed = apply_transformer(imgs_train, shape_img_resize)
X_train = np.array(imgs_train_transformed).reshape((-1,) + input_shape_model)
E_train = model.predict(X_train)
E_train_flatten = E_train.reshape((-1, np.prod(output_shape_model)))
knn = NearestNeighbors(n_neighbors=5, metric="cosine")
knn.fit(E_train_flatten)
Keras works well with generators, so you should consider using one: python generator tutorial, using a generator with Keras (example). It allows you to load your images during training, batch by batch.
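For the VGG19 feature-extraction case in the question, a minimal sketch of that idea (my own illustration, assuming the same train_data layout and a hypothetical batch size of 32, and mirroring the question's resize-and-scale preprocessing) reads the files lazily and only ever keeps one batch of raw images in RAM:

import os
import numpy as np
import skimage.io
from skimage.transform import resize
import tensorflow as tf

def batch_generator(file_paths, target_shape=(130, 130, 3), batch_size=32):
    # Yield batches of preprocessed images, loading them from disk only when needed.
    for start in range(0, len(file_paths), batch_size):
        batch_paths = file_paths[start:start + batch_size]
        batch = [resize(skimage.io.imread(p), target_shape,
                        anti_aliasing=True, preserve_range=True) / 255.
                 for p in batch_paths]
        yield np.array(batch)

dirPath = 'train_data'
paths = [os.path.join(dirPath, f) for f in os.listdir(dirPath)]
model = tf.keras.applications.VGG19(weights='imagenet', include_top=False,
                                    input_shape=(130, 130, 3))

# Accumulate only the (much smaller) embeddings, not the raw images.
embeddings = [model.predict(batch).reshape(len(batch), -1)
              for batch in batch_generator(paths)]
E_train_flatten = np.vstack(embeddings)

The embeddings can then be fed to NearestNeighbors as before; for 20 million images you would additionally want to write them to disk incrementally rather than hold them all in one list.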
Running Python code consumes the GPU. Why?
This is my Python code for a model prediction.

import csv
import numpy as np
np.random.seed(1)
from keras.models import load_model
import tensorflow as tf
import pandas as pd
import time

output_location = 'Desktop/result/'

# load model
global graph
graph = tf.get_default_graph()
model = load_model("newmodel.h5")

def Myfun():
    ecg = pd.read_csv('/Downloads/model.csv')
    X = ecg.iloc[:, 1:42].values
    y = ecg.iloc[:, 42].values
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    y1 = encoder.fit_transform(y)
    Y = pd.get_dummies(y1).values
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
    t1 = timer()  # NOTE: timer() is used but only time is imported in the snippet as posted
    with graph.as_default():
        prediction = model.predict(X_test[0:1])
    diff = timer() - t1
    class_labels_predicted = np.argmax(prediction)
    filename1 = str(i) + "output.txt"
    newfile = output_location + filename1
    with open(str(newfile), 'w', encoding='utf-8') as file:
        file.write(" takes %f seconds time. predictedclass is %s \n" % (diff, class_labels_predicted))
    return class_labels_predicted

for i in range(1, 100):
    Myfun()

My system's GPU has 2 GB of memory. While running this code, nvidia-smi -l 2 shows it consuming 1.8 GB of GPU memory, and 100 output files are produced. Soon after the task completes, GPU utilisation drops back to 500 MB. I have the GPU versions of TensorFlow and Keras installed on my system. My question: why does this code run on the GPU? Does the complete code use the GPU, or only the parts that import libraries such as keras-gpu and tensorflow-gpu?
As I can see from your code, you are using Keras with TensorFlow. From the Keras FAQ: "If you are running on the TensorFlow or CNTK backends, your code will automatically run on GPU if any available GPU is detected."
You can force Keras to run on CPU only:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
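Two small additions of my own, not from the original answer: the environment variables above have to be set before Keras/TensorFlow is imported, and you can check which devices the backend actually sees with:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # lists the CPU and any GPUs TensorFlow can use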
Run Python code using the GPU
The Jena Climate code is as follows:

import numpy as np
import os
from matplotlib import pyplot as plt
from numba import vectorize

f = open('jena.csv')
data = f.read()
f.close()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print(header)
N = len(lines)
print(N)

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std

def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets

lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, lookback=lookback, delay=delay, min_index=0, max_index=200000, shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay, min_index=200001, max_index=300000, step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay, min_index=300001, max_index=None, step=step, batch_size=batch_size)

val_steps = (300000 - 200001 - lookback)
test_steps = (len(float_data) - 300001 - lookback)

def evaluate_naive_method():
    batch_maes = []
    for step in range(val_steps):
        samples, targets = next(val_gen)
        mae = np.mean(np.abs(preds - targets))  # NOTE: preds is not defined in the snippet as posted
        batch_maes.append(mae)
    print(np.mean(batch_maes))

evaluate_naive_method()

When I execute the code, it uses the CPU and takes approximately 14 minutes to produce the MAE. I want to use TensorFlow with the GPU in this section so that the output is produced faster:

for step in range(val_steps):
    samples, targets = next(val_gen)
    mae = np.mean(np.abs(preds - targets))
    batch_maes.append(mae)

Should I convert the variables "samples" and "targets" into TensorFlow tensors so that I can get the output faster? If so, how can I convert them to TensorFlow?
TensorFlow does what you want here; please have a look at the guide on using a GPU: https://www.tensorflow.org/guide/using_gpu
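For the specific loop in the question, here is a sketch of my own (not from the original answer) of moving the MAE computation onto TensorFlow ops, which are placed on the GPU automatically when one is visible; it assumes TF 2.x eager mode and that preds holds the baseline predictions:

import tensorflow as tf

def batch_mae(preds, targets):
    # Convert the NumPy batches to tensors; the subtraction, abs and mean
    # then run on the GPU if TensorFlow can see one.
    preds_t = tf.convert_to_tensor(preds, dtype=tf.float32)
    targets_t = tf.convert_to_tensor(targets, dtype=tf.float32)
    return float(tf.reduce_mean(tf.abs(preds_t - targets_t)))

Note, though, that each batch here is small and most of the time is spent in the Python generator producing the batches, so the GPU is unlikely to make this particular loop much faster.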
How to reduce memory usage?
I am trying to generate a pickle file of the predictions on my dataset, but after executing the code for 6 hours the PC keeps going out of memory. I wonder if anyone can help me with this?

from keras.models import load_model
import sys
sys.setrecursionlimit(10000)
import pickle
import os
import cv2
import glob

dirlist = []
imgdirs = os.listdir('/chars/')
imgdirs.sort(key=float)
for imgdir in imgdirs:
    imglist = []
    for imgfile in glob.glob(os.path.join('/chars/', imgdir, '*.png')):
        img = cv2.imread(imgfile)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        model = load_model('mymodel.h5')
        predictions = model.predict(img)
        print('predicted model:', predictions)
        imglist.append(predictions)
    dirlist.append(imglist)

q = open("predict.pkl", "wb")
pickle.dump(dirlist, q)
q.close()
First of all, why do you reload your model for every prediction? The code would be much faster if you loaded the model only once and then did the predictions. Loading several pictures at once and predicting in batches would also be a big speed boost. Which out-of-memory error do you get: one from TensorFlow (or whichever backend you're using), or one from Python? My best guess would be that load_model is loading the same model over and over into the same TensorFlow session until your resources are exhausted. The solution, as stated above, is to load the model once at the beginning.
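A rough sketch of that restructuring (my own illustration, keeping the question's paths and grayscale preprocessing, and assuming the images within each directory share a size and that the model accepts such a batch):

from keras.models import load_model
import numpy as np
import os
import cv2
import glob
import pickle

model = load_model('mymodel.h5')  # load the model once, outside every loop

dirlist = []
for imgdir in sorted(os.listdir('/chars/'), key=float):
    files = glob.glob(os.path.join('/chars/', imgdir, '*.png'))
    # Read and preprocess all images of this directory, then predict them in one batch.
    imgs = [cv2.cvtColor(cv2.imread(f), cv2.COLOR_BGR2GRAY) for f in files]
    predictions = model.predict(np.stack(imgs)) if imgs else []
    dirlist.append(predictions)

with open("predict.pkl", "wb") as q:
    pickle.dump(dirlist, q)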