I'm using scikit-learn Random Forest to fit training data (~30 MB), and my laptop keeps crashing because it runs out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air (2 GHz, 8 GB memory).
What are some of the ways to deal with this?
rf = RandomForestClassifier(n_estimators = 100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
Your best choice is to tune the arguments.
n_jobs=4
Here n_jobs is passed to the classifier, so the forest builds up to four trees at the same time. When the parallel jobs run in separate Python processes, the full dataset is copied into each of them, so n_jobs=4 can use roughly four times the memory of n_jobs=1. Try reducing n_jobs to 2 or 1 to save memory.
cv=20
This splits the data into 20 pieces and the code does 20 train-test iterations, each training on 19 of the 20 pieces. You can quite safely reduce it to 10, although your accuracy estimate might get slightly worse. It won't save much memory, but it shortens the runtime.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (2-fold increase in runtime). To compensate for the longer runtime, I'd suggest changing cv to 10 (2-fold savings in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (processing becomes about two times faster again).
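The suggested settings could look roughly like the sketch below. It is only an illustration: the synthetic X_demo / y_demo stand in for X_train_a / y_train, which are not shown in the question, and it uses sklearn.model_selection, which replaced the older cross_validation module in newer scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_a / y_train (assumption: the real data is not shown).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

# First step of the advice: n_jobs=2 instead of 4, cv=10 instead of 20.
# If memory is still a problem, try n_jobs=1 and n_estimators=50.
rf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
scores = cross_val_score(rf, X_demo, y_demo, cv=10, scoring="roc_auc")
mean_auc = float(np.mean(scores))
print("10 Fold CV Score:", mean_auc)
```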
I was dealing with a ~4MB dataset, and a Random Forest from scikit-learn with default hyper-parameters took ~50MB (more than 10 times the size of the data). Setting max_depth = 6 decreased the memory consumption 66 times. The performance of the shallow Random Forest on my dataset even improved!
I wrote up this experiment in a blog post.
From my experience, in regression tasks the memory usage can grow even more, so it is important to control the tree depth. The tree depth can be controlled directly with max_depth or indirectly by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, and max_leaf_nodes.
The memory used by the Random Forest can of course also be controlled with the number of trees in the ensemble.
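One rough way to see the effect of max_depth is to compare the pickled size of two forests. This is a sketch on made-up regression data, not the experiment from the blog post; the pickled size is only a proxy for in-memory size, and pickled_size is a helper introduced here for illustration.

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Made-up data, only to have something to fit (assumption, not the original dataset).
X_demo, y_demo = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

def pickled_size(model):
    """Return the number of bytes the model takes when pickled."""
    return len(pickle.dumps(model))

# Default settings: trees grow until the leaves are (almost) pure.
deep = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)
# Same forest, but every tree is capped at depth 6.
shallow = RandomForestRegressor(n_estimators=20, max_depth=6, random_state=0).fit(X_demo, y_demo)

deep_bytes = pickled_size(deep)
shallow_bytes = pickled_size(shallow)
print("default depth:", deep_bytes, "bytes; max_depth=6:", shallow_bytes, "bytes")
```

Whether the shallow forest also predicts better, as it did in my case, depends on the dataset and has to be checked with cross-validation.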
I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.
Now I am in a multi-node multi-gpu scenario, with 2 nodes and 4 GPUs (so world size is 8).
I understand that I need to pass batches of 8 / 8 = 1, because each update will aggregate the gradients from the 8 GPUs. In each worker, the data loader will still load 100 batches, but each with 1 sample. So the whole dataset is processed exactly once per epoch.
I checked and everything seems like that.
But what about the learning rate?
According to the official doc
When a model is trained on M nodes with batch=N, the gradient will be
M times smaller when compared to the same model trained on a single
node with batch=M*N if the loss is summed (NOT averaged as usual)
across instances in a batch (because the gradients between different
nodes are averaged). [...] But in most cases, you can just treat a
DistributedDataParallel wrapped model, a DataParallel wrapped model
and an ordinary model on a single GPU as the same (E.g. using the same
learning rate for equivalent batch size).
I understand that the gradients are averaged, so if the loss is averaged over samples nothing changes, while if it is summed we need to account for that. But does 'nodes' refer to the total number of GPUs across all cluster nodes (world size) or just cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
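The averaging described in the quoted documentation can be checked numerically on a toy problem. The sketch below uses plain NumPy, not PyTorch, and it assumes that gradients are averaged over all 8 processes (the world size); the linear model and the helper names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 8        # 2 nodes x 4 GPUs, as in the example above
per_worker_batch = 1  # 8 / 8 samples per process

# Toy linear model y = x @ w with a squared-error loss.
w = rng.normal(size=3)
x = rng.normal(size=(world_size * per_worker_batch, 3))
y = rng.normal(size=world_size * per_worker_batch)

def grad_mean_loss(xb, yb):
    # Gradient of mean((xb @ w - yb) ** 2) with respect to w.
    return 2.0 * xb.T @ (xb @ w - yb) / len(yb)

def grad_sum_loss(xb, yb):
    # Gradient of sum((xb @ w - yb) ** 2) with respect to w.
    return 2.0 * xb.T @ (xb @ w - yb)

shards_x = np.split(x, world_size)
shards_y = np.split(y, world_size)

# Each simulated worker computes a local gradient; the results are then averaged.
ddp_mean = np.mean([grad_mean_loss(a, b) for a, b in zip(shards_x, shards_y)], axis=0)
ddp_sum = np.mean([grad_sum_loss(a, b) for a, b in zip(shards_x, shards_y)], axis=0)

single_mean = grad_mean_loss(x, y)  # one process, batch of 8, averaged loss
single_sum = grad_sum_loss(x, y)    # one process, batch of 8, summed loss

print(np.allclose(ddp_mean, single_mean))             # averaged loss: same gradient
print(np.allclose(ddp_sum * world_size, single_sum))  # summed loss: world_size times smaller
```

Under that assumption, M in the quote would correspond to the 8 processes rather than the 2 machines.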
Please refer to the following discussion:
https://github.com/PyTorchLightning/pytorch-lightning/discussions/3706
"As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approx. constant.
Since DDP averages the gradients from all the devices, I think the LR should be scaled in proportion to the effective batch size, namely, batch_size * num_accumulated_batches * num_gpus * num_nodes
In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, thus the LR should be scaled by sqrt(2), compared to a single GPU with effective batch size 512."
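The arithmetic in the quote can be written out as a short sketch. The two helper functions are hypothetical names introduced here, and the square-root rule is simply the heuristic from the quoted discussion, not a PyTorch or Lightning API.

```python
import math

def effective_batch_size(batch_size, num_accumulated_batches, num_gpus, num_nodes):
    """Number of samples that contribute to one optimizer step."""
    return batch_size * num_accumulated_batches * num_gpus * num_nodes

def sqrt_lr_scale(new_effective, base_effective):
    """Square-root scaling factor for the learning rate, as in the quote."""
    return math.sqrt(new_effective / base_effective)

base = effective_batch_size(512, 1, 1, 1)  # single GPU
ddp = effective_batch_size(512, 1, 2, 1)   # 2 GPUs on 1 node
scale = sqrt_lr_scale(ddp, base)
print(ddp, scale)  # 1024 and sqrt(2)
```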
Training a DNN model can be a pain when a batch of one image takes 15GB. Speed is not so important for me, but fitting bigger batches (and models) is. So I wonder if there is a PyTorch fork with CUDA Unified Memory, or something like that, to fit giant models (with 16GB of RAM per GPU, yet 250 on the CPU side, it seems quite reasonable)?
If you do not care about the time it takes, but need large batches, you can use a slower approach. Say you need a batch of 128 samples but your GPU memory fits only 8 samples. You can create smaller batches of 8 samples and then average their gradients.
For each small batch of 8 samples that you evaluate, you keep the .grad of each parameter in your CPU memory, so you end up with a list of grads for each of your model's parameters. After you have gathered the grads for 16 batches of 8 samples (128 samples in total), you can average the gradients of each parameter and put the result back into the .grad attribute of each parameter.
You can then call the .step() of your optimizer. This should yield the same result, up to floating-point rounding, as using one large batch of 128 samples, provided the model has no layers that depend on batch statistics (for example batch normalization) and all the small batches have the same size.
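A minimal sketch of this idea in PyTorch is shown below. Note that it swaps in a slightly different technique: instead of keeping a list of gradients in CPU memory, it relies on backward() adding into .grad, and divides each loss by the number of small batches so that the sum becomes an average. The toy model, data and variable names are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model and data, just to make the sketch runnable.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()  # averages over the samples in a batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(128, 10)
targets = torch.randn(128, 1)

# Reference: gradient of one large batch of 128 samples.
optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

micro_batch = 8                               # what fits in GPU memory
accum_steps = inputs.shape[0] // micro_batch  # 16 small batches = 128 samples

optimizer.zero_grad()
for i in range(accum_steps):
    xb = inputs[i * micro_batch:(i + 1) * micro_batch]
    yb = targets[i * micro_batch:(i + 1) * micro_batch]
    # Dividing by accum_steps turns the sum of 16 small-batch means
    # into the mean over all 128 samples.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # backward() adds to .grad instead of overwriting it

accumulated_grad = model.weight.grad.clone()
optimizer.step()  # one update for the whole "virtual" batch of 128

print(torch.allclose(accumulated_grad, full_batch_grad, atol=1e-5))
```

Only one small batch of activations has to be held in memory at a time, which is the point of the approach.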
What is the difference between the batch parameter in the yolov3.cfg file and the batch_size parameter in keras.fit()? How should I set them, please?
There is no difference, batch size means how many images (samples) will be in a mini-batch while training. For yolo, usually in the inference case, the batch_size is 1.
How would you set it?
Go for as high as you can unless you run out of GPU memory.
Batch size is a term used in machine learning and refers to the number of training examples utilized in one iteration.
Traditionally, batch_size is chosen as a power of 2, e.g. 8, 16, 32, 64. Usually, the higher the batch size, the faster the convergence.
Scikit-Learn's RandomForestRegressor has an n_jobs instance attribute, that, from the documentation:
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If
-1, then the number of jobs is set to the number of cores.
Training the Random Forest model with more than one core is obviously more performant than on a single core. But I have noticed that predictions are a lot slower (approximately 10 times slower) - this is probably because I am using .predict() on an observation-by-observation basis.
Therefore, I would like to train the random forest model on, say, 4 cores, but run the prediction on a single core. (The model is pickled and used in a separate process.)
Is it possible to configure the RandomForestRegressor() in this way?
Oh sure you can, I use a similar strategy for stored-models.
Just set <_aRFRegressorModel_>.n_jobs = 1 right after pickle.load(), before calling the .predict() method.
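As a small sketch of that strategy (the toy data and variable names are made up, and pickle.dumps() / pickle.loads() stand in for writing the model to a file and reading it back in another process):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data, only to have a model to store (assumption, not the asker's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Train with 4 parallel jobs.
model = RandomForestRegressor(n_estimators=20, n_jobs=4, random_state=0).fit(X_demo, y_demo)
blob = pickle.dumps(model)   # stands in for pickle.dump() to a file

loaded = pickle.loads(blob)  # stands in for pickle.load() in the other process
loaded.n_jobs = 1            # plain attribute, looked up again when predict() runs
single_prediction = loaded.predict(X_demo[:1])  # one observation at a time
print(single_prediction)
```

The fitted trees are untouched by this change; only the way predict() distributes its work is affected.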
Nota bene:
The amount of work in a .predict() call is pretty "lightweight" compared to .fit(), so if in doubt, ask what the core motivation for tweaking this is. Memory could be the issue: large-scale forests may need to be scanned in n_jobs-many replicas (joblib, by its nature, re-instantiates the whole Python process state in that many full-scale replicas), and the overhead-strict re-formulation of Amdahl's Law shows what a bad idea that is: you pay far more than you finally earn, performance-wise. This is not an issue for .fit(), where the concurrent processes can easily amortize the setup overheads (in my models, ~4:00:00+ hrs of runtime per process). But precisely because of this cost/benefit imbalance, it can be a killer factor for a "lightweight" .predict(), where there is not enough work to mask the process setup and termination costs.
BTW, do you pickle.dump() object(s) from the top-level namespace? I had issues when I did not: the stored object(s) did not reconstruct correctly. (Spent ages on this issue.)
I was confused by this problem for several days...
My question is why the training time differs so massively between setting batch_size to "1" and to "20" for my generator.
If I set the batch_size to be 1, the training time of 1 epoch is approximately 180 ~ 200 sec.
If I set the batch_size to be 20, the training time of 1 epoch is approximately 3000 ~ 3200 sec.
However, this huge difference between the training times seems abnormal, since I would expect the reverse:
batch_size = 1, training time -> 3000 ~ 3200 sec.
batch_size = 20, training time -> 180 ~ 200 sec.
The input to my generator is not a file path, but NumPy arrays which are already loaded into memory by calling "np.load()".
So I don't think I/O is the bottleneck here.
I'm using Keras-2.0.3 and my backend is tensorflow-gpu 1.0.1
I have seen the update of this merged PR,
but it seems that this change won't affect anything at all. (the usage is just the same with original one)
The link here is the gist of my self-defined generator and the part of my fit_generator.
When you use fit_generator, the number of samples processed in each epoch is batch_size * steps_per_epoch. From the Keras documentation for fit_generator: https://keras.io/models/sequential/
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
This is different from the behaviour of 'fit', where increasing batch_size typically speeds up things.
In conclusion, when you increase batch_size with fit_generator, you should decrease steps_per_epoch by the same factor if you want the training time to stay the same or go down.
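To illustrate, here is a small sketch with a made-up in-memory generator (the asker's own generator is not shown, so batch_generator, samples_per_epoch and the array shapes are assumptions). It only counts how many samples one epoch would consume; it does not call Keras.

```python
import math
import numpy as np

def batch_generator(x, y, batch_size):
    """Yield (inputs, targets) batches forever, the way fit_generator consumes them."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]

x_demo = np.zeros((1000, 4))
y_demo = np.zeros(1000)

def samples_per_epoch(batch_size, steps_per_epoch):
    """Count the samples drawn from the generator in one epoch."""
    gen = batch_generator(x_demo, y_demo, batch_size)
    return sum(len(next(gen)[0]) for _ in range(steps_per_epoch))

fixed_steps = 1000
seen_bs1 = samples_per_epoch(1, fixed_steps)     # 1000 samples: one pass over the data
seen_bs20 = samples_per_epoch(20, fixed_steps)   # 20000 samples: twenty passes per "epoch"

steps_bs20 = math.ceil(len(x_demo) / 20)         # 50 steps cover the data once
seen_scaled = samples_per_epoch(20, steps_bs20)  # 1000 samples again
print(seen_bs1, seen_bs20, steps_bs20, seen_scaled)
```

With steps_per_epoch left unchanged, batch_size=20 makes every epoch do 20 times as much work, which is in line with the slowdown in the question.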
Let's clear this up:
Assume you have a dataset with 8000 samples (rows of data) and you choose batch_size = 32 and epochs = 25.
This means that the dataset will be divided into (8000/32) = 250 batches, with 32 samples/rows in each batch. The model weights will be updated after each batch.
One epoch will train on 250 batches, i.e. make 250 updates to the model.
Here steps_per_epoch = number of batches.
With 25 epochs, the model will pass through the whole dataset 25 times.
Ref - https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
You should also take into account the following function parameters when working with fit_generator:
max_queue_size, use_multiprocessing and workers
max_queue_size - might cause more data to be loaded than you actually expect, which, depending on your generator code, may trigger unexpected or unnecessary work that slows down execution.
use_multiprocessing together with workers - might spin up additional processes, which adds serialization and inter-process communication work. First your data is serialized using pickle, then it is sent to the target processes, then the processing is done inside those processes, and then the whole communication procedure repeats backwards: the results are pickled and sent back to the main process. In most cases it should be fast, but if you're processing dozens of gigabytes of data or your generator is implemented in a sub-optimal fashion, you might get the slowdown you describe.
In short:
fit() works faster than fit_generator() since it can access the data directly in memory.
fit() takes NumPy arrays that are already in memory, while fit_generator() pulls data from a generator or a sequence such as keras.utils.Sequence, which is slower.