Training multiple Sequential models in a row slows down - python-3.x

I am using Keras/TensorFlow (GPU) to create a time series forecasting model. I have hundreds of time series and want to train a network for each of them.
Running a few time series in a row is fine, but once I run hundreds or thousands of them, the training time of each model slowly (but surely) increases. Is there a simple reason for this?
Below is code to reproduce the issue (note that it could take a while to run).
https://gist.github.com/mannsi/c5666c4b786c35c3443beea6d13a32fe
On my machine the first iteration takes 10s, iteration #250 takes 16s and iteration #500 takes 25s.
I am new to Neural Networks and Keras/TF so maybe this is totally normal but I did not factor this in when doing my back-of-the-envelope time calculations.
System info:
python 3.5
keras 1.2.2
tensorflow-gpu 1.0.0
EDIT: I tested the same code on a TensorFlow CPU backend and I see the exact same behavior there.

It's possible that overhead is building up in the computation graph with each iteration. Use the Keras backend function K.clear_session() to reset the underlying TensorFlow session between runs.
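For example, a minimal sketch of that pattern (all_series, n_features and the X/y attributes are placeholders, not taken from the linked gist):

from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

for series in all_series:
    K.clear_session()  # drop the previous graph so state cannot accumulate across iterations
    model = Sequential()
    model.add(Dense(32, input_dim=n_features, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    model.fit(series.X, series.y, nb_epoch=10, verbose=0)  # nb_epoch is the Keras 1.x name for epochs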

Could it be that your GPU warms up and is therefore throttled (clock speed reduced) to lower the temperature?
How long does the first iteration take if you relaunch the script after having run many iterations?

Since your model parameters don't change, you only need to build and compile the model once; then you can fit it inside a loop.
You are instantiating and compiling a new model in every iteration of the loop, which is why memory consumption grows continuously.
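A rough sketch of that approach (assuming you still want an independently trained network per series, so the initial weights are restored before each fit; all_series, n_features and the X/y attributes are placeholders):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_dim=n_features, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

initial_weights = model.get_weights()  # remember the freshly initialised weights

for series in all_series:
    model.set_weights(initial_weights)  # start each series from scratch instead of continuing training
    model.fit(series.X, series.y, nb_epoch=10, verbose=0)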

Related

PyTorch Lightning training stalling at the beginning of fourth batch

I am having an odd problem in PyTorch Lightning, which I'm using for finetuning a language model on a GPU. The first three training batches run very quickly (<1 second), then the fourth goes on for hours without finishing, and eventually I cancel the job. This is true whether I use batches of size 2 or 16.
Using the callbacks on_train_batch_start and on_train_batch_end to print 'batch started' and 'batch ended', I know that the first three batches have all completed, and the fourth doesn't reach the on_train_batch_start callback. This leads me to believe that the problem is somewhere in the DataLoader, since on_train_batch_start appears to be the first hook in the training loop, according to PyTorch Lightning's pseudocode.
I placed some print statements in my custom collate_fn for the DataLoader, and they all printed as well, so it appears that the problem arises sometime after collation.
Does anyone have any idea what the issue could be or how I can interrogate the code further?
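For reference, the debugging callback described above looks roughly like this (a sketch only; the exact hook signatures vary slightly between Lightning versions, so extra arguments are accepted generically):

from pytorch_lightning.callbacks import Callback

class BatchLogger(Callback):
    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs):
        print("batch started")

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        print("batch ended")

# passed to the trainer, e.g. Trainer(callbacks=[BatchLogger()])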

Using multiprocessing with AllenNLP decoding is sluggish compared to non-multiprocessing case

I'm using the AllenNLP (version 2.6) semantic role labeling model to process a large pile of sentences. My Python version is 3.7.9. I'm on MacOS 11.6.1. My goal is to use multiprocessing.Pool to parallelize the work, but the calls via the pool are taking longer than they do in the parent process, sometimes substantially so.
In the parent process, I have explicitly placed the model in shared memory as follows:
from allennlp.predictors import Predictor
from allennlp.models.archival import load_archive
import allennlp_models.structured_prediction.predictors.srl
PREDICTOR_PATH = "...<srl model path>..."
archive = load_archive(PREDICTOR_PATH)
archive.model.share_memory()
PREDICTOR = Predictor.from_archive(archive)
I know the model is only being loaded once, in the parent process. And I place the model in shared memory whether or not I'm going to make use of the pool. I'm using torch.multiprocessing, as many recommend, and I'm using the spawn start method.
I'm calling the predictor in the pool using Pool.apply_async, and I'm timing the calls within the child processes. I know that the pool is using the available CPUs (I have six cores), and I'm nowhere near running out of physical memory, so there's no reason for the child processes to be swapped to disk.
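For context, the call pattern being described is roughly the following (a sketch only; sentences is a placeholder list, and exactly how PREDICTOR becomes visible inside the spawned workers depends on how the module is structured):

import time
import torch.multiprocessing as mp

def predict_one(sentence):
    # PREDICTOR is the module-level predictor created above; timing is done inside the child process.
    start = time.time()
    result = PREDICTOR.predict(sentence=sentence)
    return result, time.time() - start

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(processes=4) as pool:
        async_results = [pool.apply_async(predict_one, (s,)) for s in sentences]
        outputs = [r.get() for r in async_results]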
Here's what happens, for a batch of 395 sentences:
Without multiprocessing: 638 total processing seconds (and elapsed time).
With a 4-process pool: 293 seconds elapsed time, 915 total processing seconds.
With a 12-process pool: 263 seconds elapsed time, 2024 total processing seconds.
The more processes, the worse the total AllenNLP processing time - even though the model is explicitly in shared memory, and the only thing that crosses the process boundary during the invocation is the input text and the output JSON.
I've done some profiling, and the first thing that leaps out at me is that the function torch._C._nn.linear is taking significantly longer in the multiprocessing cases. This function takes two tensors as arguments - but there are no tensors being passed across the process boundary, and I'm decoding, not training, so the model should be entirely read-only. It seems like it has to be a problem with locking or competition for the shared model resource, but I don't understand at all why that would be the case. And I'm not a torch programmer, so my understanding of what's happening is limited.
Any pointers or suggestions would be appreciated.
It turns out that I wasn't comparing quite the right things. This thread goes into all the detail: https://github.com/allenai/allennlp/discussions/5471. Briefly, because PyTorch can use additional resources under the hood, my baseline test without multiprocessing wasn't taxing my computer enough when running two instances in parallel; I had to run four instances to see the penalty, and in that case the total processing time was essentially the same for four parallel non-multiprocessing invocations as for one multiprocessing run with four subprocesses.

Run TensorFlow in a for loop

I have a for loop in my code, and in each iteration I augment some processed data and train my TF model again. After a while each iteration takes longer than expected. I suspect CPU usage is the issue, since I am running on multiple cores. How can I fix that?
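If the slowdown comes from graph state accumulating across iterations, the clear-session pattern suggested for the question at the top of this page may apply here as well; a minimal sketch, assuming tf.keras and placeholder build_model/augment_data helpers:

import tensorflow as tf

for i in range(n_iterations):
    tf.keras.backend.clear_session()  # drop state left over from the previous iteration
    model = build_model()             # placeholder: your model-construction code
    x, y = augment_data(i)            # placeholder: your data-augmentation step
    model.fit(x, y, epochs=5, verbose=0)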

effect of increase worker thread in gensim word2vec

I'm trying to train a gensim SGNS model, and in the process I measure the loss, which I calculate as
loss = model.running_training_loss / model.corpus_count
However, I noticed that if I change the number of worker threads I get different losses, keeping all other parameters the same. In particular, with a single worker thread I get a really high loss, and as I increase the number of threads the loss gets smaller. For instance:
workers    loss
1          20.40519721
10         2.714875407
16         1.239528453
Up through gensim 3.5.0, the loss value reported may not be very sensible, as the tally is only reset on each call to train(), rather than at each internal epoch. There are some fixes forthcoming in this pull request:
https://github.com/RaRe-Technologies/gensim/pull/2135
What version of gensim are you using, and what is your code doing to collect the loss data?
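For reference, one common way to collect the loss (a sketch, not the asker's code; sentences is a placeholder corpus) is to enable compute_loss and read the tally from an epoch callback:

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    def __init__(self):
        self.losses = []

    def on_epoch_end(self, model):
        # In the gensim versions discussed above this is a running tally for the whole
        # train() call, not a per-epoch figure.
        self.losses.append(model.get_latest_training_loss())

logger = LossLogger()
model = Word2Vec(sentences, sg=1, negative=5, workers=4,
                 compute_loss=True, callbacks=[logger])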

Parallelization slows down execution in MatLab

I am using MATLAB on Mac OS X, on a Pentium processor with 4 real cores.
I want to analyse magnetic resonance images (MRI) and fit the signal from these images using optimisation. For every pixel I have 35 values (i.e. the same image acquired 35 times under different conditions), and I want to fit these values to some function.
Below, I have stripped my code down to the very basic loop that calls the fitting function:
ticid1 = tic;
for x = a:1:b
    [p1, p2, p3, p4] = FitSignal(Volume(y,x,:));  % outputs renamed so they do not overwrite the loop bounds a and b
end
toc(ticid1);
Here Volume is a 3D matrix holding all the MRI images, about 9 MB in size. FitSignal thus gets an array holding the 35 values for a specific pixel, and the optimisation finds the best fit. In this case the loop runs 120 times (b - a = 120), i.e. once for every pixel on one horizontal line in the image.
Timing the above code using tic and toc, the entire loop takes about 50 seconds.
I thought executing the code in parallel may provide some speed up. So I opened 3 workers and ran the loop with parfor but found only marginal (20-30%) speedup.
Then I reduced the number of workers to 1. Now running the code with parfor took about 90 seconds, so with 1 worker the code is approximately twice as slow as when running without parallelization. This is consistent with the small benefit seen with 3 workers.
I then tried timing inside the function FitSignal and found that without parallelization it takes approximately 0.4 seconds, while with parallelization it takes 0.7 seconds.
I understand that parallelization comes with overhead, but in this case it seems excessive to me. Besides, once inside the function FitSignal, and when there is only one worker, it should not matter whether the function runs on the main process or within a worker, right? However, running inside a sole worker, the function runs considerably slower!
Can anyone tell me what is wrong? And, importantly, how can I change the code to take advantage of any possible speedup from parallel execution?
Thanks in advance.
PS: I have checked my system. Memory pressure is low; I even issued "purge" in the terminal to free memory. CPU usage does not exceed 15% during the run.
When running on a single machine, MATLAB automatically parallelises vector operations (1)... except when you are running explicit parallelisation, like parfor (2).
So, based on your numbers, what is happening here is that when you run in normal (non-parfor) mode you are getting a 100% speedup from parallelised vector operations.
When you run in parfor mode you lose the vector-operations boost but gain the parallelisation from parfor, so each iteration runs at half the speed of normal processing, but the work is split over three cores and therefore takes about two thirds of the time.
The above is a rough estimate based on the numbers in the question; naturally for other problems these relative speedups will vary due to a number of factors, such as differing amounts of vectorized code and overheads of parfor.
