Tensorflow supports multiple threads/streams on one GPU for training? - python-3.x

UPDATE:
I found the source code of GPUDevice, it hard-coded max streams to 1, may I know the know reason?
GPUDevice(const SessionOptions& options, const string& name,
Bytes memory_limit, const DeviceLocality& locality,
TfGpuId tf_gpu_id, const string& physical_device_desc,
Allocator* gpu_allocator, Allocator* cpu_allocator)
: BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
physical_device_desc, gpu_allocator, cpu_allocator,
false /* sync every op */, 1 / max_streams /) {
if (options.config.has_gpu_options()) {
force_gpu_compatible_ =
options.config.gpu_options().force_gpu_compatible();
}
======================================
I am wondering whether TensorFlow(1.x version) supports multi-thread or multi-stream on a single GPU. If not, I am curious the underlying reasons, TF did this on some purposes or some libs like CUDA prevents TF from providing or some other reasons?
Like some previous posts[1,2], I tried to run multiple training ops in TF, i.e. sees.run([train_op1, train_op2],feed_dict={...}), I used the TF timeline to profile each iteration. However, TF timeline always showed that two train ops run sequentially (although timeline is not accurate[3], the wall time of each op suggests sequential running). I also looked at some source code of TF, it looks like the each op are computed by in device->ComputeAsync() or device->Compute(), and the GPU is blocked when computing an op. If I am correct, one GPU can only run a single op each time, which may lower GPU utilization.
1.Running multiple tensorflow sessions concurrently
2.Run parallel op with different inputs and same placeholder
3.https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-244251867

I have similar experience with you.
I have two GPU, each GPU run three threads, each thread running a session, each session running time fluct a lot.
if run only one thread on each GPU, session running time is quite stable.
from these appearence, we can conclude that ,thread in tensorflow not cowork well,
the mechanism of tensorflow has problem.

Related

Why Doc2vec is slower with multiples cores rather than one?

I'm trying to train multiple "documents" (here mostly log format), and the Doc2Vec is taking longer if I'm specifying more than one core (which i have).
My data looks like this:
print(len(train_corpus))
7930196
print(train_corpus[:5])
[TaggedDocument(words=['port', 'ssh'], tags=[0]),
TaggedDocument(words=['session', 'initialize', 'by', 'client'], tags=[1]),
TaggedDocument(words=['dfs', 'fsnamesystem', 'block', 'namesystem', 'addstoredblock', 'blockmap', 'update', 'be', 'to', 'blk', 'size'], tags=[2]),
TaggedDocument(words=['appl', 'selfupdate', 'component', 'amd', 'microsoft', 'windows', 'kernel', 'none', 'elevation', 'lower', 'version', 'revision', 'holder'], tags=[3]),
TaggedDocument(words=['ramfs', 'tclass', 'blk', 'file'], tags=[4])]
I have 8 cores available:
print(os.cpu_count())
8
I am using gensim 4.1.2, on Centos7. Using this approch (stackoverflow.com/a/37190672/130288), It looks like my BLAS library is OpenBlas, so I setted OPENBLAS_NUM_THREADS=1 on my bashrc (and could be visible from Jupyter, using !echo $OPENBLAS_NUM_THREADS=1 )
This is my test code :
dict_time_workers = dict()
for workers in range(1, 9):
model = Doc2Vec(vector_size=20,
min_count=1,
workers=workers,
epochs=1)
model.build_vocab(train_corpus, update = False)
t1 = time.time()
model.train(train_corpus, epochs=1, total_examples=model.corpus_count)
dict_time_workers[workers] = time.time() - t1
And the variable dict_time_workers is equal too :
{1: 224.23211407661438,
2: 273.408652305603,
3: 313.1667754650116,
4: 331.1840877532959,
5: 433.83785605430603,
6: 545.671571969986,
7: 551.6248495578766,
8: 548.430994272232}
As you can see, the time taking is increasing instead of decreasing. Results seems to be the same with a bigger epochs parameters.
Nothing is running on my Centos7 except this.
If I look at what's happening on my threads using htop, I see that the right number of thread are used for each training. But, the more threads are used, less the percentage of usage is (for example, with only one thread, 95% is used, for 2 they both used around 65% of their max power, for 6 threads are 20-25% ...). I suspected an IO issue, but iotop showed me that nothing bad is happening at the same disk.
The post seems now to be related to this post
Not efficiently to use multi-Core CPU for training Doc2vec with gensim .
When getting no benefit from extra cores like that, it's likely that the BLAS library you've got installed is already configured to try to use all cores for every bulk array operation. That means that other attempts to engage more cores, like Gensim's workers specification, just increase the overhead of contention, when each individual worker thread's individual BLAS callouts also try to use 8 threads.
Depending on the BLAS library in use, its own propensity to use more cores can typically be limited by environment variables named something like OPENBLAS_NUM_THREADS and/or MKL_NUM_THREADS.
If you set these to just 1 before your process launches, you may see different, and possibly better, multithreaded behavior.
Note, though: 1 just restores the assumption that every worker-thread only ever engages a single core. Some other mix of BLAS-cores & Gensim-worker-threads might actually achieve the best training throughput & non-contending core-utilization.
And, at least for Gensim workers, the actual thread count value achieving the best throughput will vary based on other model parameters that influence the relative amount of calculation time in highly-parallelizable code-blocks versus highly-contended blocks, especially window, vector_size, & negative. And, there's not really a shortcut to finding the best workers value except via trial-and-error: observing reported training rates in logs over a few minutes of running. (Though: any rate observed in, say, minutes 2-4 of a abbreviated trial run should be representative of the training rate through the whole corpus over multiple epochs.)
(For any system with at least 4 cores, the optimal value with a classic iterable corpus of TaggedDocuments is usually at least 3, no more than the number of cores, but also rarely more than 8-12 threads, due to other inherent sources of contention due to both Gensim's approach to fanning out the work among worker-threads, and the Python 'GIL'.)
Other thoughts:
the build_vocab() step is never multi-threaded, so benchmarking alternate workers values will give a truer readout of their effect by only timing the train() step
ensuring your iterable corpus does as little redundant work (like say IO & tokenization) on each pass can help limit any bottlenecks around the single manager thread doing each epoch's iteration & batching texts to the workers
the alternate corpus_file approach can achieve higher core utilization, up to any number of cores, by assigning each thread its own exclusive range of an input-file. But, it also means (a) your whole corpus must be in one uncompressed space-tokenized plain-text file; (b) your documents only get a single integer tag (their line-number); (c) you may be subject to some small as-yet-diagnosed-and-fixed bug(s). (See project issue #2747.)
Ok, the best way to fully use the core is to use the parameter corpus_file of doc2vec.
Doing the same bench, the result looks like :
{1: 114.58889961242676, 2: 82.8250150680542, 3: 71.52109575271606, 4: 67.1010684967041, 5: 75.96869373321533, 6: 100.68377351760864, 7: 116.7901406288147, 8: 139.53436756134033}
The thread seems to be useful, in my case 4 are the best.
Still strange that the "regular" doc2vec is not that great at parallelizing

Why does SageMaker PyTorch DDP init times out on SageMaker?

I'm using PyTorch DDP on SageMaker PyTorch Training DLC 1.8.1 The code seems properly DDP-formatted. I'm using instance_count = 2, and launching torch.distributed.launch and I believe the ranks and world size are properly set however the dist.init_process_group waits and times out
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
What could go wrong? machines not networked together?
This is usually something to do with the way local_rank is retrieved and used during initialization. Please refer to the below example and see if you can figure out the difference.
https://github.com/aruncs2005/pytorch-ddp-sagemaker-example
The torch.distributed.launch is the helper utility within the torch.distributed package which can be used to launch multiple processes per node for distributed training. It tells all workers which IP address is of rank 0 which is set by MASTER_ADDR,
Each rank needs to be able to communicate to the MASTER_ADDR on the port MASTER_PORT. If those are set but the workers cannot reach the MASTER_ADDR, then it can be the root cause of hang and timeoutfor the job.
Besides, it will also wait until all nodes report in from --nodes defined in the launch as well.

Python 3.8 RAM owerflow and loading issues

First, I want to mention, that this is our first project in a bigger scale and therefore we don't know everything but we learn fast.
We developed a code for image recognition. We tried it with a raspberry pi 4b but quickly faced that this is way to slow overall. Currently we are using a NVIDIA Jetson Nano. The first recognition was ok (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model will be loaded for the first time. Via an API the image recognition can be triggered and the meta data from the AI model will be the response. We use fast-API for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the method above before starting my thread. Indeed this works but another challenge occurs at the same time. After setting the start method to 'spawn' the ERROR disappears but the Jetson starts to allocate way to much memory.
Because of the overhead and preloaded CNN model, the RAM is around 2.5Gig before the thread starts. After the start it doesn’t stop allocating RAM, it consumes all 4Gig of the RAM and also the whole 6Gig Swap. Right after this, the whole API process kill with this error: "cannot allocate memory" which is obvious.
I managed to fix that as well just by loading the CNN Model in the classification function. (Not preloading it on the GPU as in the two cases before). However, here I got problem as well. The process of loading the model to the GPU takes around 15s - 20s and this every time the recognition starts. This is not suitable for us and we are wondering why we cannot pre-load the model without killing the whole thing after two image-recognitions. Our goal is to be under 5 sec with this.
#clasify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function than no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
#do some classification with the net
pass
if __name__ == '__main__':
mp.set_start_method('spawn') #if commented out the first error ocours
manager = mp.Manager()
return_dict = manager.dict()
p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
p.start()
p.join()
print(return_dict.values())
Any help here will be much appreciated. Thank you.

pytorch loss.backward() keeps running for hours

I am using pytorch to train some x-ray images but I ran into the following issue:
in the line : loss.backward(), the program just keeps running and never end, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
print("running loss backwarding!")
loss.backward()
print("loss is backwarded!")
if (itr + 1 ) % self.accumulation_steps == 0:
self.optimizer.step()
self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backwards pass is not running forever, but just takes very very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model is too big to fit into your main memory, forcing the operation system to use your hard disk, which would slow down execution by several additional magnitudes. If this is the case I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data
Check if the problem persists with a different PyTorch version

scikit learn unwanted parallel processing

I have a problem with nested multiprocessing witch starts when I use scikit-learn (v. 0.22) Quadratic Discriminant Analysis. Necessary is system configuration: 24 thread Xeon machine running fedora 30.
I run consecutively qda on the randomly selected subset of predictors:
def process(X,y,n_features,i=1):
comb = np.random.choice(range(X.shape[1]),n_features,replace=False)
qda = QDA(tol=1e-8)
qda.fit(X[:,comb],y)
y_pred = qda.predict(X[:,comb])
return (accuracy_score(y,y_pred),comb,i)
where n_features is number of features randomly selected from the full set of possible predictors, X,y explanatory and depended variables.
When n_features is 18 or less process works in single-thread mode, which means that I can use any tool to parallel processing (I use ray). When n_features is 19, and above for unknown reason it (not me) starts all available threads, and entire calculation get more time even in comparison to a single thread.
tmp = [process(X,y,n_features,i=1) for _ in range(1000)]
Based on my previous experiences with other Linux libraries (R gstat precisely) the same situation (uncontrolled multithreading mode) was caused by Linux implementation of blas, but here it could not be the case. In general, the question is: what starts this multithreading and how to control it even if LDA/QDA hasn't n_jobs parameter to avoid nested multiprocessing.
QDA in scikit-learn does not expose n_jobs meaning that you cannot set anything. However, it could be due to numpy which does not restrict the number of threads.
The solution to limit the number of threads are:
set the environment variable OMP_NUM_THREADS, MKL_NUM_THREADS, or OPENBLAS_NUM_THREADS to be sure that you will limit the number of threads;
you can use threadpoolctl which provides a context manager to set the number of threads.

Resources