Multithreaded backpropagation - multithreading

I have written a back propagation class in VB.NET -it works well- and I'm using it in a C# artificial intelligence project.
But I have a AMD Phenom X3 at home and a Intel i5 at school. and my neural network is not multi-threaded.
How to convert that back propagation class to a multithreaded algorithm? or how to use GPGPU programming in it? or should I use any third party libraries that have a multithreaded back propagation neural network?

JeffHeaton has recommend that you use resilient propagation (RPROP) instead of backpropagation. There are examples on how to do multithreaded RPROP (MPROP):
Article on C# multithreaded backpropagation (from Jeff heaton)
Chapter 7.2.1- "Propagation and Multithreading" (p.94 of Introduction to Encog 2.5 for C#)
It's a difficult to discuss all of the details here, so I would recommend that you either read that article and take a look at the relevant chapters of the book I referenced. This, of course, is assuming you're familiar with concurrent programming.
Update:
Resilient propagation will typically outperform backpropagation by a
considerable factor. Additionally, RPROP has no parameters that must
be set. Backpropagation requires that a learning rate and momentum
value be specified. Finding an optimal learning rate and momentum
value for backpropagation can be difficult. This is not necessary with
resilient propagation.
(source: Encog Machine Learning)

I've tried implementing multiple threads for RPROP batch processing, but it seemed it was always slower than using a single thread. I've tried to implement separately at the loop level "#pragma omp parallel" and by calculating the errors, gradients and weights in separate threads. In my interpretation, it seems that the computing done in each thread is too small to outcome the computing done in switching the threads and synchronizing the results (mutex.) I'm wondering if I've done something wrong? My conclusion is that would be smarter to run RPROP single-threaded, while processing multiple neuronal networks at the same time in separate threads. Most of the implementations usually imply multiple interconnected NNs so that would make sense.

Related

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering if PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not utilizing a GPU. I appreciate if you could inform me of how fast these methods are and whether I need to take any actions to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (openMP, if I'm not wrong) to parallelize such operations with multiple cores. Some performance loss comes from the Python itself - since this is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on matrix shapes, a multiplication ABCD may have different performance computed as A(B(CD)) than if computed as (AB)(CD), etc.

How does pytorch's parallel method and distributed method work?

I'm not an expert in distributed system and CUDA. But there is one really interesting feature that PyTorch support which is nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import torch.nn as nn
from torch.autograd.variable import Variable
import numpy as np
class Model(nn.Module):
def __init__(self):
super().__init__(
embedding=nn.Embedding(1000, 10),
rnn=nn.Linear(10, 10),
)
def forward(self, x):
x = self.embedding(x)
x = self.rnn(x)
return x
model = nn.DataParallel(Model())
model.forward(Variable.from_numpy(np.array([1,2,3,4,5,6], dtype=np.int64)).cuda()).cpu()
PyTorch can split the input and send them to many GPUs and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here . Note that his paradigm is not recommended today as it bottlenecks at the master GPU and not efficient in data transfer.
This container parallelizes the application of the given :attr:module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
As of DistributedDataParallel, thats more tricky. This is currently the more advanced approach and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
There are several approaches towards how to average the gradients from each node. I would recommend this paper to get a real sense how things work. Generally speaking, there is a trade-off between transferring the data from one GPU to another, regarding bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pairs of GPUs with a really fast protocol in a circle, and to pass only part of gradients from one to another, s.t. in total, we transfer less data, more efficiently, and all the nodes get all the gradients (or their average at least). There will still be a master GPU in that situation, or at least a process, but now there is no bottleneck on any GPU, they all share the same amount of data (up to...).
Now this can be further optimized if we don't wait for all the batches to finish compute and start do a time-sharing thing where each node sends his portion when he's ready. Don't take me on the details, but it turns out that if we don't wait for everything to end, and do the averaging as soon as we can, it might also speed up the gradient averaging.
Please refer to literature for more information about that area as it is still developing (as of today).
PS 1: Usually these distributed training work better on machines that are set for that task, e.g. AWS deep learning instances that implement those protocols in HW.
PS 2: Disclaimer: I really don't know what protocol PyTorch devs chose to implement and what is chosen according to what. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend for you to do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approach to ml parallelism with Pytorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?

A3C in Tensorflow - Should I use threading or the distributed Tensorflow API

I want to implement the Asynchronous Advantage Actor Critic (A3C) model for reinforcement learning in my local machine (1 CPU, 1 cuda compatible GPU). In this algorithm, several "learner" networks interact with copies of an environment and update a central model periodically.
I've seen implementations that create n "worker" networks and one "global" network inside the same graph and use threading to run these. In these approaches, the global net is updated by applying gradients to the trainable parameters with a "global" scope.
However, I recently read a bit about distributed tensorflow and now I'm a bit confused. Would it be easier/faster/better to implement this using the distributed tensorflow API? In the documentation and talks they always make expicit mention of using it in multi-device environments. I don't know if it's an overkill to use it in a local async algorithm.
I would also like to ask, is there a way to batch the gradients calculated by every worker to be applied together after n steps?
After implementing both, in the end I found using threading simpler than the distributed tensorflow API, however it also runs slower. The more CPU cores you use, the faster distributed tensorflow becomes compared to threads.
However this only holds for asynchronous training. If the available CPU cores are limited and you want to make use of a GPU, you might want to use synchronous training with multiple workers instead, like OpenAI does in their A2C implementation. There only the environments are parallelized (through multiprocessing) and tensorflow uses the GPU without any graph parallelization. OpenAI reported that their results were better with synchronous training than with A3C.
edit:
Here are some more details:
The problem with distributed tensorflow for A3C is that you need to call multiple tensorflow forward passes (to get the actions during the n steps) before you call the learning step. However since you learn asynchronously your network will change during the n steps by the other workers. So your policy will change during the n steps and the learning step will happen with wrong weights. Distributed tensorflow will not prevent that. Therefore you need a global and a local network in distributed tensorflow as well, making the implementation not easier than an implementation with threading (and for threading you don't have to learn how to make distributed tensorflow work). Runtime wise, on 8 CPU cores or less there will be no large difference.

Multi threaded AForge.NET training

I am using AForge.NET ANN and training it on my training set. Because the training is single threaded and the process can take ages, I wondered if it's possible to run a multi threaded training.
Because it is a problem to use threads while training a Resilient Backpropagation network I thought about splitting my training set between different networks and once every N epoch's, combine the weights of all networks in to one, Then, duplicate it to all threads (so the next epoch will start with the new weights).
I can't seem to find a method in the AForge.NET that combines two (or more) networks. Looking for some help on how to get started with the implementation process.
Combining the neural networks every N number of iterations won't work really well. It can be very tricky to just take the weights and combine them. In some ways this is how the crossover operation of a Genetic Algorithm works.
Really the only way you are going to be able to do this is modify AForge's training to support multiple threads. Basically to do this you need to map the gradient calculation and then do a reduce-sum on the gradients. Then use the reduced gradients to update the network.
I've implemented this exact thing in the Encog Framework, it supports multi-threaded (RPROP), and has a C# version. http://www.heatonresearch.com/encog.

Lightweight unsupervised learning method like Self organizing maps

I am trying to develop a lightweight system that uses an unsupervised learning method that uses system parameters such as CPU, RAM utilization to train an anomaly detection system. I could not think of anything beyond a Self organizing map. Is there any other learning technique that I can consider here?
You don't have many options on this with SOM. The only think you could consider is whether you will do batch or sequential training, if of course the implementation that you will use offers both options. But this option mainly affects the training time (the first is much more quicker) and not the resulting map (in theory at least).
You could also select a distance function other than the Euclidian but the vast percentage of the bibliography doesn't bother with this.

Resources