How to use DistributedDataParallel() and init_process_group() on an HPC? - pytorch

I will be using an HPC cluster for my research, and I don't know much about parallel or distributed computing.
I really don't understand DistributedDataParallel() in PyTorch, especially init_process_group().
What does it mean to initialize a process group? And what is
init_method : URL specifying how to initialize the package.
For example (I found these in the documentation):
'tcp://10.1.1.20:23456' or 'file:///mnt/nfs/sharedfile'
What are those URLs?
What is the rank of the current process?
Is world_size the number of GPUs?
I would really appreciate it if someone could explain what DistributedDataParallel() and init_process_group() are and how to use them, because I don't know parallel or distributed computing.
I will be using things like Slurm (sbatch) on the HPC.
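For reference, this is the rough pattern I have pieced together from examples (assuming the env:// init method and the usual Slurm environment variables such as SLURM_PROCID and SLURM_NTASKS; I am not sure it is correct):

```python
# Rough sketch pieced together from PyTorch tutorials; the Slurm variable
# names (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID) are assumptions based
# on a typical sbatch job, and the model is just a placeholder.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each Slurm task becomes one process in the group.
    rank = int(os.environ["SLURM_PROCID"])        # this process's id: 0 .. world_size-1
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes, not GPUs per se
    local_rank = int(os.environ["SLURM_LOCALID"]) # which GPU on this node to use

    # With init_method="env://", MASTER_ADDR and MASTER_PORT must be set
    # (e.g. exported in the sbatch script) so every process can find rank 0.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)

    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... training loop using ddp_model ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```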

Related

Is there a way to use fastText's word representation process in parallel?

I am new to fastText, a library for efficient learning of word representations and sentence classification. I am trying to generate word vectors for a huge data set, but in a single process it takes a very long time.
So let me put my questions clearly:
Are there any options I can use to speed up a single fastText process?
Is there any way to generate word vectors with parallel fastText processes?
Are there any other implementations or workarounds available that can solve the problem? I read that a Caffe2 implementation is available, but I am unable to find it.
Thanks
I understand your question to be that you would like to distribute fastText and do parallel training.
As mentioned in Issue #144
... a future feature we might consider implementing. For now it's not on our list of priorities, but it might very well soon.
Apart from the Word2Vec Spark implementation also mentioned there, I am not aware of any other implementations.
The original FastText release by Facebook includes a command-line option thread, default 12, which controls the number of worker threads which will do parallel training (on a single machine). If you have more CPU cores, and haven't yet tried increasing it, try that.
The gensim implementation (as gensim.models.fasttext.FastText) includes an initialization parameter, workers, which controls the number of worker threads. If you haven't yet tried increasing it, up to the number of cores, it may help. However, due to extra multithreading bottlenecks in its Python implementation, if you have a lot of cores (especially 16+), you might find maximum throughput with fewer workers than cores – often something in the 4-12 range. (You have to experiment & watch the achieved rates via logging to find the optimal value, and all cores won't be maxed.)
You'll only get significant multithreading in gensim if your installation is able to make use of its Cython-optimized routines. If you watch the logging when you install gensim via pip or similar, there should be a clear error if this fails. Or, if you are watching logs/output when loading/using gensim classes, there will usually be a warning if the slower non-optimized versions are being used.
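For illustration, a minimal sketch of what that looks like with gensim's FastText (the corpus path and parameter values below are placeholders, not recommendations):

```python
# Minimal sketch of multithreaded FastText training in gensim; the corpus
# file name and hyperparameter values are placeholders to illustrate `workers`.
from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence
import logging

logging.basicConfig(level=logging.INFO)  # watch throughput rates and optimization warnings

corpus = LineSentence("corpus.txt")      # one pre-tokenized sentence per line

model = FastText(
    sentences=corpus,
    vector_size=100,   # named `size` in gensim versions before 4.0
    workers=8,         # worker threads; experiment in the 4-12 range on many-core machines
    epochs=5,
)
model.save("fasttext.model")
```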
Finally, often in the ways people use gensim, the bottleneck can be in their corpus iterator or IO rather than the parallelism. To minimize this slowdown:
Check to see how fast your corpus can iterate over all examples, separate from passing it to the gensim class (see the sketch after this list).
Avoid doing any database-selects or complicated/regex preprocessing/tokenization in the iterator – do it once, and save the easy-to-read-as-tokens resulting corpus somewhere.
If the corpus is coming from a network volume, test if streaming it from a local volume helps. If coming from a spinning HD, try an SSD.
If the corpus can be made to fit in RAM, perhaps on a special-purpose giant-RAM machine, try that.
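For the first point above, a quick way to check the raw iteration speed, separate from any training, is something like this sketch (the MyCorpus class and file name are hypothetical stand-ins for whatever iterator you pass to gensim):

```python
# Hypothetical timing harness: measures how fast the corpus can be iterated
# on its own, without any gensim training attached.
import time

class MyCorpus:
    """Stand-in for your corpus iterator: one tokenized sentence per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()   # keep this cheap: no regexes or DB lookups here

start = time.time()
n_sentences = sum(1 for _ in MyCorpus("corpus.txt"))
print(f"iterated {n_sentences} sentences in {time.time() - start:.1f}s")
```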

Clojure: Create and manage multiple threads

I wrote a program that needs to process a very large dataset, and I'm planning to run it with multiple threads on a high-end machine.
I'm a beginner in Clojure and I'm lost in the myriad of tools at my disposal -
agents, futures, core.async (and Quartzite?). I would like to know which one is most suited for this job.
The following describes my situation:
I have a function that transforms some data and stores it in a database.
The argument to said function is popped from a Redis set.
I want to run the function in several separate threads as long as there is a value in the Redis set.
For simplicity, futures can't be beat. They create a new thread, and return a value from it. However, often you need more fine-grained control than they provide.
The core.async library has nice support for parallelism (via pipeline, see below), and it also provides automatic back-pressure. You have to have a way to control the flow of data such that no one's starving for work, or burdened by too much of it. core.async channels must be bounded, and this helps with this problem. Also, it's a pretty logical model of your problem: taking a value from a source, transforming it (maybe using a transducer?) with some given parallelism, and then putting the result to your database.
You can also go the manual route of using Java's excellent j.u.concurrent library. There are low level primitives as well as thread management tools for thread pools. All of this is accessible within clojure.
From a design standpoint, it comes down to whether you are more CPU-bound or I/O-bound. This affects decisions such as whether or not you will perform parallel reads from redis and writes to your database. If you are CPU-bound and thus your bottleneck is the computation, then it wouldn't make much sense to parallelize your reads from redis, or your writes to your database, would it? These are the types of things to consider.
You really have two problems to solve: (1) your familiarity with clojure's/java's concurrency mechanisms, and (2) your approach to this problem (i.e., how would you approach this problem, irrespective of the language you're using?). Once you solve #2, you will have a much better idea of which tools to use that I mentioned above, and how to use them.
Sounds like you may have a good embarrassingly parallel problem to solve. In that case, you could start simply by coding up your processing into a top-level function that processes the first datum. Once that's working, wrap it in a map to handle all of the data sequentially (serially, one at a time).
You might want to start tackling the bigger problem with just a few items from your data set. That will make your testing smoother and faster.
After you have the map working, it's time to just add a p (parallel) to your code to make it a pmap. This is a very rewarding way to heat up your machine.
Here is a discussion about the number of threads pmap uses.
The above is the simplest approach. If you need finer control over the concurrency, this concurrency screencast explores the use cases.
It is hard to be precise w/o knowing the details of your problem. There are several choices as you mention:
Plain Java threads & threadpools. If your problem is similar to a pre-existing Java solution, this may be the most straightforward.
Simple Clojure threading with future et al. Kicking off a thread with future and getting the result in a promise is very easy.
Replace map with pmap (parallel map). This can help in simple cases that are primarily map/reduce oriented.
The Claypoole library: Lots of tools to make multithreading simpler and easier. Please see their GitHub project and the Clojure/West talk.

MC-Stan on Spark?

I hope to use MC-Stan on Spark, but it seems there is no related page to be found via Google.
I wonder whether this approach is even possible on Spark, so I would appreciate it if someone could let me know.
Moreover, I also wonder what the widely-used approach is for running MCMC on Spark. I hear Scala is widely used, but I need a language that has a decent MCMC library, such as MC-Stan.
Yes it's certainly possible but requires a bit more work. Stan (and popular MCMC tools that I know of) are not designed to be run in a distributed setting, via Spark or otherwise. In general, distributed MCMC is an area of active research. For a recent review, I'd recommend section 4 of Patterns of Scalable Bayesian Inference (PoFSBI). There are multiple possible ways you might want to split up a big MCMC computation but I think one of the more straightforward ways would be splitting up the data and running an off-the-shelf tool like Stan, with the same model, on each partition. Each model will produce a subposterior which can be reduce'd together to form a posterior. PoFSBI discusses several ways of combining such subposteriors.
I've put together a very rough proof of concept using pyspark and pystan (python is the common language with the most Stan and Spark support). It's a rough and limited implementation of the weighted-average consensus algorithm in PoFSBI, running on the tiny 8-schools dataset. I don't think this example would be practically very useful but it should provide some idea of what might be necessary to run Stan as a Spark program: partition data, run stan on each partition, combine the subposteriors.
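A stripped-down sketch of that partition-then-combine pattern looks roughly like this (assuming PyStan 2's StanModel/sampling interface and an existing SparkContext sc; the toy model, data splitting, and simple inverse-variance consensus weighting are illustrative, not the actual proof-of-concept code):

```python
# Illustrative sketch: run Stan independently on each data partition, then
# combine the subposterior draws with inverse-variance (consensus) weights.
# Assumes PyStan 2.x (pystan.StanModel) and an already-created SparkContext `sc`.
import numpy as np
import pystan

MODEL_CODE = """
data { int<lower=0> N; vector[N] y; }
parameters { real mu; }
model { y ~ normal(mu, 1); }
"""

def run_stan_on_partition(y_chunk):
    """Fit the same model on one shard of the data and return draws of mu."""
    sm = pystan.StanModel(model_code=MODEL_CODE)
    fit = sm.sampling(data={"N": len(y_chunk), "y": list(y_chunk)},
                      iter=2000, chains=2)
    return fit.extract()["mu"]

# Split the full dataset into shards, one Stan run per shard.
y = np.random.normal(3.0, 1.0, size=1000)        # toy data
shards = np.array_split(y, 4)

subposteriors = (sc.parallelize(shards, len(shards))
                   .map(run_stan_on_partition)
                   .collect())

# Weighted-average consensus: weight each subposterior by its precision.
weights = np.array([1.0 / np.var(draws) for draws in subposteriors])
weights /= weights.sum()
combined_mean = sum(w * np.mean(d) for w, d in zip(weights, subposteriors))
print("consensus posterior mean of mu:", combined_mean)
```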

Can node.js do parallel computations

I know node.js can do parallel IO, but can it, by itself or via a plugin, do parallel computations on multicore processors? For example, if I want to do large matrix multiplications.
I have seen in older questions here on Stack Overflow that Node is working on such a feature. Does anyone know how far along it is?
The official cluster module is what you are looking for:
http://nodejs.org/api/all.html#all_cluster

Threading paradigm?

Are there any paradigms that give you a different mindset or a different take on writing multithreaded applications? Perhaps something that feels vastly different, the way functional programming feels compared to procedural programming.
Concurrency has many different models for different problems. The Wikipedia page for concurrency lists a few models, and there's also a page for concurrency patterns that offers some good starting points for different ways to approach concurrency.
The approach you take is very dependent on the problem at hand. Different models solve various different issues that can arise in concurrent applications, and some build on others.
In class I was taught that concurrency uses mutual exclusion and synchronization together to solve concurrency issues. Some solutions only require one, but with both you should be able to solve any concurrency issue.
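To make those two ideas concrete, here is a small sketch (a hypothetical Python example, not something from that class) where a lock provides the mutual exclusion and a condition variable provides the synchronization between a producer and a consumer:

```python
# Hypothetical illustration: mutual exclusion (the lock inside the Condition)
# plus synchronization (wait/notify) coordinating a producer and a consumer.
import threading
from collections import deque

queue = deque()
condition = threading.Condition()   # wraps a lock plus a wait/notify mechanism

def producer():
    for i in range(5):
        with condition:             # mutual exclusion: only one thread touches the queue
            queue.append(i)
            condition.notify()      # synchronization: wake a waiting consumer

def consumer():
    consumed = 0
    while consumed < 5:
        with condition:
            while not queue:
                condition.wait()    # release the lock and sleep until notified
            item = queue.popleft()
        print("consumed", item)
        consumed += 1

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```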
For a vastly different concept you could look at immutability and concurrency. If all data is immutable then the conventional approaches to concurrency aren't even required. This article explores that topic.
I don't really understand the question, but if you start doing some coding with CUDA, it will give you a different way of thinking about multithreaded applications.
It differs from general multithreading techniques, like semaphores, monitors, etc., because you have thousands of threads running concurrently. So the problem of parallelism in CUDA lies more in partitioning your data and merging the chunks of data later.
Just a small example of a complete rethinking of a common serial problem is the SCAN algorithm. It is as simple as:
Given a SET {a,b,c,d,e}
I want the following set:
{a, a+b, a+b+c, a+b+c+d, a+b+c+d+e}
where the symbol '+' is any associative binary operator (not only addition; you can use multiplication as well).
How do you do this in parallel? It requires a complete rethink of the problem; it is described in this paper.
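Just to pin down what that output is, a serial version of the inclusive scan is a one-liner in Python (the parallel CUDA formulation in the paper computes the same result, just restructured into log-depth passes over the data):

```python
# Serial inclusive scan (prefix sum/product) for reference; the parallel
# version reorganizes this same computation so thousands of threads can help.
from itertools import accumulate
import operator

data = [1, 2, 3, 4, 5]                       # stands in for {a, b, c, d, e}
print(list(accumulate(data, operator.add)))  # [1, 3, 6, 10, 15]
print(list(accumulate(data, operator.mul)))  # [1, 2, 6, 24, 120]
```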
Many more implementations of different algorithms in CUDA can be found on the NVIDIA website.
Well, a very conservative paradigm shift is from thread-centric concurrency (share everything) towards process-centric concurrency (address-space separation). This way one can avoid unintended data sharing and it's easier to enforce a communication policy between different sub-systems.
This idea is old and was propagated (among others) by the Micro-Kernel OS community to build more reliable operating systems. Interestingly, the Singularity OS prototype by Microsoft Research shows that traditional address spaces are not even required when working with this model.
The relatively new idea I like best is transactional memory: avoid concurrency issues by making sure updates are always atomic.
Have a looksee at OpenMP for an interesting variation.

Resources