How to use Rmpi in R on a Linux cluster to increase the cores available to DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available with detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 logical cores; it then sends the hydrological model out to each core with different parameter values, and the results are returned to the master. It does this hundreds or thousands of times, iterating the parameters to try to find an optimum set. Therefore, the more cores available, the faster it works.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads); if I used one of these it would run two to three times faster than my PC. Great. However, ideally I would spread the code across other servers, so I could have access to more cores with all the results sent back to the master.
Therefore, my question is: how can I include Rmpi in my code to effectively increase the number of cores available? As you can probably tell, I am quite new to using clusters.
Many thanks, Antony

If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(DEoptim)     # needed for the DEoptim() call below
library(doParallel)
library(Rmpi)
# one worker per MPI slot, reserving one slot for the master process
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually... (Genrose, n and maxIt are example placeholders)
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option (for example, mpirun -np 1 Rscript yourscript.R). If you don't use mpirun, the workers will all be spawned on the local machine.


Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommend running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't use it. (Per this HF forums thread)
I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all the scaling advantage.
So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news, for smaller workloads or for anyone wanting to test before scaling up, is that SMDistributed doesn't support all multi-GPU instance types - no p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data-parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce, more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to workers. In order to launch the DDP processes in a SageMaker Training job you have several options:
1. If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (which is broken, by the way).
2. If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (it's a recent library that is still a bit rough to learn and get working; see all my issues here). Ray Train should work on multi-node too.
3. If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here: https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
4. You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses it or pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without the need to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go with option (3), which is the closest to what the PyTorch community does and so provides an easier development and debugging path. A minimal launcher sketch follows below.
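For illustration, here is a minimal sketch of such a launcher, assuming a SageMaker PyTorch container (which exposes the SM_HOSTS and SM_CURRENT_HOST environment variables) and a Trainer script named train.py; the script name and port are hypothetical, and this is a sketch of the pattern, not an official recipe. SageMaker runs the entry point once per instance, so this wrapper runs on every node:

# launcher.py - hypothetical wrapper for option (3): build and run the
# torch.distributed.launch command from SageMaker's environment variables.
import json
import os
import subprocess
import sys

import torch

hosts = json.loads(os.environ["SM_HOSTS"])      # e.g. ["algo-1", "algo-2"]
current_host = os.environ["SM_CURRENT_HOST"]    # name of this instance

cmd = [
    sys.executable, "-m", "torch.distributed.launch",
    "--nproc_per_node", str(torch.cuda.device_count()),
    "--nnodes", str(len(hosts)),
    "--node_rank", str(hosts.index(current_host)),
    "--master_addr", hosts[0],    # let the first host coordinate the handshake
    "--master_port", "29500",     # arbitrary free port, same on all nodes
    "train.py",                   # your HF Trainer script (hypothetical name)
] + sys.argv[1:]                  # forward any hyperparameters

subprocess.check_call(cmd)

Each node computes its own node_rank from its position in SM_HOSTS, which is what lets the same wrapper run unmodified on every instance.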
Notes:
PyTorch DDP evolves fast. In PT 1.10, torch.distributed.launch is superseded by torchrun, and a torchX tool is being created to... simplify things!
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are the important PT DDP concepts to understand in order to turn single-GPU code into multi-machine code.
Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPU.
In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data-parallel replica, a process, or a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the M x N GPUs spread over your M machines, you must specify to PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is done respectively by the parameters --nnodes and --nproc_per_node of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and M-1 on each of the machines, and --master_addr and --master_port, optional if you run a single-machine cluster, which must be the same across all machines.
In the init_process_group DDP initialization method running from within each data-parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate a unique ID to each script, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model just from one card, or running validation only on one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data-parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID, unique within the machine it's running on. This is called the local rank, and it can be set as an argument by torch.distributed.launch. In a cluster composed of 3 machines having 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3.
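To make the rank bookkeeping concrete, here is a hedged skeleton of a per-replica script, assuming it was started by torch.distributed.launch as above. The launcher exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so init_process_group can read them from the environment instead of taking explicit world_size=/rank= arguments; the model is a stand-in:

# train.py skeleton - per-replica setup under torch.distributed.launch.
# Recent versions also export LOCAL_RANK; older ones pass a --local_rank
# CLI argument instead, so adapt accordingly.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")     # rank/world size read from env vars
local_rank = int(os.environ["LOCAL_RANK"])  # GPU ID within this machine
torch.cuda.set_device(local_rank)           # pin this replica to its own GPU

model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for a real model
model = DDP(model, device_ids=[local_rank])

# The global rank personalizes the work, e.g. only rank 0 saves checkpoints.
if dist.get_rank() == 0:
    torch.save(model.state_dict(), "model.pt")

dist.destroy_process_group()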

Will an 8-CPU cloud machine run 8x faster than a 1-CPU cloud machine without changes in the code?

I am a beginner and I have no clue yet about cloud computing, multithreading or multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering if a multi-CPU cloud machine or an 8+ core machine would run ANY CODE faster than my PC without any changes to the code.
Does the machine handle the distribution of tasks across the several CPUs (or the 8+ cores) by itself, or is it required to adapt the code (multithreading or multiprocessing)?
For the sake of argument, let's say I run a simple loop like the one below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 sec on my PC (I was running something else at the same time so I'm not sure this is accurate but my timing is irrelevant anyway).
Would the exact same code be faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links to learn about multiprocessing or multithreading.
Thank you for your help.
I'm no expert but I think it really depends on the platform that you're using to write and run your code. Some languages may support multi-threading/multi-processing natively and as such the code will run faster but others might not.
One thing is for certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post regarding multiprocessing/multithreading in Python looks good - Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies
Code will run faster only when it is written in a parallel fashion. The snippet in the text of your question is not written to be parallel, so it won't run any faster.
When a parallel program is being written, the programmer keeps in mind the target level of parallelization. A sequential program has parallelization level = 1. A program with N CPU-intensive threads runs most effectively on N processors (cores). A program with a high parallelization level may execute slower on a 2-4 core machine than the sequential variant.
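For illustration, a minimal multiprocessing rewrite of the snippet from the question might look like the sketch below. Note that for per-item work this cheap, process start-up and inter-process communication can easily eat the gains, so it may well run slower than the sequential loop - which is exactly the point above:

# A sketch only: parallelize the squaring loop across all cores.
from multiprocessing import Pool

def square(i):
    return i * i

if __name__ == "__main__":
    # Pool() starts one worker process per CPU core by default.
    with Pool() as pool:
        # A large chunksize keeps inter-process communication overhead down.
        squares = pool.map(square, range(10**8), chunksize=100_000)
    results = dict(enumerate(squares))

On a machine with more cores this spreads the CPU-bound part across workers, but the result transfer and dict construction remain sequential.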

GNU make - how to simulate multiple simultaneous jobs

I know that to allow make to run jobs in parallel, I use the command make --jobs=X, where X is usually equal to the number of cores (or twice that, or whatever).
I am debugging a makefile - it actually consists of many makefiles - to work with the --jobs=X option. Here's an example of why it currently doesn't:
T1:
    mkdir D1
    output_makefile.bat > ./D1/makefile
T2:
    cd D1
    make
Executing this with --jobs=X will lead to a race condition because T1 is not specified as a dependency of T2 and eventually T2 will get built ahead of T1; most of the bugs I need to fix are of this variety.
If X in --jobs=X is greater than the number of cores (logical or physical?), the number of jobs executed simultaneously will be capped at that number of cores.
My machine has 4 physical/8 logical cores but the build machine that will be running our builds will have as many as 64 cores.
So I'm concerned that just because my makefile (a) builds the final output correctly (b) runs without errors on my machine with --jobs=4 does not mean it'll run correctly and without errors with --jobs=64 on a 64-core machine.
Is there a tool that will simulate make executing in an environment that has more cores than the physical machine?
What about creating a virtual machine with 64 cores and run it on my 4-core machine; is that even allowed by VMPlayer?
UPDATE 1
I realized that my understanding of make was incorrect: the number of job slots make creates is equal to the --jobs=N argument and not the number of cores or threads my PC has.
However, this by itself doesn't necessarily mean that make will actually execute those jobs in parallel (using task-switching when there are fewer cores than jobs).
I need to confirm that ALL the jobs are being executed in parallel vs merely 'queued up' and waiting for the actively executing jobs to finish.
So I created a makefile with 16 targets - more than the number of threads or cores I have - where each recipe merely echoes the name of the target a configurable number of times.
make.mk
all: 1 2 3 4 ... 14 15 16
<target X>:
    @loop_output.bat $@

loop_output.bat
@FOR /L %%G IN (1,1,2048) DO @echo (%1-%%G)
The output will be something like
(16-1) <-- Job 16
(6-1400)
(12-334)
(1-1616) <-- Job 1
(4-1661)
(15-113)
(11-632)
(2-1557)
(10-485)
(7-1234)
(5-1530)
The format is Job#X-Echo#Y. The fact that I see (1-1616) after (16-1) means that make is indeed executing target 16 at the same time as target 1.
The alternative would be for make to finish the first batch of jobs (equal to the number of cores/threads) and then take on another batch of the same size, but that's not what's happening.
See my "UPDATE 1":
No special software or make tricks are required. Regardless of the number of cores you have, make will execute the jobs truly in parallel by spawning multiple processes and letting the OS multitask them just like any other processes.
Windows PITFALL #1: The version of GNU Make available on SourceForge is 3.81, which does NOT have the ability to execute using --jobs at all. You'll have to download version 4.2 and build it.
Windows PITFALL #2: The make 4.2 source will fail to build because of some header that VS2008 (and older) doesn't have. The fix is easy: replace the invocation of the missing symbol with its macro equivalent; it should be obvious what I'm talking about when you try to build it. (I forgot what the missing symbol was.)

Distributed Processing of Volumetric Image Data

For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans).
The detection program is a command line tool. If I run it on my local computer on a single file and single-threaded it takes about 10 seconds. Processing results are written to a text file.
A typical run would be:
10000 images with 300 MB each = 3TB
10 seconds on a single core = 100000 seconds = about 27 hours
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1TByte disks, CentOS Linux 7.2).
Theoretically the 480 cores should only need 3.5 minutes for the task.
I am considering using Hadoop, but it's not designed for processing binary data and it splits input files, which is not an option.
I probably need some kind of distributed file system. I tested using NFS and the network became a serious bottleneck. Each server should only process its locally stored files.
The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain if we need data locality, i.e. each node holding part of the data on a local HD and processing only its local data.
I regularly run large scale distributed calculations on AWS using Spot Instances. You should definitely use the cluster of 20 servers at your disposal.
You don't mention which OS your servers are using, but if it's Linux-based, your best friend is bash. You're also lucky that it's a command-line programme. This means you can use ssh to run commands directly on the servers from one master node.
The typical sequence of processing would be:
1. Run a script on the Master Node which sends and runs scripts via ssh on all the Slave Nodes
2. Each Slave Node downloads a section of the files from the Master Node where they are stored (via NFS or scp)
3. Each Slave Node processes its files, saving the required data via scp, mysql or text scrape
To get started, you'll need ssh access to all the Slaves from the Master. You can then scp files to each Slave, such as the scripts. If you're running on a private network, you don't have to be too concerned about security, so just set ssh passwords to something simple.
In terms of CPU cores, if the command-line program you're using isn't designed for multi-core, you can just run several ssh commands to each Slave. The best thing to do is run a few tests and see what the optimal number of processes is, given that too many processes might be slow due to insufficient memory, disk access or similar. But say you find that 12 simultaneous processes gives the fastest average time: then run 12 scripts via ssh simultaneously, as in the sketch below.
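As an illustration only, a master-node fan-out along those lines might look like the Python sketch below. The host names, the detect command and files.txt are all hypothetical; it assumes passwordless ssh and that the listed paths are reachable from every node (via NFS or a prior scp):

# fanout.py - deal the images across the slaves, 12 concurrent runs per host.
import subprocess
from itertools import cycle

HOSTS = [f"node{i:02d}" for i in range(1, 21)]  # the 20 servers (made-up names)
PROCS_PER_HOST = 12                             # found by testing, as suggested

files = [line.strip() for line in open("files.txt") if line.strip()]

# Deal the files round-robin over the hosts.
work = {host: [] for host in HOSTS}
for host, path in zip(cycle(HOSTS), files):
    work[host].append(path)

# One ssh per host; xargs -P throttles to PROCS_PER_HOST concurrent processes.
procs = []
for host, paths in work.items():
    p = subprocess.Popen(
        ["ssh", host, f"xargs -n 1 -P {PROCS_PER_HOST} detect"],
        stdin=subprocess.PIPE, text=True)
    p.stdin.write("\n".join(paths))   # file list fed to xargs via ssh stdin
    p.stdin.close()
    procs.append(p)

exit_codes = [p.wait() for p in procs]

Collecting the result text files back to the master (e.g. with scp, as suggested above) is left out for brevity.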
It's not a small job to get it all set up; once it is, however, you will forever be able to process in a fraction of the time.
You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and split chunks into lines, but you can write your own implementations of FileInputFormat and RecordReader. I've created a custom FileInputFormat for another purpose - I had the opposite problem, splitting input data more finely than the default - but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 plus https://www.timofejew.com/hadoop-streaming-whole-files/
On the other hand, Hadoop is a heavy beast. It has significant overhead for mapper start-up, so the optimal running time for a mapper is a few minutes; your 10-second tasks are too short. Maybe it is possible to create a cleverer FileInputFormat that can interpret a bunch of files as a single file and feed the files as records to the same mapper, but I'm not sure.

Matlab 2011a Use all Cores Available on 64 bit Linux?

Hi, I've looked online but I can't seem to find the answer: do I need to do anything to make MATLAB use all cores? From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to be utilising both cores at 100% - not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything in 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.
As standard, since the latest release, you can use up to 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores can be accessed with the MATLAB Distributed Computing Server, where you actually pay per number of worker threads.
To make MATLAB use your multiple cores you have to run
    matlabpool open
And of course it works better if you actually have parallelised code (like using the spmd function or parfor loops).
More info at the MATLAB homepage.
MATLAB has only one single thread for computation.
That said, multiple threads will be created for certain functions which use the multithreaded features of the BLAS libraries that it uses underneath.
Thus, you will only be able to gain a 'multi-threaded' advantage if you are calling functions which use these multi-threaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, as for the use of your cores, that depends on your OS. I believe the OS has to load-balance your threads across all cores. One CANNOT set thread affinities from within MATLAB. One can, however, set worker MATLAB processes to have core affinities from within the Parallel Computing Toolbox.
However, you could always try manually setting the affinity of the MATLAB process to all your processors, per the details available at the following link for Linux.
Windows users can simply right click on the process in the task manager and set affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.
