How to optimize a multithreaded program for use in LSF? - multithreading

I am working on a multithreaded number-crunching app, let's call it myprogram. I plan to run myprogram on IBM's LSF grid. LSF allows a job to be scheduled on CPUs from different machines. For example, bsub -n 3 ... myprogram ... can allocate two CPUs from node1 and one CPU from node2.
I know that I can ask LSF to allocate all 3 cores in the same node, but I am interested in the case where my job is scheduled onto different nodes.
How does LSF manage this? Will myprogram run as two separate processes, one on node1 and one on node2?
Does LSF automatically manage data transfer between node1 and node2?
Is there anything I can do in myprogram to make this easier for LSF to manage? Should I be making use of any LSF libraries?

Answer to Q1
When you submit a job like bsub -n 3 myprogram, all LSF does is allocate 3 slots across 1-3 hosts. One of these hosts will be designated as the "first execution host", and LSF will dispatch and run a single instance of myprogram on that host.
If you want to run myprogram in parallel, LSF has a command called blaunch which essentially launches one instance of a program per allocated slot. For example, submitting your job as bsub -n 3 blaunch myprogram will run 3 instances of myprogram.
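For instance, here is a hedged sketch of such a submission (the -o output file name is just an example; %J expands to the job ID):
# reserve 3 slots, possibly spread over several hosts, and start one copy of myprogram per slot
bsub -n 3 -o myprogram.%J.out blaunch myprogram
Inside the job, the environment variable LSB_MCPU_HOSTS describes the allocation (e.g. "node1 2 node2 1"), which each instance can inspect if it needs to know where its siblings are running.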
Answer to Q2
By "manage data transfer" I assume you mean communication between the instances of myprogram. The answer is no, LSF is a scheduling and dispatching tool. All it does is allocation and dispatch, but it has no knowledge of what the dispatched program is doing. blaunch in turn is simply a task launcher, it just launches multiple instances of a task.
What you're after here is some kind of parallel programming framework like MPI (see for example www.openmpi.org). This provides a set of APIs and commands that allow you to write myprogram in a parallel fashion.
Once you've done that and turned your program into mympiprogram, you can submit it to LSF like bsub -n 3 mpirun mympiprogram. The mpirun tool - at least in the case of Open MPI (and some others) - integrates with LSF and uses the blaunch interface under the hood to launch your tasks for you.
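For larger runs it's common to put the options in a job script and submit it with bsub < job.lsf; here is a minimal sketch, assuming Open MPI and a placeholder program name:
#!/bin/bash
#BSUB -n 3
#BSUB -o mympiprogram.%J.out
# mpirun discovers the LSF allocation itself, so no explicit host file is needed
mpirun mympiprogram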
Answer to Q3
You don't need to use LSF libraries in your program to make anything easier for LSF; as I said, what's going on inside the program is transparent to the system. The LSF libraries just enable your program to become a client of the LSF system (submit jobs, run queries, etc.).

Related

linux taskset: Does a thread of a multi-thread process always run on a particular core?

I use taskset to run a multi-threaded process on a Linux host, like this:
taskset -c 1,2 ./myprocess
Will a particular thread always run on a particular CPU? For example, will thread 1 always run on CPU 1, or will it run on CPU 1 or CPU 2 at different times?
No, the mask is applied to the whole process, and threads can move between (the restricted list of) cores. If you want threads not to move, then you need to set the affinity of each thread separately (e.g. using pthread_setaffinity_np). Note that you can check the affinity of the threads of a given process with the great hwloc tool (hwloc-ps -t).
Note that some libraries/frameworks have ways to do that more easily. This is the case for OpenMP programs where you can use environment variables like OMP_PLACES to set the affinity of each thread.
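For example, here is a minimal sketch for an OpenMP program (myprocess is a placeholder; OMP_PROC_BIND is another standard OpenMP variable), followed by a check with hwloc:
# pin one OpenMP thread per core and keep the threads close together
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./myprocess &
# verify which cores the threads actually landed on
hwloc-ps -t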

How to run binary executables in multi-thread HPC cluster?

I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and limits the memory to less than 2 GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated memory?
The scheduler just runs the binary provided by you on the first node allocated. The onus of splitting the job and running it in parallel is on the binary. Hence, you see that you are using one core out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary you are submitting as a job to the cluster has some mechanism to discover the nodes that have been allocated (interaction with the job scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS, etc.).
If it is parallelized, submitting the binary through a job submission script (through a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
Running black box serial binaries in parallel
If not, the only other way to distribute the workload across the resources is data parallelism, wherein you use the cluster to supply multiple inputs to the same binary and run the processes in parallel to effectively reduce the time taken to solve the problem.
You can set the granularity based on the memory required for each run. For example, if each process needs 1 GB of memory, you can run 16 processes per node (assuming a node with 16 cores and 16 GB of memory).
The parallel submission of multiple inputs on a single node can be done with GNU Parallel. You can then submit multiple jobs to the cluster, with each job requesting one node (with exclusive access and the parallel tool) and working on different input elements.
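As a rough sketch, assuming a hypothetical wrapper script run_one_input.sh that runs cgatools on a single input file, the per-node fan-out could look like:
# run up to 16 concurrent copies on this node, one input file each
parallel -j 16 ./run_one_input.sh ::: input_*.tsv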
If you do not want to launch n separate jobs, you can use the mechanisms provided by the scheduler, such as blaunch, to specify dynamically the machine on which each task is supposed to run. You can parse the names of the machines allocated by the scheduler and use blaunch from a script to emulate the submission of n jobs from the first node.
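A sketch of that idea on LSF, reusing the same hypothetical wrapper and assuming one input file per allocated slot:
# LSB_HOSTS holds one host name per allocated slot, e.g. "node1 node1 node2"
i=0
for host in $LSB_HOSTS; do
    i=$((i+1))
    blaunch "$host" ./run_one_input.sh "input_${i}.tsv" &
done
wait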
Note: this class of applications is often better off running on a cloud-like setup instead of typical HPC systems [effective utilization of the cluster at all levels of available parallelism (cluster, thread and SIMD) is a key part of HPC].

Multithreading on SLURM

I have a Perl script that forks using the Parallel::ForkManager module.
To my knowledge, if I fork 32 child processes and ask the SLURM scheduler to run the job on 4 nodes with 8 processors per node, the code will execute one child process on each core.
Someone in my lab said that if I run a job on multiple nodes that the other nodes are not used, and I'm wasting time and money. Is this accurate?
If I use a script that forks am I limited to one node with SLURM?
As far as I know Parallel::ForkManager doesn't make use of MPI, so if you're using mpirun I don't see how it's going to communicate across nodes. A simple test is to have each child output its hostname.
One thing that commonly happens with non-MPI software launched with mpirun is that you duplicate all your effort across all nodes, so that they are all doing the exact same thing instead of sharing the work. If you use Parallel::MPI it should work just fine.
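To make the point concrete, a forking (non-MPI) Perl script can only use the cores of the node it starts on, so a sketch of an honest request looks like this (forker.pl is a placeholder name):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
# all 32 forked children share this single node; asking for 4 nodes would leave 3 of them idle
perl forker.pl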

Will forking more workers allow me to balance CPU-heavy work?

I love node.js' evented model, but it only takes you so far - when you have a function (say, a request handler for HTTP connections) that does a lot of heavy work on the CPU, it's still "blocking" until the function returns. That's to be expected. But what if I want to balance this out a bit, so that a given request takes longer to process but the overall response time is shorter, using the operating system's ability to schedule the processes?
My production code uses node's wonderfully simple Cluster module to fork a number of workers equal to the number of cores the system's CPU has. Would it be bad to fork more than this - perhaps two or three workers per core? I know there'll be a memory overhead here, but memory is not my limitation. What reading I did mentioned that you want to avoid "oversubscribing", but surely on a modern system you're not going crazy by having two or three processes vying for time on the processor.
I think your idea sounds like a good one; especially because many processors support hyperthreading. Hyperthreading is not magical and won't suddenly double your application's speed or throughput but it can make sense to have another thread ready to execute in a core when the first thread needs to wait for a memory request to be filled.
Be careful when you start multiple workers: the Linux kernel really prefers to keep processes executing on the same processor for their entire lifetime to provide strong cache affinity. This makes enough sense. But I've seen several CPU-hungry processes vying for a single core or, worse, a single hyperthread instance, rather than the system rebalancing the processes across all cores or all siblings. Check your processor affinities by running ps -eo pid,psr,comm (or whatever your favorite ps(1) command is; add the psr column).
To combat this you might want to start your workers with an explicitly limited CPU affinity:
taskset -c 0,1 node worker 1
taskset -c 2,3 node worker 2
taskset -c 4,5 node worker 3
taskset -c 6,7 node worker 4
Or perhaps start eight, one per HT sibling, or eight and confine each one to their own set of CPUs, or perhaps sixteen, confine four per core or two per sibling, etc. (You can go nuts trying to micromanage. I suggest keeping it simple if you can.) See the taskset(1) manpage for details.

Using qsub (SGE) with multi-threaded applications

I wanted to submit a multi-threaded job to the cluster network I'm working with, but the man page for qsub is not clear on how this is done. By default I guess it just sends it off as a normal job regardless of the multi-threading, but this might cause problems, i.e. sending many multi-threaded jobs to the same computer and slowing things down.
Does anyone know how to accomplish this? Thanks.
The batch server system is SGE.
In SGE/UGE the configuration is set by the administrator, so you have to check what they've called the parallel environments:
qconf -spl
That might list, for example:
make
our_paraq
Look for the one with $pe_slots in its configuration:
qconf -sp make
qconf -sp our_paraq
Then qsub with that environment and the number of cores you want to use:
qsub -pe our_paraq 8 -cwd ./myscript
If you're using MPI you have more choices for the allocation rule in the config ($pe_slots above), like $round_robin and $fill_up, but this should get you going.
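For reference, a $pe_slots environment as printed by qconf -sp looks roughly like this (abridged sketch; the values are only examples):
pe_name            our_paraq
slots              999
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE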
If your job is multithreaded, you can harness the advantage of multithreading even in SGE. In SGE a single job can use one or many CPUs. If you submit a job that requests a single processor while your program creates more threads than a single processor can handle, problems occur. Verify how many processors your job requests and how many threads per CPU your program creates.
In my case I have a Java program that uses one processor with two threads, and it works pretty efficiently. I submit the same Java program for execution on many CPUs, with 2 threads each, to make it parallel, as I have not used MPI.
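As a sketch of that pattern, assuming a pe_slots-style environment named our_paraq and a hypothetical wrapper run_java.sh that starts the 2-thread program:
# reserve 2 slots on one host so the program's 2 threads each get a CPU
qsub -pe our_paraq 2 -cwd ./run_java.sh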
The answer by the user "j_m" is very helpful, but in my case I needed to both request multiple cores AND submit my job to a specific node. After a copious amount of searching, I finally found a solution that worked for me and I'm posting it here so that other people who might have a similar problem don't have to go through the same pain (please note that I'm leaving this as an answer instead of a reply because I don't have enough reputation for making replies):
qsub -S /bin/sh -cwd -l h=$NODE_NAME -V -pe $ENV_NAME $N_OF_CORES $SCRIPT_NAME
I think the variables $NODE_NAME, $N_OF_CORES and $SCRIPT_NAME are pretty straightforward. You can easily find $ENV_NAME by following the answer by "j_m".
