Distributed Processing of Volumetric Image Data - linux

For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans).
The detection program is a command line tool. If I run it on my local computer on a single file and single-threaded it takes about 10 seconds. Processing results are written to a text file.
A typical run would be:
10000 images with 300 MB each = 3 TB
10 seconds per image on a single core = 100000 seconds = about 28 hours
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1TByte disks, CentOS Linux 7.2).
Theoretically the 480 cores should only need 3.5 minutes for the task.
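For reference, the arithmetic behind these figures can be written out as a small script. This is only a back-of-the-envelope sketch: it assumes perfect scaling across all cores and ignores I/O entirely, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope runtime estimate using the figures from the question.
N_IMAGES = 10_000
IMAGE_SIZE_MB = 300
SECONDS_PER_IMAGE = 10       # single core, single thread
SERVERS = 20
CORES_PER_SERVER = 24        # virtual cores


def estimate(n_images, seconds_per_image, cores):
    """Return (serial_hours, ideal_parallel_minutes), assuming perfect scaling."""
    serial_seconds = n_images * seconds_per_image
    return serial_seconds / 3600, serial_seconds / cores / 60


total_tb = N_IMAGES * IMAGE_SIZE_MB / 1_000_000
serial_hours, parallel_minutes = estimate(
    N_IMAGES, SECONDS_PER_IMAGE, SERVERS * CORES_PER_SERVER
)
print(f"data volume: {total_tb:.1f} TB")
print(f"single core: {serial_hours:.1f} h")
print(f"480 cores, ideal scaling: {parallel_minutes:.1f} min")
```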
I am considering using Hadoop, but it's not designed for processing binary data, and it splits input files, which is not an option.
I probably need some kind of distributed file system. I tested using NFS and the network becomes a serious bottleneck. Each server should only process its locally stored files.
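To get a feeling for why a central NFS share hurts here, one can estimate how long it takes just to move the data through a single link. The link speeds below are assumptions (the question does not state them), and protocol overhead is ignored, so treat the numbers as a lower bound.

```python
def transfer_hours(total_bytes, link_bits_per_second):
    """Time to push total_bytes through one link, ignoring protocol overhead."""
    return total_bytes * 8 / link_bits_per_second / 3600


TOTAL_BYTES = 10_000 * 300 * 10**6   # 3 TB, from the question
GIGABIT = 10**9                      # ASSUMPTION: link speed is not stated in the question

print(f"{transfer_hours(TOTAL_BYTES, GIGABIT):.1f} h over one 1 Gbit/s link")
print(f"{transfer_hours(TOTAL_BYTES, 10 * GIGABIT):.1f} h over one 10 Gbit/s link")
```

Under these assumptions, serving all files from one machine takes hours of pure transfer time, far longer than the ideal compute time on 480 cores.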
The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain whether we need data locality, i.e. each node holds part of the data on a local HD and processes only its local data.

I regularly run large scale distributed calculations on AWS using Spot Instances. You should definitely use the cluster of 20 servers at your disposal.
Your servers run CentOS Linux, so your best friend is bash. You're also lucky that it's a command line programme. This means you can use ssh to run commands directly on the servers from one master node.
The typical sequence of processing would be:
1. Run a script on the Master Node which sends and runs scripts via ssh on all the Slave Nodes.
2. Each Slave Node downloads a section of the files from the Master Node, where they are stored (via NFS or scp).
3. Each Slave Node processes its files, saving the required data via scp, mysql or text scrape.
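The first step of that sequence, deciding which node gets which files and building the ssh command for it, could be sketched roughly like this. The host names, file paths and the tool name `detect` are placeholders, not anything from the question, and the actual launch is left as a comment.

```python
import shlex


def partition(files, hosts):
    """Assign files to hosts round-robin; returns {host: [files]}."""
    plan = {host: [] for host in hosts}
    for i, path in enumerate(files):
        plan[hosts[i % len(hosts)]].append(path)
    return plan


def ssh_command(host, files, tool="detect"):
    """Build the argv that runs the tool on one host, one file after another.

    'detect' is a placeholder name for the detection program.
    """
    remote = "; ".join(f"{tool} {shlex.quote(f)}" for f in files)
    return ["ssh", host, remote]


hosts = [f"node{i:02d}" for i in range(1, 21)]               # hypothetical host names
files = [f"/data/scan_{i:05d}.img" for i in range(10_000)]   # hypothetical paths
plan = partition(files, hosts)
print(len(plan["node01"]), "files for node01")
print(ssh_command("node01", plan["node01"][:2]))
# To actually launch: subprocess.Popen(ssh_command(host, chunk)) for each host.
```

Round-robin is just the simplest possible assignment. If the files already live on specific nodes, the plan would be built from what each node holds instead.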
To get started, you'll need to have ssh access to all the Slaves from the Master. You can then scp files, such as the script, to each Slave. Even on a private network, where you don't have to be too concerned about security, key-based login is simpler than passwords because the scripts can then run unattended.
In terms of CPU cores, if the command line program you're using isn't designed for multi-core, you can just run several ssh commands to each Slave. The best thing to do is run a few tests and see what the optimal number of processes is, given that too many processes might be slow due to insufficient memory, disk access or similar. But say you find that 12 simultaneous processes give the fastest average time, then run 12 scripts via ssh simultaneously.
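A quick way to run such a test on one node is to time the same batch of commands with different numbers of simultaneous processes. The sketch below uses a dummy sleep command as a stand-in workload; you would swap in the real detection command and a handful of real files. The function name `run_batch` is made up for this example.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(commands, workers):
    """Run each argv list as a subprocess, at most `workers` at a time.

    Returns (elapsed_seconds, list_of_exit_codes).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda argv: subprocess.run(argv).returncode, commands))
    return time.perf_counter() - start, codes


# Stand-in workload: replace with e.g. ["detect", path] for real files.
commands = [[sys.executable, "-c", "import time; time.sleep(0.1)"] for _ in range(8)]

for workers in (1, 2, 4, 8):
    elapsed, codes = run_batch(commands, workers)
    failures = sum(code != 0 for code in codes)
    print(f"{workers:2d} workers: {elapsed:.2f} s, failures: {failures}")
```

Threads are enough here because the real work happens in the child processes; the threads only wait for them.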
It's not a small job to get it all set up; however, once it is done you will be able to process in a fraction of the time.

You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and chunks into lines, but you can write your own implementations of FileInputFormat and RecordReader. I've created a custom FileInputFormat for another purpose (I had the opposite problem: splitting input data more finely than the default), but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 plus https://www.timofejew.com/hadoop-streaming-whole-files/
On the other hand, Hadoop is a heavy beast. It has significant overhead for starting a mapper, so the optimal running time for a mapper is a few minutes. Your tasks are too short. Maybe it is possible to create a cleverer FileInputFormat that interprets a bunch of files as a single file and feeds the files as records to the same mapper; I'm not sure.
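Independent of how the batching would be wired into Hadoop, the sizing itself is simple arithmetic: with 10 seconds per file, a task needs a certain number of files to run for a few minutes. A sketch, where the 3-minute target and the file names are assumptions chosen for illustration:

```python
import math


def files_per_task(seconds_per_file, target_task_seconds):
    """How many files one task must process to run for about the target time."""
    return max(1, math.ceil(target_task_seconds / seconds_per_file))


def make_batches(paths, batch_size):
    """Split the list of paths into consecutive batches of at most batch_size."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


# ASSUMPTION: a 3 minute target per task; 10 s per file is from the question.
size = files_per_task(seconds_per_file=10, target_task_seconds=180)
paths = [f"scan_{i:05d}.img" for i in range(10_000)]   # hypothetical names
batches = make_batches(paths, size)
print(size, "files per task,", len(batches), "tasks")
```

How each batch is then handed to a mapper (for example as a list of paths) is left open here.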

Related

ec2 linux performance in bash scripts of using $PATH vs /FULL/NAME vs ALIAS

I am considering the speed/memory cost of scaling a large number of small bash scripts. The permutations of scripts calling each other will be quite large, <=5000 scripts.
Is there a benchmark or way of profiling best performance for using a particular method to call each script, say
searching $PATH (many deep directories could need to be added to $PATH)
using MYFILE=/full/directory/path/file.sh in a number of config files
using a large number of alias commands to store the more heavily used scripts
The target system will be a scaling cluster of m8.xl type ec2 instances that are general compute optimized with standard ssd i/o.
The reason for using this method is that the scripts will be developed by several teams, and the only common reference document is a basic db with entries such as function=myfile -p param, return=x, etc. So if a developer needs a function or script, they would call the script name and one of the methods above would point to the actual script.
UPDATE
To clarify, the speed of launching the script is not that important if it is only marginally faster to use a method. The concern is the weight of script names being housed in memory having a detrimental effect on the overall performance.
Several hundred or thousand scripts may be spawned in a burst over a few seconds, so the peak load is a concern.
Thx
Art

How to use Rmpi in R on linux Cluster to increase cores available with DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available using 'detectCores()'. On my PC I have 4 cores with 2 threads each, so it detects 8 cores and then sends out the hydrological model to a core with different values of parameters, and the results are returned to the centre. It does this hundreds or thousands of times and iterates the parameters to try and find an optimum set. Therefore the more cores available, the faster it will work.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code across other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is how could I include Rmpi in my code to effectively increase the cores available. As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
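library(DEoptim)
library(doParallel)
library(Rmpi)
# Start one worker per available MPI slot, keeping one slot for the master
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually... (Genrose, n and maxIt are assumed to be defined elsewhere)
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))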
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.

NFS time gap too long

I've got 2 machines that exchange data via NFS, using 2 different files of about 20 bytes each. The client writes its file, and the server reads and deletes it; then the server writes its own file, and the client reads and deletes it. And so on. The 2 files always have the same names.
It was all OK. Both machines run Linux 2.4. Recently, I added another client, which runs Linux 2.6. It works in the same way; it only uses files with different names.
The problem is that the new client sees the file from the server about 40 seconds after it is written. I can wait 4-5 or even 10 seconds, but not 40.
I've tried to mount the remote partition with -o vers=2 or -o vers=3, but with no effect.
Then I tried echo 3 > /proc/sys/vm/drop_caches (see NFS cache-cleaning command?), also with no effect.
What can I do to reduce the time gap?
You can try a listen-notify approach, using inotify to monitor filesystem events.
"The inotify API provides a mechanism for monitoring file system events. Inotify can be used to monitor individual files, or to monitor directories. When a directory is monitored, inotify will return events for the directory itself, and for files inside the directory." (man page)
"Q: Can I watch sysfs (procfs, nfs...)? Simply spoken: yes, but with some limitations. These limitations vary between kernel versions and tend to get smaller. Please read information about particular filesystems." (FAQ page)
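This may well decrease the time gap. Be aware, though, that inotify only reports events triggered through the local filesystem API, so changes made by the remote machine over NFS may go unnoticed.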

Profiling very long running tasks

How can you profile a very long running script that is spawning lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on an Amazon EC2 instance.
(It is doing cufflinks DNA alignment and stuff.)
The job is executing lots of processes, scripts and utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation per process per second would probably suffice. How can we obtain it?
There are many solutions to your question:
munin is a great monitoring tool that can track almost everything in your system and make nice graphs of it :). It is very easy to install and use.
atop could be a simple solution: it can sample CPU, memory and disk usage at regular intervals and store all that information in a file (the -w option). You then have to analyze the file to find the bottleneck.
sar can record nearly everything on your system, but its output is a little harder to interpret (you'll have to make the graphs yourself, with RRDtool for example).
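If none of these tools is available, a bare-bones "CPU per process per second" sampler can be put together from /proc on Linux. The sketch below is not how munin, atop or sar work internally; it simply reads the utime and stime fields of /proc/<pid>/stat once per interval, and all function names are made up for this example.

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second


def parse_stat(line):
    """Parse one /proc/<pid>/stat line into (pid, command, cpu_ticks).

    The command name is wrapped in parentheses and may contain spaces,
    so split on the last ')' rather than on whitespace.
    """
    head, tail = line.rsplit(")", 1)
    pid, comm = head.split(" (", 1)
    fields = tail.split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return int(pid), comm, utime + stime


def snapshot():
    """Return {pid: (command, cpu_ticks)} for all currently visible processes."""
    result = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/stat") as fh:
                    pid, comm, ticks = parse_stat(fh.read())
                result[pid] = (comm, ticks)
            except OSError:
                pass  # process exited between listdir() and open()
    return result


def sample(interval=1.0):
    """Yield (timestamp, pid, command, cpu_percent) for processes that used CPU."""
    before = snapshot()
    while True:
        time.sleep(interval)
        after = snapshot()
        now = time.time()
        for pid, (comm, ticks) in after.items():
            # A pid not seen before started during this interval: count all its ticks.
            used = ticks - before.get(pid, (comm, 0))[1]
            if used > 0:
                yield now, pid, comm, 100.0 * used / CLK_TCK / interval
        before = after
```

Looping over `sample()` and writing each row to a log file gives a time series that can be summed per command afterwards. Processes that start and finish entirely between two samples are missed, which matters for a job that spawns many short-lived utilities.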

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine, after reading a file from the 'incoming' folder on the 1.5TB hard drive, will process the information and write the result back to the 'processed' folder on the same drive. The processed information for every file is of roughly the same size as the input file (about 2MB per file).
Are there any dos and don'ts when building such an operation? Is it a problem to have 30 machines or so read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file), you'll reach a point beyond which adding more processes only slows matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
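The "work locally, touch the shared disk as little as possible" advice above could look roughly like this on each worker. It is a sketch only: `process_locally` and the `.part` suffix are made-up names, and `work` stands for whatever processing is done per file.

```python
import shutil
import tempfile
from pathlib import Path


def process_locally(src, dest_dir, work):
    """Copy src to local scratch, run work(local_in, local_out), copy result back.

    The shared drive is touched only twice: one sequential read, one sequential write.
    """
    src, dest_dir = Path(src), Path(dest_dir)
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / src.name
        local_out = Path(scratch) / (src.name + ".out")
        shutil.copy2(src, local_in)          # one read from the shared drive
        work(local_in, local_out)            # all intermediate I/O stays local
        final = dest_dir / src.name
        partial = dest_dir / (src.name + ".part")
        shutil.copy2(local_out, partial)     # one write to the shared drive
        partial.replace(final)               # rename last, so no half-written result is visible
    return final
```

Writing to a temporary name and renaming at the end is a design choice rather than a requirement; it keeps anything that watches the 'processed' folder from picking up incomplete files.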
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?

Resources