Any C++ libraries to run a single program on multiple PC's (i.e. "use grid computing to run my app") - visual-c++

I'm after a method of converting a single program to run on multiple computers on a network (think "grid computing").
I'm using MSVC 2007 and C++ (non-.NET).
The program I've written is ideally suited for parallel programming (its doing analysis of scientific data), so the more computers the better.

The classic answer for this would be MPI (Message Passing Interface). It requires a bit of work to get your program to work well with message passing, but the end result is that you can easily launch your executable across a cluster of machines that are running a MPI daemon.
There are several implementations. I've worked with MPICH, but I might consider doing this with Boost MPI (which didn't exist last time I was in the neighborhood).

Firstly, this topic is covered here:
https://stackoverflow.com/questions/2258332/distributed-computing-in-c
Secondly, a search for "C++ grid computing library", "grid computing for visual studio" and "C++ distributed computing library" returned the following:
OpenMP+OpenMPI. OpenMP handles the running of single C++ program on multiple CPU cores within the same machine, OpenMPI handles the messaging between multiple machines. OpenMP+OpenMPI=grid computing.
POP-C++, see http://gridgroup.hefr.ch/popc/.
Xoreax Grid Engine, see http://www.xoreax.com/high_performance_grid_computing.htm. Xoreax focuses on speeding up builds of Visual Studio, but the Xoreax Grid Engine can also be applied to generic applications. Looking at http://www.xoreax.com/xge_xoreax_grid_engine.htm quotes, we see the quote "Once a task-set (a set of tasks for distribution along with their dependency definitions) is defined through one of the interfaces described below, it can be executed on any machine running an IncrediBuild Agent.". See the accompanying CodeGuru article at http://www.codeproject.com/KB/showcase/Xoreax-Grid.aspx
Alchemi, see http://www.codeproject.com/KB/threads/alchemi.aspx.
RightScale, see http://www.rightscale.com/pdf/Grid-Whitepaper-Technical.pdf. A quote from the examples section of this paper: "Pharmaceutical protein analysis: Several million protein compound comparisons were performed in less than a day – a task that would have taken over a week on the customer’s internal resources ..."

Related

How to get the operating system's "Currently Executing Thread" handle (NOT the "Managed Thread ID") in .NET 6?

In .NET6 I want to retrieve the native handle (NOT the "Managed Thread ID") of the OS thread, on which the handle-retrieving function just runs, as a (possibly casted to) UInt32.
I found a solution for Windows (using the kernel's "GetCurrentWin32ThreadId"), but I want to have solutions also for Linux, MacOS and Android, assuming that the respective implicit OS' object models contain also "Thread Handles".
To avoid senseless reading time consuming tries to lead me on other paths: my question is very precise, please don't ask "why"s! And please avoid "you could try"s, because I don't have access to Linux-Computers, Macs, Smartphones, and don't want to bother others by intermediate tests and/or even "tries". I need concrete definitoric "code snippet" answers.
I need it 1. for debugging purposes, 2. for .NET-ManagedThreadPool monitoring (if it always works correctly), 3. cross-checking with the Visual Studio output (about finished threads) and 4. some other (also platform specific to be handled, native) functions/stuff (e.g. native thread coordination, cross-process).
My goal:
I want to deliver my program(s) [atm especially the "OpenSimulator"-software, including the server (Windows, Linux) as well as the user's viewer (Windows, Linux, MacOS, iOS)] with a target-platform-independent .NET6-".exe", and an OS-respective target-platform-specific .NET6-.dll as the respective implementation for certain interfaces, to bridge the yet current compatibility-gaps, something/somehow like MAUI tries to do, but generalized more complete on the logical (.NET6) layer.

Record dynamic instruction trace or histogram in QEMU?

I've written and compiled a RISC-V Linux application.
I want to dump all the instructions that get executed at run-time (which cannot be achieved by static analysis).
Is it possible to get a dynamic assembly instruction execution historgram from QEMU (or other tools)?
For instruction tracing, I go with -singlestep -d nochain,cpu, combined with some awk. This can become painfully slow and large depending on the code you run.
Regarding the statistics you'd like to obtain, delegate it to R/numpy/pandas/whatever after extracting the program counter.
The presentation or video of user "yvr18" on that topic, might cover some aspects of QEMU tracing at various levels (as well as some interesting heatmap visualization).
QEMU doesn't currently support that sort of trace of all instructions executed.
The closest we have today is that there are various bits of debug logging under the -d switch, and you can combine the tracing of "instructions translated from guest to native" with the "blocks of translated code executed" translation to work out what was executed, but this is pretty awkward.
Alternatively you could try scripting the gdbstub interface to do something like "disassemble instruction at PC; singlestep" which will (slowly!) give you all the instructions executed.
Note: There ongoing work to improve QEMU's ability to introspect guest execution so that you can write a simple 'plugin' with functions that are called back on events like guest instruction execution; with that it would be fairly easy to write a dump of guest instructions executed (or do more interesting processing), but this is still work-in-progress, so not available yet.
It seems you can do something similar with rv8 (https://github.com/rv8-io/rv8), using the command:
rv-jit -l
The "spike" RISC-V emulator allows tracing instructions executed, new values stored into registers, or just simply a histogram of PC values (from which you can extract what instruction was at each PC location).
It's not as fast as qemu, but runs at 100 to 200 MIPS on current x86 hardware (at least without tracing enabled)

Writing CUDA program for more than one GPU

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically? Utilizing resources of all available GPUs for the program.
A utility that may periodically report the available resource and my program will launch as many threads to GPUs.
Secondly, I am using Windows+ Visual Studio for my development. I have read that CUDA is supported on Linux. what changes do I need to do in my program?
I have more than GPUs and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically.
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that follows the above approach, approximately, is the cuda simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done".
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT) not arbitrary kernels.
Secondly, I am using Windows+ Visual Studio for my development. I have read that CUDA is supported on Linux. what changes do I need to do in my program?
With the exception of the OGL/DX interop APIs CUDA is mostly orthogonal to choice of windows or linux as a platform. The typical IDE's are different (windows: nsight Visual Studio edition, Linux: nsight eclipse edition) but your code changes will mostly consist of ordinary porting differences between windows and linux. If you want to get started with linux, follow the getting started document.

Matlab 2011a Use all Cores Available on 64 bit Linux?

Hi I've looked online but I can't seem to find the answer whether I need to do anything to make matlab use all cores? From what I understand multi-threading has been supported since 2007. On my machine matlab only uses one core #100% and the rest hang at ~2%. I'm using a 64 bit Linux (Mint 12). On my other computer which has only 2 cores and is 32 bit Matlab seems to be utilizing both cores #100%. Not all of the time but in sufficient number of cases. On the 64 bit, 4 core PC this never happens.
Do I have to do anything in 64 bit to get Matlab to use all the cores whenever possible? I had to do some custom linking after install as Matlab wasn't finding the libraries (eg. libc.so.6) because it wasn't looking in the correct places.
By standard, since the latest release, you can use 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores could be accessed by the MATLAB Distributed Computing Server, where you actually pay per number of worker threads.
To make matlab use your multiple cores you have to do
matlabpool open
And it of course works better if you actually have multithreaded code (like using the spmd function or parfor loops)
More info at the Matlab homepage
MATLAB has only one single thread for Computation.
That said, multiple threads would be created for certain functions which use the multithreaded features of the BLAS libraries that it uses underneath.
Thus, you would only be able to gain a 'multi threaded' advantage if you are calling functions which use these multi-threaded blas libraries.
This link has information on the list of functions which are multithreaded.
Now for the use of your cores, that would depend on your OS. I believe the OS would have to load balance your threads to be used on all cores. One CANNOT set affinities to threads from within MATLAB. One can however set worker MATLAB processes to have affinities to cores from within the Parallel Computing toolbox.
However, you could always try setting the affinity for the MATLAB process to all your processors manually by the details available at the following link for Linux
Windows users can simply right click on the process in the task manager and set affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.

How to make R use all processors?

I have a quad-core laptop running Windows XP, but looking at Task Manager R only ever seems to use one processor at a time. How can I make R use all four processors and speed up my R programs?
I have a basic system I use where I parallelize my programs on the "for" loops. This method is simple once you understand what needs to be done. It only works for local computing, but that seems to be what you're after.
You'll need these libraries installed:
library("parallel")
library("foreach")
library("doParallel")
First you need to create your computing cluster. I usually do other stuff while running parallel programs, so I like to leave one open. The "detectCores" function will return the number of cores in your computer.
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl, cores = detectCores() - 1)
Next, call your for loop with the "foreach" command, along with the %dopar% operator. I always use a "try" wrapper to make sure that any iterations where the operations fail are discarded, and don't disrupt the otherwise good data. You will need to specify the ".combine" parameter, and pass any necessary packages into the loop. Note that "i" is defined with an equals sign, not an "in" operator!
data = foreach(i = 1:length(filenames), .packages = c("ncdf","chron","stats"),
.combine = rbind) %dopar% {
try({
# your operations; line 1...
# your operations; line 2...
# your output
})
}
Once you're done, clean up with:
stopCluster(cl)
The CRAN Task View on High-Performance Compting with R lists several options. XP is a restriction, but you still get something like snow to work using sockets within minutes.
As of version 2.15, R now comes with native support for multi-core computations. Just load the parallel package
library("parallel")
and check out the associated vignette
vignette("parallel")
I hear tell that REvolution R supports better multi-threading then the typical CRAN version of R and REvolution also supports 64 bit R in windows. I have been considering buying a copy but I found their pricing opaque. There's no price list on their web site. Very odd.
I believe the multicore package works on XP. It gives some basic multi-process capability, especially through offering a drop-in replacement for lapply() and a simple way to evaluate an expression in a new thread (mcparallel()).
On Windows I believe the best way to do this would probably be with foreach and snow as David Smith said.
However, Unix/Linux based systems can compute using multiple processes with the 'multicore' package. It provides a high-level function, 'mclapply', that performs a list comprehension across multiple cores. An advantage of the 'multicore' package is that each processor gets a private copy of the Global Environment that it may modify. Initially, this copy is just a pointer to the Global Environment, making the sharing of variable extremely quick if the Global Environment is treated as read-only.
Rmpi requires that the data be explicitly transferred between R processes instead of working with the 'multicore' closure approach.
-- Dan
If you do a lot of matrix operations and you are using Windows you can install revolutionanalytics.com/revolution-r-open for free, and this one comes with the intel MKL libraries which allow you to do multithreaded matrix operations. On Windows if you take the libiomp5md.dll, Rblas.dll and Rlapack.dll files from that install and overwrite the ones in whatever R version you like to use you'll have multithreaded matrix operations (typically you get a 10-20 x speedup for matrix operations). Or you can use the Atlas Rblas.dll from prs.ism.ac.jp/~nakama/SurviveGotoBLAS2/binary/windows/x64 which also work on 64 bit R and are almost as fast as the MKL ones. I found this the single easiest thing to do to drastically increase R's performance on Windows systems. Not sure why they don't come as standard in fact on R Windows installs.
On Windows, multithreading unfortunately is not well supported in R (unless you use OpenMP via Rcpp) and the available SOCKET-based parallelization on Windows systems, e.g. via package parallel, is very inefficient. On POSIX systems things are better as you can use forking there. (package multicore there is I believe the most efficient one). You could also try to use package Rdsm for multithreading within a shared memory model - I've got a version on my github that has unflagged -unix only flag and should work also on Windows (earlier Windows wasn't supported as dependency bigmemory supposedly didn't work on Windows, but now it seems it does) :
library(devtools)
devtools::install_github('tomwenseleers/Rdsm')
library(Rdsm)

Resources