I have two datasets (with the same number of rows and columns) and a piece of code I have written that I want to run on both. There are obviously two options: sequential execution or parallel programming.
Now, the algorithm (code) I have written is a big one and consists of multiple for loops. Is there any way to use it directly on both datasets, or will I have to transform the code in some way? A heads-up would be great.
To answer your question: you do not have to transform the code to run it on two datasets in parallel; it should work fine as it is.
The need for parallel processing usually arises in two ways (for most users, I would imagine):
1. You have code you can run sequentially, but you would like to run it in parallel.
2. You have a function that takes very long to execute on a large dataset, and you would like to run it in parallel to speed it up.
For the first case, you do not have to do anything; you can execute it in parallel using one of the libraries designed for it, or simply run two instances of R on the same computer, with the same code but a different dataset in each.
It doesn't matter how many for loops you have in there, and you don't even need to have the same number of rows and columns in the datasets.
If it runs fine sequentially, it means there will be no dependence between the parallel chains and thus no problem.
Since your question falls into the first case, you can just run it in parallel.
If you have the second case, you can sometimes turn it into the first case by splitting your dataset into pieces (each of which you can process sequentially) and then running those pieces in parallel. This is easier said than done, and it won't always be possible. It is also why not all functions simply have a run.in.parallel=TRUE option: it is not always obvious how to split the data, nor is it always possible.
So you have already done most of the work by writing the function and splitting the data.
Here is a general way of doing parallel processing with one function, on two datasets:
library(doParallel)

cl <- makeCluster(2)   # for 2 processors, i.e. 2 parallel chains
registerDoParallel(cl)

datalist <- list(mydataset1, mydataset2)

# now start the chains
nchains <- 2   # for two processors
results_list <- foreach(i = 1:nchains,
                        .packages = c("packages_you_need")) %dopar% {
  result <- find.string(datalist[[i]])   # apply your function to dataset i
  return(result)
}

stopCluster(cl)   # release the workers when you are done
The result will be a list with two elements, each containing the results from one chain. You can then combine them as you wish, or supply a function via foreach's .combine argument. See the foreach help for details.
You can use this code any time you have a case like number 1 described above. Most of the time you can also use it for cases like number 2, if you spend some time thinking about how you want to divide the data, and then combine the results. Think of it as a "parallel wrapper".
It should work on Windows, GNU/Linux, and Mac OS, but I haven't tested it on all of them.
I keep this script handy whenever I need a quick speed-up, but I still always start out by writing code I can run sequentially. Thinking in parallel hurts my brain.
Related
I'm facing some performance issues executing a fuzzy match based on the Levenshtein distance algorithm.
I'm comparing two lists, a small one with 1k lines and a second one with 10k lines.
I have split the bigger list into 10 files of 1,000 lines each to check performance, but I noticed that Python is using only one thread.
I have googled many articles, and people explain how to execute TWO different functions in parallel.
I would like to know how to execute the SAME code in multiple threads.
For example: it takes 1 second to compare one word against 1,000 lines. I would like to split this work across 4 threads.
Is it possible?
Sorry for the long text and thanks a lot for your help!
Running the same code on the same data in two or more threads won't improve performance. You could instead split the task into chunks of 250 lines each, have each thread handle one chunk, and then combine the results at the end.
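To make that concrete, here is a minimal sketch in Python. One caveat: for CPU-bound pure-Python work like string comparison, threads won't speed things up because of the GIL, so the sketch uses worker processes instead. The data and the scoring function are hypothetical stand-ins (difflib's SequenceMatcher stands in for a Levenshtein-based score):

from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher  # stdlib stand-in for a Levenshtein score

def best_match(args):
    # compare one word against one chunk of lines; return the best-scoring line
    word, chunk = args
    return max(chunk, key=lambda line: SequenceMatcher(None, word, line).ratio())

if __name__ == "__main__":
    word = "example"
    lines = ["line %d of the big list" % i for i in range(1000)]  # hypothetical data

    n = 4
    chunks = [lines[i::n] for i in range(n)]  # split 1,000 lines into 4 chunks of 250

    with ProcessPoolExecutor(max_workers=n) as pool:
        winners = list(pool.map(best_match, [(word, c) for c in chunks]))

    # combine: pick the overall best among the per-chunk winners
    print(max(winners, key=lambda w: SequenceMatcher(None, word, w).ratio()))

Each chunk is scored independently, so comparing the per-chunk winners at the end is exactly the "compare the results at the end" step.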
I have a shell script that copies files into a location and another one that picks these up for further processing. I want to use multithreading in Scala to pick the files up in parallel, using a thread pool.
However, if there are two threads and two files, both threads pick up the same file. I have run the program many times, and it always ends up like this. I need the threads to pick up different files in parallel.
Can someone help me out? What approaches can I use? If you could point me in the right direction, that would be enough.
I think you can use a parallel sequence to do the processing in parallel.
You don't have to handle this logic yourself. For example, the code could look like this:
val newFiles: Seq[String] = listCurrentFilesNames()
newFiles.par.foreach { fileName =>
  processFile(fileName)
}
This code will be executed in parallel, and you can set the number of threads explicitly as described here: https://stackoverflow.com/a/37725987/2201566
You can also try using actors; for reference, see https://github.com/tsheppard01/akka-parallel-read-csv-file
I'm new to multi-threading in Matlab, so I guess that what I need to do will be simple for anyone with a little experience in it.
I have two functions f1 and f2 such that:
f1 - runs for about 10 seconds and returns accurate results.
f2 - returns estimated results immediately.
Both functions get the same input and return the same output (but in f1 the output is more accurate).
The functions get new inputs all the time and run constantly.
I want to run the functions in two different threads so that the user never needs to wait, and every time one of them finishes, the other one uses its results.
The goal is to get more accurate results than running f2 all the time. Running f1 all the time would be optimal, but then the user needs to wait 10 seconds each time. So I want to combine the two functions and get something in the middle.
What is the best way to implement the above?
I read about batch but I'm not sure it's what I need (at least I didn't find how to do the above with batch).
Can you just point me to the relevant functions? I'll read about them and learn how to use them.
Thanks
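The underlying pattern here is language-independent: submit the slow, accurate computation to a background worker, answer immediately with the fast estimate, and swap in the accurate value once it is ready. A minimal sketch of that pattern in Python (f1, f2, and the input stream are hypothetical stand-ins, not MATLAB APIs):

from concurrent.futures import ThreadPoolExecutor
import time

def f1(x):  # hypothetical slow, accurate function (~10 s)
    time.sleep(10)
    return x * 2.0

def f2(x):  # hypothetical fast, approximate function
    return x * 1.9

pool = ThreadPoolExecutor(max_workers=1)

for x in [1.0, 2.0, 3.0]:        # stand-in for the stream of new inputs
    future = pool.submit(f1, x)  # accurate result is computed in the background
    print("estimate:", f2(x))    # the user sees the fast estimate right away
    while not future.done():     # real code would keep serving the UI here
        time.sleep(0.5)
    print("accurate:", future.result())  # upgrade once the slow result arrives

pool.shutdown()

In MATLAB the equivalent building block would likely come from the Parallel Computing Toolbox, but the structure of the solution is the same.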
Shake builds things in parallel when possible, but what happens if an individual build step is itself parallelizable? For example I'm running BLAST commands. Each command compares two species' genomes. Several comparisons could be run in parallel, but there's also a flag to split a comparison into N chunks and run those in parallel. Do I need to pick one way of splitting the jobs up and stick with it, or can I tell Shake "Use N threads overall, and by the way each of these specific tasks takes up N threads on its own"?
(This comes up when comparing many small bacterial genomes and a few bigger eukaryotic ones)
EDIT: the question can be simplified to "how to tell how many Shake threads are currently running/queued from within Shake?"
No, but there is a ticket to add it: https://github.com/ndmitchell/shake/issues/603
I have a large-scale gradient descent optimization problem that I am running in Matlab. The code has two parts:
A sequential update part that fires every iteration and updates the parameter vector.
A validation-error computation part that fires every 10 iterations or so, using the parameter values from the end of the iteration in which it fires.
The way I am running this now is to do (1) and (2) sequentially. But (2) takes a lot of time, and it's not the core part of my routine - I added it just to check progress and plot the error of my model. Is it possible in Matlab to run (2) in parallel with (1)? Please note that (1) cannot itself be run in parallel, since it performs a sequential update, so a simple 'parfor' is not a solution unless there is a really smart way of doing that.
I don't think Matlab has any way of multi-threading outside of the (rather restricted) Parallel Computing Toolbox. There is a workaround that may help you, though:
Open 2 sessions of Matlab, sessions A and B (or instances, or workspaces, whatever you call them).
Matlab session A:
Calculates 10 iterations of your sequential process (1)
Saves the result in a file (adequately and uniquely named)
Goes on to calculate the next 10 iterations (back to the top of this loop, basically)
In parallel:
Matlab session B:
Checks periodically for the existence of the file written by session A (define a timer that does this at a time interval that makes sense for your process: a few seconds or a few minutes ...)
If the file exists, loads it, runs the validation computation (your process (2)), and displays/reports the results
Note: this only works if process (1) doesn't need the result of process (2) to run its iterations; but if it did, I don't know how you could parallelize it anyway.
If you have multiple cores on your machine, this should run smoothly; if you have a single core, the two sessions will have to share it and you will see a performance impact.
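The polling half of this workaround (session B) is simple enough to sketch. Purely as an illustration of the pattern, here it is in Python; the checkpoint file names and the validate() function are hypothetical stand-ins:

# session B: periodically check for result files written by session A
import glob
import time

def validate(path):
    # stand-in for loading the saved parameters and computing the validation error
    print("validating", path)

seen = set()
while True:  # in practice, stop once the optimization is finished
    for path in sorted(glob.glob("checkpoint_iter_*.dat")):  # hypothetical names
        if path not in seen:
            seen.add(path)
            validate(path)
    time.sleep(5)  # the timer from the answer: poll every few seconds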