I have a shell script that copies files into a location and another one that picks them up for further processing. I want to use multithreading in Scala, with a thread pool, to pick up the files in parallel.
However, if there are two threads and two files, both threads pick up the same file. I have run the program many times, and it always ends up like this. I need the threads to pick up different files in parallel.
Can someone help me out? What approaches can I use? If you could point me in the right direction that would be enough.
I think you can use a parallel collection to do the processing in parallel.
You don't have to handle this logic yourself. For example, the code could look like this:
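// On Scala 2.13+, .par needs the scala-parallel-collections module and this import:
// import scala.collection.parallel.CollectionConverters._
val newFiles: Seq[String] = listCurrentFilesNames()
newFiles.par.foreach { fileName =>
  processFile(fileName)
}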
This code will be executed in parallel, and you can set the number of threads to a specific value as described here: https://stackoverflow.com/a/37725987/2201566
You can also try using actors; for reference, see https://github.com/tsheppard01/akka-parallel-read-csv-file
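The question's own code is not shown, so the following is only a guess at the cause: if every thread lists the directory itself, all threads see the same first file. A minimal Python sketch of the alternative (the document's snippets are in several languages, so Python is used here as a neutral one) lists the files once and lets the pool hand each path to exactly one task. The helper names `list_current_files`, `process_file` and `process_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile
import threading

def list_current_files(directory):
    """Take one snapshot of the files currently waiting (hypothetical helper)."""
    return sorted(p for p in Path(directory).iterdir() if p.is_file())

def process_file(path):
    """Placeholder for the real processing; reports which thread handled the file."""
    return path.name, threading.current_thread().name

def process_all(directory, workers=2):
    files = list_current_files(directory)  # listed once, by a single thread
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() gives each path to exactly one task, so no file is picked up twice
        return list(pool.map(process_file, files))

# Usage: two files, two workers
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv"):
        (Path(tmp) / name).write_text("data\n")
    handled = process_all(tmp, workers=2)
```

The Scala `.par.foreach` version above works the same way: the sequence of file names is built once, and the collection divides its elements among the worker threads.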
Suppose I have a C program that creates threads for different tasks. Now I want to redirect the stdout of a certain thread using bash scripts.
Assume that I can always get the process ID and thread ID; I only want to know whether this is possible using bash scripts, and how.
Note: This is about a thread, not a process, and I haven't found any related questions yet.
There is only one stdout per process, not one per thread. So when 5 threads write to stdout in parallel, all of it goes into a single sink, interleaved in nondeterministic ways.
So unless each line contains a specific string that identifies the original thread, you can't take that output apart after the fact.
Alternatively, you could have your threads write to different files! When you don't throw random output together, it is much easier to get to the individual sources later on.
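The one-file-per-thread idea can be sketched as follows. This is a Python illustration rather than C, and the file naming scheme `thread-<id>.log` is an assumption made for the example; in C the same pattern would be one `fopen` per thread.

```python
import os
import tempfile
import threading

def worker(worker_id, out_dir):
    # Each thread owns its own file, so its lines never mix with another thread's
    path = os.path.join(out_dir, f"thread-{worker_id}.log")
    with open(path, "w") as log:
        for step in range(3):
            log.write(f"worker {worker_id} step {step}\n")

out_dir = tempfile.mkdtemp()
threads = [threading.Thread(target=worker, args=(i, out_dir)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

log_files = sorted(os.listdir(out_dir))
```

A bash script can then read or redirect any single thread's output by picking the matching file.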
If I have two datasets (with equal numbers of rows and columns) and I wish to run a piece of code I have written on them, there are obviously two options: sequential execution or parallel programming.
Now, the algorithm (code) that I have written is a big one and consists of multiple for loops. Is there any way to use it directly on both datasets, or will I have to transform the code in some way? A heads-up would be great.
To answer your question: you do not have to transform the code to run it on two datasets in parallel; it should work fine as it is.
The need for parallel processing usually arises in two ways (for most users, I would imagine):
1. You have code you can run sequentially, but you would like to do it in parallel.
2. You have a function that is taking very long to execute on a large dataset, and you would like to run it in parallel to speed it up.
For the first case, you do not have to do anything; you can just execute it in parallel using one of the libraries designed for it, or run two instances of R on the same computer and run the same code with a different dataset in each of them.
It doesn't matter how many for loops you have in there, and the datasets don't even need to have the same number of rows or columns.
If the code runs fine sequentially on each dataset separately, there is no dependence between the parallel chains and thus no problem.
Since your question falls in the first case, you can run it in parallel.
If you have the second case, you can sometimes turn it into the first case by splitting your dataset into pieces (where you can run each of the pieces sequentially) and then you run it in parallel. This is easier said than done, and won't always be possible. It is also why not all functions just have a run.in.parallel=TRUE option: it is not always obvious how you should split the data, nor is it always possible.
So you have already done most of the work by writing the functions, and splitting the data.
Here is a general way of doing parallel processing with one function, on two datasets:
library(doParallel)
cl <- makeCluster(2)  # for 2 processors, i.e. 2 parallel chains
registerDoParallel(cl)
datalist <- list(mydataset1, mydataset2)
# now start the chains
nchains <- 2  # for two processors
results_list <- foreach(i = 1:nchains,
                        .packages = c('packages_you_need')) %dopar% {
  result <- find.string(datalist[[i]])
  return(result)
}
stopCluster(cl)  # release the worker processes when done
The result will be a list with two elements, each containing the results from a chain. You can then combine it as you wish, or use a .combine function. See the foreach help for details.
You can use this code any time you have a case like number 1 described above. Most of the time you can also use it for cases like number 2, if you spend some time thinking about how you want to divide the data, and then combine the results. Think of it as a "parallel wrapper".
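The "parallel wrapper" idea, including the split-and-combine step for case 2, can be sketched in a few lines. This is a Python illustration under stated assumptions, not a translation of the R code: `find_string` is a stand-in for the answer's `find.string`, and `run_in_parallel` and `split_into_chunks` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor

def find_string(dataset):
    """Stand-in for the answer's find.string: count rows containing 'x'."""
    return sum(1 for row in dataset if "x" in row)

def run_in_parallel(func, datasets, workers=2):
    """Apply func to each dataset in its own worker; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, datasets))

def split_into_chunks(data, n):
    """Case 2: cut one large dataset into n roughly equal pieces."""
    size = max(1, -(-len(data) // n))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

# Case 1: two independent datasets, one result per dataset
results = run_in_parallel(find_string, [["ax", "b"], ["x", "xy", "c"]])

# Case 2: split, run each piece, then combine (here the combine step is a sum)
big = ["x1", "y", "x2", "z", "x3"]
total = sum(run_in_parallel(find_string, split_into_chunks(big, 2)))
```

In CPython, threads do not speed up CPU-bound work, so for heavy computation `ThreadPoolExecutor` would be swapped for `ProcessPoolExecutor`, which is closer to what `makeCluster` does in R. The combine step only works this simply when the pieces are independent, which is the caveat the answer raises for case 2.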
It should work in Windows, GNU/Linux, and Mac OS, but I haven't tested it on all of them.
I keep this script handy whenever I need a quick speed-up, but I still always start out by writing code I can run sequentially. Thinking in parallel hurts my brain.
Concise-ish problem explanation:
I'd like to be able to run multiple (we'll say a few hundred) shell commands, each of which starts a long running process and blocks for hours or days with at most a line or two of output (this command is simply a job submission to a cluster). This blocking is helpful so I can know exactly when each finishes, because I'd like to investigate each result and possibly re-run each multiple times in case they fail. My program will act as a sort of controller for these programs.
for all commands in parallel {
    submit_job_and_wait()
    tries = 1
    while ! job_was_successful and tries < 3 {
        resubmit_with_extra_memory_and_wait()
        tries++
    }
}
What I've tried/investigated:
So far I was thinking it would be best to create a thread for each submission which just blocks waiting for input. There is enough memory for quite a few waiting threads. But from what I've read, Perl threads are closer to duplicate processes than threads in other languages, so creating hundreds of them is not feasible (nor does it feel right).
There also seem to be a variety of event-loop-ish cooperative systems like AnyEvent and Coro, but these seem to require you to rely on asynchronous libraries, otherwise you can't really do anything concurrently. I can't figure out how to make multiple shell commands with it. I've tried using AnyEvent::Util::run_cmd, but after I submit multiple commands, I have to specify the order in which I want to wait for them. I don't know in advance how long each submission will take, so I can't recv without sometimes getting very unlucky. This isn't really parallel.
my $cv1 = run_cmd("qsub -sync y 'sleep $RANDOM'");
my $cv2 = run_cmd("qsub -sync y 'sleep $RANDOM'");
# Now should I $cv1->recv first or $cv2->recv? Who knows!
# Out of 100 submissions, I may have to wait on the longest one before processing any.
My understanding of AnyEvent and friends may be wrong, so please correct me if so. :)
The other option is to run the job submission in its non-blocking form and have it communicate its completion back to my process, but the inter-process communication required to accomplish and coordinate this across different machines daunts me a little. I'm hoping to find a local solution before resorting to that.
Is there a solution I've overlooked?
You could instead use scientific workflow software such as FireWorks or Pegasus, which is designed to help scientists submit large numbers of computing jobs to shared or dedicated resources. These tools can do much more, so they might be overkill for your problem, but they are still worth a look.
If your goal is to find the tightest memory requirements for your job, you could also simply submit the job with a large amount of requested memory and then extract the actual memory usage from accounting (qacct), or, cluster policy permitting, log on to the compute node(s) where your job is running and view the memory usage with top or ps.
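For a local solution, the controller loop from the question can be sketched with a thread pool in which each thread blocks on one command and results are handled in completion order. This is a Python sketch rather than Perl, and every name in it is hypothetical: `demo_command` stands in for a blocking `qsub -sync y ...` call and simply fails on its first attempt.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess
import sys

def run_blocking(command):
    """Run one blocking command; True if it exited with status 0."""
    return subprocess.run(command, capture_output=True).returncode == 0

def submit_with_retries(name, make_command, max_tries=3):
    """Submit a job and resubmit on failure, up to max_tries attempts in total."""
    for attempt in range(1, max_tries + 1):
        # make_command may request more memory as the attempt number grows
        if run_blocking(make_command(attempt)):
            return name, attempt, True
    return name, max_tries, False

def demo_command(attempt):
    # Hypothetical stand-in for a blocking 'qsub -sync y ...' submission:
    # exits non-zero on the first attempt and zero from the second one on.
    code = "import sys; sys.exit(0 if %d >= 2 else 1)" % attempt
    return [sys.executable, "-c", code]

def run_all(jobs, workers=100):
    """jobs maps a job name to a function that builds the command for an attempt."""
    finished = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(submit_with_retries, n, f) for n, f in jobs.items()]
        for future in as_completed(futures):  # yields futures in completion order
            finished.append(future.result())
    return finished

outcome = run_all({"job-a": demo_command, "job-b": demo_command}, workers=2)
```

Because `as_completed` yields whichever job finishes first, there is no need to decide in advance which submission to wait on, which is the difficulty described with `recv` in the question.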
I have a question concerning the integration of split(), resequence() together with multithreading. My (naive) routes are looking like this (abbreviated to explain the problem):
from("file:input")
    .process(prioAssign)
    .split(body().tokenize("\n")).streaming()
        .resequence().simple("${in.header.prio}").allowDuplicates().reverse()
        .to("direct:process")
    .end()
    .process(exportProcessor);

from("direct:process")
    .threads(10, 100, "process")
    .process(importProcessor); // takes some time for processing
I would like to accomplish the following:
1. The importProcessor work should be distributed over several threads.
2. The items (coming from the splitter) should be processed by priority (resequenced).
3. The exportProcessor must be triggered only when all split items from one file have been processed.
The problem with the code above is that if I include the resequence step, the export is triggered immediately and the resequencing itself doesn't work. It seems I don't understand the threading model behind Camel.
Thanks a lot in advance for all hints!
Couldn't it be that your prioAssign processor doesn't build a body that can be split later, and so the split ends instantly and everything moves to the exportProcessor?
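Independently of Camel, the three requirements from the question (split, process by priority on several threads, export only after everything is done) can be sketched as below. This is a plain Python illustration of the intended behavior, not Camel code; the line format `<prio>,<payload>` and the names `import_item` and `handle_file` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def import_item(item):
    """Stand-in for importProcessor."""
    prio, text = item
    return text.upper()

def handle_file(content, workers=10):
    # Split: one item per line, formatted as "<prio>,<payload>" (assumed format)
    items = []
    for line in content.splitlines():
        prio, text = line.split(",", 1)
        items.append((int(prio), text))
    # Resequence: highest priority first; the sort is stable, so duplicates keep their order
    items.sort(key=lambda item: item[0], reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(import_item, item) for item in items]
        wait(futures)  # block until every split item has been processed
    # Export: reached only after all items of this file are done
    return [f.result() for f in futures]

exported = handle_file("1,low\n3,high\n2,mid")
```

Note that with several worker threads the priority only controls the order in which items start, not the order in which they finish.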
I'm writing a perl script to run some kind of a pipeline. I start by reading a JSON file with a bunch of parameters in it. I then do some work - mainly building some data structures needed later and calling external programs that generate some output files I keep references to.
I usually use a subroutine for each of these steps. Each such subroutine will usually write some data to a unique place that no other subroutine writes to (i.e. a specific key in a hash) and reads data that other subroutines may have generated.
These steps can take a good couple of minutes if done sequentially, but most of them can be run in parallel with some simple logic of dependencies that I know how to handle (using threads and a queue). So I wonder how I should implement this to allow sharing data between the threads. What would you suggest the framework to be? Perhaps use an object (of which I will have only one instance) and keep all the shared data in $self? Perhaps a simple script (no objects) with some "global" shared variables? ...
I would obviously prefer a simple, neat solution.
Read threads::shared. By default, as perhaps you know, Perl variables are not shared between threads. But if you place the shared attribute on them, they are.
use threads; use threads::shared;
my %repository :shared;
Then, if you want to synchronize access to them, the easiest way is to lock them inside a block:
{
    lock( %repository );
    $repository{JSON_dump} = $json_dump;
}
# %repository is unlocked at the end of the scope.
However, you could use Thread::Queue, which is supposed to be muss-free, and do this as well:
$repo_queue->enqueue( JSON_dump => $json_dump );
Then your consumer thread could just:
my ( $key, $value ) = $repo_queue->dequeue( 2 );
$repository{ $key } = $value;
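Both patterns above, a locked shared structure and a queue feeding a single consumer, look much the same in other languages. The following is a Python sketch of the same two ideas, not Perl; the keys and values (`JSON_dump`, `alignment`, `out.bam`) are illustrative only.

```python
import queue
import threading

repository = {}
repository_lock = threading.Lock()

def store_locked(key, value):
    # Variant 1: every writer takes the lock before touching the shared dict
    with repository_lock:
        repository[key] = value

repo_queue = queue.Queue()
STOP = object()  # sentinel telling the consumer to finish

def consumer():
    # Variant 2: steps only enqueue (key, value) pairs; this thread stores them
    while True:
        item = repo_queue.get()
        if item is STOP:
            break
        key, value = item
        store_locked(key, value)

writer = threading.Thread(target=store_locked, args=("JSON_dump", '{"threads": 4}'))
collector = threading.Thread(target=consumer)
writer.start()
collector.start()
repo_queue.put(("alignment", "out.bam"))  # a pipeline step publishing its result
repo_queue.put(STOP)
writer.join()
collector.join()
```

With the queue variant the pipeline steps never touch the shared structure directly, which keeps the locking in one place.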
You can certainly do that in Perl. I suggest you look at perldoc threads and perldoc threads::shared, as these manual pages best describe the methods and pitfalls encountered when using threads in Perl.
What I would really suggest, provided you can use it, is a queue management system such as Gearman, which has various interfaces including a Perl module. This allows you to create as many "workers" as you want (the subs actually doing the work) and one simple "client" that schedules the appropriate tasks and then collates the results, without resorting to tricks such as task-specific hashref keys.
This approach would also scale better, and you'd be able to have clients and workers (even managers) on different machines, should you choose so.
Other queue systems, such as TheSchwartz, would not be suitable, as they lack the feedback/result mechanism that Gearman provides. In effect, using Gearman this way is much like the threaded system you described, just without the hassles and headaches that any thread-based system may eventually suffer from: having to lock variables, use semaphores, and join threads.