I want to implement a parallel DFS in a tree. We have a (potentially huge) tree of business objects, and the user can search by a textId (or part of the textId). At the moment this is implemented as a serial DFS, which is a huge waste of resources: our customers have 8 logical cores, so there is no need to wait 5 minutes for a search result...
We already have a global ThreadManager. We use it mainly for parallel calculations of grid cells. The ThreadManager keeps track of how many tasks are queued, how many threads are available and starts the next queued task when a new thread is available.
The idea is to use it with a new task class to parallelize the tree search. Of course, I cannot start a new task for every childNode - that would mean hundreds or thousands of queued tasks. But parallelizing only at a high tree level would underuse the cores whenever a task gets only a small subtree.
I have the following idea for the task class:
One object of the task class knows its treeNode and a common result object. The task's "execute" method does the following:
    if result object already has a result:
        return
    match treeNode with the search condition. If match:
        put treeNode's object in the result object
        return
    for each childNode of treeNode:
        if the ThreadManager has a thread available:
            create a new task with childNode and put the task on the queue
        else:
            call "execute" for this childNode directly (in the current thread)
The synchronization of the completed tasks seems complicated. For example, I have to know when the whole tree has been searched but the target textId was not found. But it should be possible to extend the ThreadManager so that it knows whether there are still tasks which belong to this search.
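To make the "not found" case concrete, here is a minimal sketch in Python rather than against the real ThreadManager. It assumes a per-search counter of in-flight tasks: a task is counted before it is queued and uncounted when it finishes, so the counter can only reach zero once the whole tree has been visited (or every task has bailed out after a hit). The names Node, ParallelSearch and pending are made up for illustration, and the standard-library thread pool stands in for the ThreadManager.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical stand-in for a business object in the tree."""
    def __init__(self, text_id, children=()):
        self.text_id = text_id
        self.children = list(children)


class ParallelSearch:
    def __init__(self, needle, max_workers=8):
        self.needle = needle
        self.max_workers = max_workers
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.lock = threading.Lock()
        self.pending = 0                    # tasks queued or running for this search
        self.result = None
        self.found = threading.Event()      # lets other tasks bail out early
        self.all_done = threading.Event()   # set when pending drops to zero

    def _try_reserve(self):
        # "Is a thread available?" Approximated by capping in-flight tasks.
        with self.lock:
            if self.pending < self.max_workers:
                self.pending += 1
                return True
            return False

    def _task(self, node):
        try:
            self._visit(node)
        finally:
            with self.lock:
                self.pending -= 1
                if self.pending == 0:
                    self.all_done.set()

    def _visit(self, node):
        # Plain recursion: very deep trees would need an explicit stack.
        if self.found.is_set():
            return
        if self.needle in node.text_id:     # partial textId match
            with self.lock:
                if self.result is None:
                    self.result = node
            self.found.set()
            return
        for child in node.children:
            if self.found.is_set():
                return
            if self._try_reserve():
                self.pool.submit(self._task, child)
            else:
                self._visit(child)          # no free slot: search in this thread

    def run(self, root):
        with self.lock:
            self.pending = 1                # count the root task before queuing it
        self.pool.submit(self._task, root)
        self.all_done.wait()                # returns for "found" and "not found" alike
        self.pool.shutdown(wait=True)
        return self.result
```

Because a child is counted before its parent task finishes, the counter cannot hit zero while work is still outstanding, and run() returns None exactly when the tree was exhausted without a match.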
Does someone have experience with that kind of algorithm? Will the synchronization overhead be too much to be worth it? Are there other pitfalls I am not seeing?
This discussion is similar: Depth first search in parallel
where "stack" or "global worklist" is my "ThreadManager". Did I get this right?
Thank you!
Related
I'm building a multithreaded web crawler.
I launch a thread that gets the first n href links and parses some data. It should then add those links to a Visited list that other threads can access, and add the data to a global map that will be printed when the program is done. Then the thread launches n new threads, all doing the same thing.
How can I set up a global list of visited sites that all threads can access, and a global map that all threads can also write to?
You can't share data between processes. That doesn't mean that you can't share information.
The usual way is to use a special process (a server) in charge of this job: maintaining a state, in your case the list of visited links.
Another way is to use ETS (or Mnesia, the database built upon ETS), which is designed to share information between processes.
Just to clarify, Erlang/Elixir uses processes rather than threads.
Given a list of elements, a generic approach:
An empty list called processed is saved to ets, dets, mnesia or some DB.
The new list of elements is filtered against the processed list so the Task is not unnecessarily repeated.
For each element of the filtered list, a task is run (which in turn spawns a process) and does some work on the element, returning a map of the required data. In the Task module, Task.async/1 and Task.yield_many/2 could be useful.
Once all the tasks have returned or yielded:
    all the maps, or parts of the data in the maps, are merged and can be persisted if/as required/appropriate.
    the elements whose tasks did not crash or time out are added to the processed list in the DB.
Tasks which crash or timeout could be handled differently.
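The steps above can be sketched roughly as follows. This is a Python analogue using threads, not Elixir processes or ETS, so treat it only as an illustration of the filter / run / merge / record flow. process_batch, work and the in-memory processed set are hypothetical names; a real version would keep processed in ETS, DETS, Mnesia or a database as described above.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def process_batch(elements, processed, work, timeout=5.0, max_workers=4):
    """Run work(element) for every not-yet-processed element.

    Returns (merged_data, failed_elements) and adds the successful
    elements to `processed` (a plain set in this sketch).
    """
    # Filter against the processed list (and drop duplicates, keeping order).
    todo = [e for e in dict.fromkeys(elements) if e not in processed]
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, e): e for e in todo}
        done, _not_done = wait(list(futures), timeout=timeout)
        for fut in done:
            if fut.exception() is None:        # task did not crash
                merged.update(fut.result())    # merge the returned map
                processed.add(futures[fut])
        # Note: unlike Elixir tasks, Python threads cannot be killed, so
        # leaving the `with` block still waits for timed-out work to end.
    failed = [e for e in todo if e not in processed]
    return merged, failed
```

Elements that crashed or timed out come back in the second return value, so the caller can retry or log them separately.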
I'm trying to figure out how, using the SWF Flow framework, I can have my activity worker poll multiple task lists. The use case is having two different priorities for activity tasks that need to be completed.
Bonus points if someone uses glisten and can point out a way to achieve that.
Thanks!
It is not possible for a single ActivityWorker to poll on multiple task lists. The reason for this design is that each poll request can take up to a minute due to long polling. If several such polls feed into a single-threaded activity implementation, it is not clear how to deal with the conflicts that arise when tasks are received on multiple task lists.
Until SWF natively supports priority task lists, the solution is to instantiate one ActivityWorker per task list (priority) and deal with conflicts yourself.
I'm using a ThreadPoolExecutor for some tasks. I need to know whether ThreadPoolExecutor has any method to find out how many tasks remain in its queue. Is that possible? Depending on the return value I'll assign tasks again; I don't know how many tasks were assigned earlier.
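Use the ThreadPoolExecutor's getQueue().size() to get the number of tasks waiting in the queue. This is an approximate value, as tasks can be consumed at any moment and the number can change. (getTaskCount(), by contrast, returns the approximate total number of tasks that have ever been scheduled for execution, including running and completed ones.)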
I have a partially ordered set of tasks, where for each task all of the tasks that are strictly before it in the partial order must be executed before it can be executed. I want to execute tasks which are not related (neither before nor after one another) concurrently to try to minimise the total execution time - but without starting a task before its dependencies are completed.
The tasks will run as (non-perl) child processes.
How should I approach solving a problem like this using Perl? What concurrency control facilities and data structures are available?
I would use a hash of arrays. For each task, all its prerequisites will be listed in the corresponding array:
$prereq{task1} = [qw/task2 task3 task4/];
I would keep completed tasks in a different hash, and then just
my @prereq = @{ $prereq{$task} };
if (@prereq == grep exists $completed{$_}, @prereq) {
    run($task);
}
Looks like a full solution is NP-complete.
As for a partial solution, I would use some form of reference counting to determine which jobs are ready to run, Forks::Super::Job to run the background jobs and check their statuses, and POSIX::pause to sleep when the maximum number of jobs has been spawned.
No threads are involved since you're already dealing with separate processes.
Read the first link for possible algorithms/heuristics to determine runnable jobs' priorities.
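The reference-counting idea can be sketched like this. It is written in Python rather than Perl, purely as an illustration: each task keeps a count of unfinished prerequisites, and a task is started as soon as its count drops to zero. run_in_dependency_order and the run callback are hypothetical names; in the child-process setting described in the question, run could wrap something like subprocess.run.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_in_dependency_order(prereq, run, max_workers=4):
    """prereq maps task -> list of tasks that must finish first.

    Runs independent tasks concurrently and returns the completion order.
    """
    tasks = set(prereq) | {p for ps in prereq.values() for p in ps}
    # Reference count: number of unfinished prerequisites per task.
    remaining = {t: len(set(prereq.get(t, ()))) for t in tasks}
    dependents = {t: [] for t in tasks}
    for task, ps in prereq.items():
        for p in set(ps):
            dependents[p].append(task)

    order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {pool.submit(run, t): t for t in tasks if remaining[t] == 0}
        while running:
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                task = running.pop(fut)
                fut.result()                 # re-raise if the task failed
                order.append(task)
                for d in dependents[task]:
                    remaining[d] -= 1
                    if remaining[d] == 0:    # all prerequisites completed
                        running[pool.submit(run, d)] = d
    if len(order) != len(tasks):
        raise ValueError("dependency cycle: some tasks can never start")
    return order
```

This only guarantees that no task starts before its prerequisites finish. It makes no attempt to pick the ordering that minimises total run time, which is the hard part referred to above.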
I have n tasks in a waiting list.
Each task has associated with it an entry that contains some meta information:
Task1 A,B
Task2 A
Task3 B,C
Task4 A,B,C
And an associated hashmap that contains entries like:
A 1
B 2
C 2
This implies that if a task whose meta information contains A is already running, no other task containing A can run at the same time.
However, since B has a limit of 2 tasks, either task1 and task3 can run together, or task3 and task4.
But task1, task3 and task4 cannot all run together, since the limits of both A and B would be violated, even though the limit of C would not.
If I need to select tasks to run in different threads, what logic/algorithm would you suggest? And when should this logic be invoked? I view the task list as a shared resource which might need to be locked when tasks are selected to run from it. Right now, I think this logic might have to be invoked when a task is added to the list and also when a running task has completed. But this could block the addition of new elements to the list, unless I make a copy of the list before running the logic.
How would your logic change if I were to give higher priority to tasks that contain more entries, like 'A,B,C', than to those like 'A,B'?
This is kind of a continuation of "Choosing a data structure for a variant of producer consumer problem" and "How to access the underlying queue of a ThreadpoolExecutor in a thread safe way", just in case anyone is wondering about the background of the problem.
Yes, this is nasty. I immediately thought of an array/list of semaphores, initialized from the hashmap from which any thread attempting to execute a task would have to get units as defined by the metadata. About a second later, I realized that such a design would deadlock pretty quick!
I think that one dedicated producer thread is going to have to iterate a 'readyJobs' list in an attempt to find a task that can execute with the resources currently available. It could do this both when new tasks become available and after a task has completed, thereby releasing resources. The producer thread could wait on one input queue (a thread-safe producer-consumer queue), to which both new tasks from [wherever] and completed tasks from the work threads are queued (a callback fired by the work threads pushes the completed task to the input queue?). Adding new tasks might be blocked briefly, but only while the input queue is blocked by some other task being added.
In the case of assigning 'priorities', you could insert-sort the 'readyJobs' list as you wish, so that higher-priority tasks are checked first to see if they can run with the resources available. If they cannot, then the rest of the list is iterated and a lower-priority job might be able to run.
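A minimal sketch of the selection step that such a producer thread might run, in Python for brevity. It assumes only the producer thread touches ready_jobs and in_use, so no locking is shown. select_runnable, release and in_use are made-up names, and "more entries first" is used as the priority rule from the question.

```python
def select_runnable(ready_jobs, limits, in_use):
    """Pick every ready job that fits within the remaining resource limits.

    ready_jobs: list of (name, tags) tuples; started jobs are removed from it.
    limits:     dict tag -> maximum number of concurrent tasks using that tag.
    in_use:     dict tag -> number of running tasks using that tag (updated).
    """
    started = []
    # Jobs with more tags are checked first; sorted() is stable, so jobs
    # with the same number of tags keep their arrival order.
    for job in sorted(ready_jobs, key=lambda j: len(j[1]), reverse=True):
        name, tags = job
        if all(in_use.get(t, 0) < limits[t] for t in tags):
            for t in tags:
                in_use[t] = in_use.get(t, 0) + 1
            ready_jobs.remove(job)
            started.append(name)
    return started


def release(tags, in_use):
    """Give back the resources held by a completed job."""
    for t in tags:
        in_use[t] -= 1
```

The producer would call select_runnable each time it takes a new task or a completion message off its input queue, calling release first in the completion case.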
I hope that you do not want to 'preempt' lower-priority tasks so as to release resources early - that would get really, really messy :(
Rgds,
Martin