TBB Flow Graph Nodes Running on Main Thread - multithreading

Is there a way to constrain nodes in the flow graph to run on the main thread? I have a big dependency graph where only some of the nodes can run on other threads. Instead of picking out the nodes that can run on other threads, I want to feed the entire graph to TBB and then mark some of the nodes to run on the main thread.

Related

Configure number of threads in Apache Samza

Apache Samza's documentation states that it can be run with multiple threads per worker:
Threading model and ordering
Samza offers a flexible threading model to run each task. When running your applications, you can control the number of workers needed to process your data. You can also configure the number of threads each worker uses to run its assigned tasks. Each thread can run one or more tasks. Tasks don’t share any state - hence, you don’t have to worry about coordination across these threads.
From my understanding, this means Samza uses the same architecture as Kafka Streams, i.e. tasks are statically assigned to threads. I think a reasonable choice would be to set the number of threads more or less equal to the number of CPU cores. Does that make sense?
I am now wondering how the number of threads can be configured in Samza. I found the option job.container.thread.pool.size. However, it reads like this option does something different, which is running operations of tasks in parallel (which could impair ordering (?)). It also confuses me that the default value is 0 instead of 1.

Vulkan Compute dispatch from a child CPU thread

Can Vulkan Compute dispatch from a child CPU thread, or does it have to dispatch from the main thread? I don't think it is possible to dispatch compute shaders in Unity from child threads, and I wanted to find out if it could be done in Unreal Engine.
It depends on what you mean by "dispatch" and "main thread".
vkCmdDispatch, as the "Cmd" prefix suggests, puts a command in a command buffer. This can be called on any thread, so long as the VkCommandBuffer object will not have other vkCmd functions called on it at the same time (typically, you reserve specific command buffers for a single thread). So by one definition, you can "dispatch" compute operations from other threads.
Of course, recording commands in a command buffer doesn't actually do anything. Commands only get executed when you queue up those CBs via vkQueueSubmit. Like vkCmdDispatch, it doesn't matter what thread you call that function on. However, like vkCmdDispatch, it does matter that multiple threads be prevented from accessing the same VkQueue object at the same time.
Now, you don't have to use a single thread for that VkQueue; you can lock the VkQueue behind some kind of mutex, so that only one thread can own it at a time. And thus, a thread that creates a CB could submit its own work.
However, ignoring the fact that tasks often need to be inserted into the queue in an order (one task might generate some compute data that a graphics task needs to wait on, so the graphics task CB must be after the compute CB), there's a bigger problem. vkQueueSubmit takes a long time. If you look at the function, it can take an arbitrarily large number of CBs to insert, and it has the ability to have multiple batches, with each batch guarded by semaphores and fences for synchronization. As such, you are strongly encouraged to make as few vkQueueSubmit calls as possible, since each call has a quantity of overhead to it that has nothing to do with how many CBs you are queuing up.
There's even a warning about this in the spec itself.
So the typical way applications are structured is that you farm out tasks to the available CPU threads, and these tasks build command buffers. One particular thread will be anointed as the owner of the queue. That thread may perform some CB building, but once it is done, it will wait for the other tasks to complete and gather up all of the CBs from the other threads. Once gathered, that thread will vkQueueSubmit them in appropriate batches.
You could call that thread the "main thread", but Vulkan itself doesn't really care which thread "owns" the queue. It certainly doesn't have to be your process's initial thread.
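The structure described above can be sketched without any Vulkan calls. This is only an illustration under stated assumptions: CommandList stands in for a recorded VkCommandBuffer, FakeQueue::submit stands in for a single vkQueueSubmit call, and all names here are hypothetical rather than part of the Vulkan API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a recorded VkCommandBuffer: just a list of command names.
struct CommandList {
    std::vector<std::string> commands;
};

// Stand-in for a VkQueue; submit() plays the role of one vkQueueSubmit call.
struct FakeQueue {
    int submit_calls = 0;
    std::size_t lists_submitted = 0;
    void submit(const std::vector<CommandList>& batch) {
        ++submit_calls;
        lists_submitted += batch.size();
    }
};

// Each worker records into its own list, so no list is touched by two threads.
inline void record_in_parallel(std::vector<CommandList>& lists) {
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        workers.emplace_back([&lists, i] {
            lists[i].commands.push_back("dispatch from worker " + std::to_string(i));
        });
    }
    // The queue-owning thread waits until all recording has finished.
    for (auto& w : workers) w.join();
}

// The owner gathers every recorded list and submits them in one call.
inline FakeQueue build_and_submit(std::size_t worker_count) {
    std::vector<CommandList> lists(worker_count);
    record_in_parallel(lists);
    FakeQueue queue;
    queue.submit(lists);
    return queue;
}
```

Calling build_and_submit(4) records on four threads but touches the queue from one thread only, and only once.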

How to run binary executables in multi-thread HPC cluster?

I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and limits the memory to less than 2 GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated memory?
The scheduler just runs the binary provided by you on the first node allocated. The onus of splitting the job and running it in parallel is on the binary. Hence, you see that you are using one core out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary that you are submitting as a job to the cluster has some mechanism to understand the nodes that are allocated (interaction with the Job Scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS etc.).
If it is parallelized, submitting the binary through a job submission script (through a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
Running black box serial binaries in parallel
If not, the only other possible workload distribution mechanism across the resources is the data parallel mode, wherein, you use the cluster to supply multiple inputs to the same binary and run the processes in parallel to effectively reduce the time taken to solve the problem.
You can set the granularity based on the memory required for each run. For example, if each process needs 1GB of memory, you can run 16 processes per node (with assumed 16 cores and 16GB memory etc.)
The parallel submission of multiple inputs on a single node can be done through the tool Parallel. You can then submit multiple jobs to the cluster, with each job requesting 1 node (exclusive access and the parallel tool) and working on different input elements respectively.
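The granularity calculation and the split of inputs across single-node jobs can be sketched as below. This is a planning helper only, assuming whole-GB memory figures; slots_per_node and split_inputs are hypothetical names, and the sketch does not launch anything or talk to a scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of processes to run side by side on one node: limited by the core
// count and by how many copies fit into the node's memory.
inline std::size_t slots_per_node(std::size_t cores, std::size_t node_mem_gb,
                                  std::size_t mem_per_process_gb) {
    if (mem_per_process_gb == 0) return cores;  // no memory limit given
    const std::size_t by_memory = node_mem_gb / mem_per_process_gb;
    return cores < by_memory ? cores : by_memory;
}

// Deal the input files out round-robin, one batch per single-node job.
inline std::vector<std::vector<std::string>> split_inputs(
        const std::vector<std::string>& inputs, std::size_t jobs) {
    if (jobs == 0) return {};
    std::vector<std::vector<std::string>> batches(jobs);
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        batches[i % jobs].push_back(inputs[i]);
    }
    return batches;
}
```

With the figures from the example above, slots_per_node(16, 16, 1) gives 16; each batch returned by split_inputs would then be handed to one job that runs that many processes at a time.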
If you do not want to launch 'n' separate jobs, you can use the mechanisms provided by the scheduler, like blaunch, to specify dynamically the machine on which the job is supposed to run. You can parse the names of the machines allocated by the scheduler and then use a blaunch-like script to emulate the submission of n jobs from the first node.
Note: This class of applications is better off being run on a cloud-like setup instead of typical HPC systems [effective utilization of the cluster at all the levels of available parallelism (cluster, thread and SIMD) is a key part of HPC.]

should I use memory fences in a supervisor-workers model?

I am building multithreading support for my application.
In my application, it can happen that a worker needs to access the "work field" of another worker to complete its own job. I have tried to make this safe with pthread mutexes, but they turned out to be horribly slow, even when there is only one worker and so no contention.
So, I came up with another idea. Let a worker complete its job where it can, and then add to a (per-worker, own) queue the jobs that have the aforementioned problem: when all the workers are done, the main supervisor thread will complete the unfinished jobs, in the hope that they will be orders of magnitude fewer than the number of jobs completed by the workers.
My question is: should I throw in a memory fence, at the moment that I transfer the execution from the supervisor to the workers and vice-versa?
EDIT:
more details (the code is on github, see pool::collision_wsc()). Each thread reads pointers from various "cells" (which are basically a std::vector), and applies some operation on the objects pointed (collision between hard spheres).
The point is that a cell interacts with (some of) its neighbours, but some of these cells might be in the ownership of another worker (one sphere might be near the bounds of a cell, and collide with one of another cell).
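The deferred-queue scheme described above might look roughly like this. It is a minimal sketch, not the code from the repository: Job, run_jobs and the integer "cells" are hypothetical stand-ins, and it assumes the handoff between supervisor and workers happens through std::thread creation and join(). In standard C++ the completion of the std::thread constructor synchronizes with the start of the thread function, and the end of the thread synchronizes with the return of join(), so this particular sketch needs no explicit fence. Whether the same holds for the real code depends on how its handoff is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical job: add `amount` to cell number `cell`.
struct Job {
    std::size_t cell;
    int amount;
};

// Worker w owns cells [w * cells_per_worker, (w + 1) * cells_per_worker).
// Every job.cell is assumed to be a valid index into the cell array.
inline std::vector<int> run_jobs(const std::vector<std::vector<Job>>& jobs_per_worker,
                                 std::size_t cells_per_worker) {
    const std::size_t workers = jobs_per_worker.size();
    std::vector<int> cells(workers * cells_per_worker, 0);
    std::vector<std::vector<Job>> deferred(workers);  // one private queue per worker

    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t first = w * cells_per_worker;
            const std::size_t last = first + cells_per_worker;
            for (const Job& job : jobs_per_worker[w]) {
                if (job.cell >= first && job.cell < last) {
                    cells[job.cell] += job.amount;  // owned cell: no lock needed
                } else {
                    deferred[w].push_back(job);     // foreign cell: leave it for the supervisor
                }
            }
        });
    }
    // join() orders every worker's writes before the supervisor's reads below.
    for (auto& t : threads) t.join();

    // The supervisor finishes the leftover jobs single-threaded.
    for (const auto& queue : deferred) {
        for (const Job& job : queue) cells[job.cell] += job.amount;
    }
    return cells;
}
```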

Threading model for data processing in a directed graph

I'm going to be designing a simple data analysis tool which processes different kinds of data through a directed graph. The directed graph is somewhat customizable by the user. Each node will consist of logging, analysis, and mathematical operations on the data passing through. The graph is similar in many ways to a neural network except with additional processing at each node. Some nodes do simple operations to the data elements passing through while other nodes have complex algorithms.
How do I multithread the processing in this directed graph such that I can get the result out of the graph in the fastest and most efficient way? Memory is not an issue here, and neither is the time it takes to initialize this task.
I've thought of a couple different methods to multithread the work:
Each thread instance 'follows' each data element entering the start node in this graph. The thread will stay with this data element as it passes through each node, calling the processing method on each node all the way down the tree. This will essentially require a single thread per data element entering the system. Of course, once the data element has been carried through the entire system, the thread will be recycled. The problem here is when two outgoing edges on a node exist--the thread would need to follow both (does this mean pull a new thread from a thread pool?).
Create a thread per node and create a data buffer on each graph edge. The worker thread on each node will continually check its incoming buffers, which hold data in case one thread takes longer with its data than another. The problem with this approach is the inherent 'polling' of the buffer for having enough data to start processing it--perhaps a small price to pay for simplifying the data flow for any graph configuration.
Can anyone think of a better way, or which one do you recommend? I'm looking for the least latency through the system and the ability to constantly process a stream of incoming data.
Thanks!
Brett
First of all, it is not a good idea to spawn an unlimited number of threads (e.g. a thread per node). Usually you want to have at most 1.5-3 times as many threads as CPU cores (e.g. 6-12 threads for a quad-core).
I would recommend using a thread pool and tasks. In that case your problem can be rephrased as what size your tasks should have.
Both of the methods you mentioned are valid and each has its own pros and cons.
One task per data input should be easy to implement, as the algorithm for graph processing will stay single-threaded. The overhead of context-switching, synchronization and data-passing between threads is almost none.
When there are two outgoing edges on a node, then this single task has to follow both of them. This is a standard part of all algorithms for graph traversal, e.g. depth-first search or breadth-first search.
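A minimal sketch of the one-task-per-input approach, assuming a graph without nodes that have several incoming edges, node 0 as the start node, and stateless node operations. Node, walk and process_all are hypothetical names. A fixed set of threads pulls inputs from a shared counter, which plays the role of the thread pool.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Node {
    std::function<int(int)> process;      // hypothetical per-node operation
    std::vector<std::size_t> successors;  // outgoing edges
};

// Depth-first walk: the same task follows every outgoing edge in turn.
// Values that reach a node without successors are collected as outputs.
inline void walk(const std::vector<Node>& graph, std::size_t node, int value,
                 std::vector<int>& outputs) {
    const int result = graph[node].process(value);
    if (graph[node].successors.empty()) {
        outputs.push_back(result);
        return;
    }
    for (std::size_t next : graph[node].successors) {
        walk(graph, next, result, outputs);
    }
}

// One task per input: each input is carried through the whole graph by one thread.
inline std::vector<std::vector<int>> process_all(const std::vector<Node>& graph,
                                                 const std::vector<int>& inputs,
                                                 std::size_t thread_count) {
    std::vector<std::vector<int>> outputs(inputs.size());
    std::atomic<std::size_t> next_input{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t i = next_input.fetch_add(1);
            if (i >= inputs.size()) return;
            walk(graph, 0, inputs[i], outputs[i]);  // each task writes only its own slot
        }
    };
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < thread_count; ++t) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return outputs;
}
```

Because each task writes only to its own output slot, the graph traversal itself needs no synchronization.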
One task per graph node can improve the latency when your graphs have many "branches" that can be processed in parallel. However, this approach requires a more complex design of graph processing, and there will be higher overhead from thread synchronization. The cost of multi-threading might actually be higher than the benefit gained by processing the graph in parallel.
When there are two outgoing edges on a node, you can create two new tasks and queue them on the thread pool. (Or queue one task and continue with processing the other one.)
The more difficult problem is when there are two incoming edges on a node. The task processing the node will have to wait until data for both edges are available.
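One common way to handle such a node without parking a thread on it is a countdown: each incoming edge stores its value and decrements an atomic counter, and whichever predecessor brings the counter to zero runs the node. This swaps waiting for a "last arrival fires" rule and is only a sketch; JoinNode and run_join are hypothetical names, and the node operation is a plain sum.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A node with two incoming edges. Each predecessor writes its own slot and
// then decrements `pending`; the one that sees the old value 1 arrived last.
struct JoinNode {
    int left = 0;
    int right = 0;
    std::atomic<int> pending{2};
    int result = 0;

    void fire() { result = left + right; }  // hypothetical node operation

    void deliver_left(int value) {
        left = value;
        // fetch_sub is a read-modify-write with sequentially consistent ordering,
        // so the last arrival also sees the slot written by the other thread.
        if (pending.fetch_sub(1) == 1) fire();
    }

    void deliver_right(int value) {
        right = value;
        if (pending.fetch_sub(1) == 1) fire();
    }
};

// Two predecessor tasks feed the join node from separate threads.
inline int run_join(int a, int b) {
    JoinNode node;
    std::thread t1([&] { node.deliver_left(a * 2); });
    std::thread t2([&] { node.deliver_right(b + 1); });
    t1.join();
    t2.join();
    return node.result;
}
```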
Conclusion: I would personally start with the first option (one task per data input) and see how far you can get with it.

Resources