Configure number of threads in Apache Samza - multithreading

Apache Samza's documentation states that it can be run with multiple threads per worker:
Threading model and ordering
Samza offers a flexible threading model to run each task. When running your applications, you can control the number of workers needed to process your data. You can also configure the number of threads each worker uses to run its assigned tasks. Each thread can run one or more tasks. Tasks don’t share any state - hence, you don’t have to worry about coordination across these threads.
From my understanding, this means Samza uses the same architecture as Kafka Streams, i.e. tasks are statically assigned to threads. I think a reasonable choice would be to set the number of threads more or less equal to the number of CPU cores. Does that make sense?
I am now wondering how the number of threads can be configured in Samza. I found the option job.container.thread.pool.size. However, it reads like this option does something different, which is running operations of tasks in parallel (which could impair ordering (?)). It also confuses me that the default value is 0 instead of 1.

Related

About multithreading, concurrency, and parallelism

Recently I had confusion with understanding concepts: multithreading, concurrency, and parallelism. In order to reduce confusion, I've tried to organize my understanding about these and drawn my conclusion. My question is,
Is there a misunderstanding or something wrong from conclusion below?
References I took can be found here.
1. Concurrency and parallelism are different level of category.
It is not either concurrent or parallelism. It is either concurrent or not, and either parallel or not.
For example,
Not concurrent (Sequential) / Not parallel
Not concurrent (Sequential) / Parallel
Concurrent / Not parallel
Concurrent / Parallel
2. Parallelism is not subset of concurrency.
3. How does threading or multithreading relates to concurrency and parallelism?
Definition of thread clarifies this. Thread is "unit of execution flow". This "execution flow" can be managed independently by a scheduler, which is typically a part of the operating system.
Having a thread means having one unit of execution flow.
Having multiple threads (Multithreading) means having multiple units of execution flow.
And,
Having multiple units of execution flow is having multiple things making progress, which is definition of concurrency.
And,
Multiple units of execution flow is done by time slicing in single core hardware environment.
Multiple units of execution flow is done in parallel in multi core hardware environment.
4. Is multithreading concurrent or parallel?
Multithreading, or having multiple units of execution flow, is concurrent.
Multithreading itself is just having multiple units of execution flow. This has nothing to do with parallelism.
How operating system deals with multiple units of execution flow relates to parallelism.
Parallelism is achieved by operating system and hardware environment.
"Code can be concurrent, but not parallel." (Parallelism implies concurrency but not the other way round right?, stackexchange)
Detailed description will be truly appreciated.
Parallelism Refers to any system in which a single application can make use of more computing hardware than a single CPU can provide. There are a number of different types of parallel computing architecture, but when people say "parallelism" they often are talking about one in particular...
...A Symmetric MultiProcessing (SMP) system is a computer with one memory system, and two or more traditional CPUs that have equal access to it. Most modern workstations, most mobile devices, and many server systems* are SMP.
Multithreading is a model of concurrent computing.** A computer scientist might tell you that two threads run concurrently when the order in which the operations they perform are interleaved is not strictly determined by the program itself. A software developer is more likely to say that two threads run concurrently with each other when both threads have been started and neither of them has finished.
One way to achieve parallelism in an application running on an SMP system is to use multiple concurrent threads.
* Some servers are NUMA, which is a close cousin to SMP. In a NUMA system, the CPUs all access the same memory system, just like in SMP, except that each CPU "owns" part of the physical memory space, and it can access its own memory locations more quickly than it can access memory locations that are owned by other CPUs.
** There are other models of concurrent computing. Some, such as Actors, are used in production software. Others are mostly of academic interest.

What is the difference between CPU core's thread and any software's application thread?

I have a web application which supports multi threading in which we can run async tasks simultaneously on different thread. I understood what that thread mean.
Now suppose the server on which the application is running has multiple cores CPU with hyper threading enabled.
Now, how my application is supposed to take advantage of these threads. Is there any relation between these two which I am missing.
What i understand from CPU's threads is that
A thread is a single line of commands that are getting processed, each application has at least one thread, most have multiples. A core is the physical hardware that works on the thread. In general a processor can only work on one thread per core, CPUs with hyper threading can work on up to two threads per core.
For processors with hyper threading, there are extra registers and execution units in the core so it can store the state of two threads and work on them both, normally to change threads you have to empty the registers into the cache, write that back to the main memory, then load up the cache with the new values and load up the registers, context switches hurt performance significantly.
But when you have too much backgrounds tasks running, how they are utilizing just limited number of core's threads (i.e. 2 to 8).
PS: I have already checked What is the difference between a process and a thread? and not looking for definition of process. So its not a duplicate.
If you are making use of multiple cores in your program, then the os will schedule which cores run which threads and will take many factors into account, including other processes running, what exactly your code is trying to do, and much more. In regards to async tasks, these may not necessarily be running on a different thread or core, they may be tasks that are not instantaneous, so some scheduler may decide to start doing other things until there is a signal that the async task is complete. It will vary widely depending on the language you are programming the web application in, and the implementation.

Number of CPUs per Task in Spark

I don't quite understand spark.task.cpus parameter. It seems to me that a “task” corresponds to a “thread” or a "process", if you will, within the executor. Suppose that I set "spark.task.cpus" to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" here. So where/how does Spark eventually allocate more than one cpu to a task in the standalone mode?
To the best of my knowledge spark.task.cpus controls the parallelism of tasks in you cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1 then you will have #spark.cores.max number of concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your task spawns two threads, interacts with external tools, etc.) By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2 Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).

How to control the number of threads/cores used?

I am running Spark on a local machine, with 8 cores, and I understand that I can use "local[num_threads]" as the master, and use "num_threads" in the bracket to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I required. For example, if I only specify 1 thread for Spark, by using the top command on Linux, I can still observe that the cpu usage is often more than 100% and even 200%, implying that more than 1 threads are actually used by Spark.
This may be a problem if I need to run multiple programs concurrently. How can I control the number of threads/cores used strictly by Spark?
Spark uses one thread for its scheduler, which explains the usage pattern you see it. If you launch n threads in parallel, you'll get n+1 cores used.
For details, see the scheduling doc.

Multiple pre-emptive thread pools in TBB

We have a requirement to create a number of real time processing chains one running at n Hz and others running at x, y and z Hz in a single process. Where x, y and z are some multiple (not necessarily a simple multiple) of n. For example one chain running at 1Hz and others running at 3Hz and 4Hz. Each processing chain needs to make use of TBB to parallelize some of their computations etc. and so needs to have a pool of TBB worker threads that match the number of hardware processors, but the higher frequency chains need to pre-empt the lower frequency chains in order for the system to work. How can this be achieved using TBB?
It seems from the documentation that it is not possible to create competing pools of TBB workers that will pre-empt each other, since TBBs task groups etc. seem to share a single pool of real threads and there does not appear to be anyway to set the real system priority of a TBB worker thread, or am I missing something?
First, TBB is not designed for real-time systems. Though, the requirement of a few Hz looks relaxed quite enough.
Second, TBB is not designed to serve as a thread pool. It suggests a user-level non-preemptive task scheduler instead of preemptive threads scheduling on OS-level. TBB supposes that tasks are finite and small enough to provide points where the scheduler can switch between them more efficiently than OS can switch execution contexts of threads. Thus except for some special cases, it does not make a sense to request more worker threads then HW provides (i.e. to oversubscribe) and so thread-level priorities do not make sense either with TBB.
Though if you are not convinced by the arguments above and want to try your design, there is a way using two community-preview features: task_arena for work isolation and task_scheduler_observer extensions to assign system priorities for worker thread entering an arena.
TBB task priority feature looks like a more native approach for TBB to the described requirements. Though, it is limited to only 3 levels of priorities, the priority of a task_group_context can be changed dynamically which can be used to reorder tasks on the fly.

Resources