Number of CPUs per Task in Spark - multithreading

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose I set spark.task.cpus to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?

To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, you will have spark.cores.max concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never exceed 10. This means that you never go above your initial contract (defined by spark.cores.max).
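The arithmetic above can be sketched as a quick check. This is a hypothetical helper for illustration only, not anything in the Spark API; it just shows the floor division the scheduler effectively performs:

```python
def concurrent_tasks(cores_max: int, task_cpus: int) -> int:
    """How many Spark tasks can run at once given spark.cores.max
    and spark.task.cpus: the scheduler effectively floor-divides."""
    return cores_max // task_cpus

# With spark.cores.max=10 and spark.task.cpus=2, only 5 tasks run at once,
# so 5 tasks * 2 internal threads each never exceeds the 10-core contract.
print(concurrent_tasks(10, 2))  # 5
print(concurrent_tasks(10, 1))  # 10
```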

Related

Configure number of threads in Apache Samza

Apache Samza's documentation states that it can be run with multiple threads per worker:
Threading model and ordering
Samza offers a flexible threading model to run each task. When running your applications, you can control the number of workers needed to process your data. You can also configure the number of threads each worker uses to run its assigned tasks. Each thread can run one or more tasks. Tasks don’t share any state - hence, you don’t have to worry about coordination across these threads.
From my understanding, this means Samza uses the same architecture as Kafka Streams, i.e. tasks are statically assigned to threads. I think a reasonable choice would be to set the number of threads more or less equal to the number of CPU cores. Does that make sense?
I am now wondering how the number of threads can be configured in Samza. I found the option job.container.thread.pool.size. However, it sounds as if this option does something different, namely running the operations of tasks in parallel (which could impair ordering?). It also confuses me that the default value is 0 instead of 1.

SLURM nodes, tasks, cores, and cpus

Would someone be able to clarify what each of these things actually is? From what I gathered, nodes are computing points within the cluster, essentially a single computer. Tasks are processes that can be executed either on a single node or on multiple nodes. And cores are the number of CPU cores on a single node that you want allocated to executing the task assigned to that node. Is this correct? Am I confusing something?
The terms can have different meanings in different contexts, but if we stick to a Slurm context:
A (compute) node is a computer part of a larger set of nodes (a cluster). Besides compute nodes, a cluster comprises one or more login nodes, file server nodes, management nodes, etc. A compute node offers resources such as processors, volatile memory (RAM), permanent disk space (e.g. SSD), accelerators (e.g. GPU) etc.
A core is the part of a processor that does the computations. A processor comprises multiple cores, as well as a memory controller, a bus controller, and possibly many other components. In the Slurm context, a processor is referred to as a socket, which is actually the name of the slot on the motherboard that hosts the processor. A single core can have one or two hardware threads. This is a technology that virtually doubles the number of cores the operating system perceives while only duplicating part of the core components -- typically those related to memory and I/O, not the computation components. Hardware multi-threading is very often disabled in HPC.
A CPU, in a general context, refers to a processor, but in the Slurm context a CPU is a consumable resource offered by a node. It can refer to a socket, a core, or a hardware thread, depending on the Slurm configuration.
The role of Slurm is to match those resources to jobs. A job comprises one or more (sequential) steps, and each step has one or more (parallel) tasks. A task is an instance of a running program, i.e. a process, possibly along with subprocesses or software threads.
Multiple tasks are dispatched on possibly multiple nodes depending on how many cores each task needs. The number of cores a task needs depends on the number of subprocesses or software threads in the instance of the running program. The idea is to map each software thread to one core, and to make sure that each task has all of its assigned cores on the same node.
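A minimal sketch of that matching logic, for illustration only (this is not how Slurm is actually implemented): given nodes with a fixed core count and tasks that each need a fixed number of cores, each task's cores must all come from a single node:

```python
def tasks_per_node(cores_per_node: int, cpus_per_task: int) -> int:
    """How many tasks fit on one node when each task's cores
    must all be allocated on the same node."""
    return cores_per_node // cpus_per_task

def nodes_needed(ntasks: int, cores_per_node: int, cpus_per_task: int) -> int:
    """Minimum number of nodes for ntasks tasks at cpus_per_task cores each."""
    per_node = tasks_per_node(cores_per_node, cpus_per_task)
    if per_node == 0:
        raise ValueError("a single task cannot span multiple nodes")
    return -(-ntasks // per_node)  # ceiling division

# 10 tasks, 4 cores each, on 16-core nodes: 4 tasks per node -> 3 nodes
print(nodes_needed(10, 16, 4))  # 3
```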

How does Spark achieve parallelism within one task on multi-core or hyper-threaded machines

I have been reading and trying to understand how the Spark framework uses its cores in standalone mode. According to the Spark documentation, the parameter spark.task.cpus is set to 1 by default, meaning the number of cores to allocate for each task.
Question 1:
For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-thread?
What will happen if I set spark.task.cpus = 16, more than the number of available hardware threads on this machine?
Question 2:
How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Maybe I am missing something. Is this related to the Scala language?
To answer your title question, Spark by itself does not give you parallelism gains within a task. The main purpose of the spark.task.cpus parameter is to allow for tasks of multithreaded nature. If you call an external multithreaded routine within each task, or you want to encapsulate the finest level of parallelism yourself on the task level, you may want to set spark.task.cpus to more than 1.
Setting this parameter to more than 1 is not something you would do often, though.
The scheduler will not launch a task if the number of available cores is less than the cores required by the task, so if your executor has 8 cores, and you've set spark.task.cpus to 3, only 2 tasks will get launched.
If your task does not consume the full capacity of the cores all the time, you may find that using spark.task.cpus=1 and tolerating some contention within the task still gives you more performance.
Overhead from things like GC or I/O probably shouldn't be included in the spark.task.cpus setting, because it'd probably be a much more static cost, that doesn't scale linearly with your task count.
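To make the "tasks of multithreaded nature" idea concrete, here is a hedged sketch in plain Python (not Spark code): a single task body that spawns two worker threads internally. This is the situation in which you would declare spark.task.cpus=2, so the scheduler reserves both cores for the task:

```python
from concurrent.futures import ThreadPoolExecutor

def multithreaded_task(chunk):
    """One 'task' that is internally parallel: it splits its chunk of data
    across two threads, so it really occupies two cores at a time."""
    half = len(chunk) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(sum, chunk[:half])
        right = pool.submit(sum, chunk[half:])
        return left.result() + right.result()

print(multithreaded_task(list(range(10))))  # 45
```

The point is that Spark itself does not create those two inner threads; the task body does, and spark.task.cpus merely tells the scheduler to budget for them.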
Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-thread?
The JVM will almost always rely on the OS to provide it with info and mechanisms to work with CPUs, and AFAIK Spark doesn't do anything special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() return 4 for your dual-core HT-enabled Intel® processor, Spark will also see 4 cores.
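You can check what the runtime perceives yourself. The Python equivalent of the JVM's availableProcessors() call is os.cpu_count(), and it reports logical CPUs the same way, so a dual-core hyper-threaded machine reports 4:

```python
import os

# Logical CPUs as the OS reports them: hardware threads count as CPUs,
# so a dual-core hyper-threaded machine reports 4 here, just as the JVM's
# Runtime.getRuntime().availableProcessors() would.
print(os.cpu_count())
```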
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Like mentioned above, Spark won't automatically parallelize a task according to the spark.task.cpus parameter. Spark is mostly a data parallelism engine and its parallelism is achieved mostly through representing your data as RDDs.
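A rough illustration of that point, in plain Python standing in for Spark's model: a filter is applied to each partition by an independent, single-threaded task, and parallelism comes from having many partitions, not from splitting one task across cores:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))
num_partitions = 4
# Stand-in for an RDD's partitions: the data split into 4 slices.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def filter_task(partition):
    # Each task applies the filter to its own partition, single-threaded.
    return [x for x in partition if x % 2 == 0]

# One task per partition, run concurrently -- this is the data parallelism.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    results = pool.map(filter_task, partitions)

evens = sorted(x for part in results for x in part)
print(len(evens))  # 50
```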

What happens if I try to use more cores than I have?

In my SparkConf, I can set the number of cores to use. I have 4 physical cores and 8 logical cores on my laptop. What does Spark do if I specify a number that isn't possible on the machine, say 100 cores?
The number of cores doesn't describe physical cores but the number of running threads. This means that nothing really strange happens if the number is higher than the number of available cores.
Depending on your setup, it can actually be a preferred configuration, with a value of around twice the number of available cores being a commonly recommended setting. Obviously, if the number is too high, your application will spend more time switching between threads than on actual processing.
It heavily depends on your cluster manager. I assume that you're asking about local[n] run mode.
If so, the driver and the one and only one executor are the same JVM with n number of threads.
DAGScheduler, the Spark execution planner, will use n threads to schedule as many tasks as you've told it to.
If you have more tasks, i.e. threads, than cores, your OS will have to deal with more threads than cores and schedule them appropriately.
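You can watch the OS do this juggling with an ordinary thread pool (illustrative Python, no Spark involved): request far more threads than cores, and everything still completes correctly, just with time-slicing:

```python
import os
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 1
oversubscribed = cores * 10  # deliberately many more threads than cores

with ThreadPoolExecutor(max_workers=oversubscribed) as pool:
    # 100 tiny jobs; the OS time-slices them across the available cores.
    results = list(pool.map(lambda x: x * x, range(100)))

print(results[:5])  # [0, 1, 4, 9, 16]
```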

How to control the number of threads/cores used?

I am running Spark on a local machine, with 8 cores, and I understand that I can use "local[num_threads]" as the master, and use "num_threads" in the bracket to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I requested. For example, if I specify only 1 thread for Spark, using the top command on Linux I can still observe that the CPU usage is often more than 100% and even 200%, implying that more than one thread is actually in use.
This may be a problem if I need to run multiple programs concurrently. How can I control the number of threads/cores used strictly by Spark?
Spark uses one thread for its scheduler, which explains the usage pattern you see. If you run n worker threads in parallel, you'll see n+1 cores used.
For details, see the scheduling doc.