I am running Spark on a local machine, with 8 cores, and I understand that I can use "local[num_threads]" as the master, and use "num_threads" in the bracket to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I requested. For example, if I specify only 1 thread for Spark, the top command on Linux still often shows CPU usage above 100% and even 200%, implying that more than 1 thread is actually in use.
This may be a problem if I need to run multiple programs concurrently. How can I control the number of threads/cores used strictly by Spark?
Spark uses one thread for its scheduler, which explains the usage pattern you see: if you launch n threads in parallel, you'll get n+1 cores used.
For details, see the scheduling doc.
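If you want the thread count as low as possible, a sketch like the following may help (your_app.py is a placeholder; note that the JVM itself still spawns GC and RPC threads, which you can only partly bound):

```shell
# Run Spark with a single task thread
spark-submit --master "local[1]" your_app.py

# The JVM's GC threads also show up in top; you can cap them explicitly
spark-submit --master "local[1]" \
  --conf "spark.driver.extraJavaOptions=-XX:ParallelGCThreads=1" \
  your_app.py
```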
Related
Apache Samza's documentation states that it can be run with multiple threads per worker:
Threading model and ordering
Samza offers a flexible threading model to run each task. When running your applications, you can control the number of workers needed to process your data. You can also configure the number of threads each worker uses to run its assigned tasks. Each thread can run one or more tasks. Tasks don’t share any state - hence, you don’t have to worry about coordination across these threads.
From my understanding, this means Samza uses the same architecture as Kafka Streams, i.e. tasks are statically assigned to threads. I think a reasonable choice would be to set the number of threads more or less equal to the number of CPU cores. Does that make sense?
I am now wondering how the number of threads can be configured in Samza. I found the option job.container.thread.pool.size. However, its description suggests it does something different: running operations of a task in parallel (which could impair ordering?). It also confuses me that the default value is 0 rather than 1.
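For reference, here is a sketch of the configuration as I currently understand it (property names taken from the Samza config reference; the counts are illustrative, and my reading of the default may be off):

```properties
# Number of workers (containers) for the job (YARN deployment)
cluster-manager.container.count=4

# Per-container thread pool for running tasks. The default 0 appears to
# mean all tasks of a container run on the single main event-loop thread
# (preserving per-task ordering); a value > 0 runs tasks on a pool.
job.container.thread.pool.size=8
```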
Given a cluster of several nodes, each of which hosts a multi-core processor, is there any advantage to using MPI between nodes and OpenMP/pthreads within nodes over using pure MPI everywhere? If I understand correctly, if I run an MPI program on a single node and set the number of processes equal to the number of cores, then I will have an honest parallel MPI job of several processes running on separate cores. So why bother with hybrid parallelization using threads within nodes and MPI only between nodes? I have no question in the case of an MPI+CUDA hybrid, since MPI cannot employ GPUs, but it can employ CPU cores, so why use threads?
Using a combination of OpenMP/pthread threads and MPI processes is known as hybrid programming. It is tougher to program than pure MPI, but with the recent reduction in OpenMP latencies, hybrid MPI makes a lot of sense. Some advantages are:
Avoiding data replication: Since threads can share data within a node, data that would otherwise be replicated across processes can be shared instead.
Lightweight: Threads are lightweight, so you reduce the metadata associated with processes.
Fewer messages: A single process within a node can handle communication with the other nodes on behalf of its threads, reducing the number of messages between nodes (and thus the pressure on the network interface card). The number of messages involved in collective communication is notable.
Faster communication: As pointed out by user3528438 above, since threads communicate using shared memory, you can avoid point-to-point MPI communication within a node. A more recent approach (2012) recommends using RMA shared memory instead of threads within a node; this model is called MPI+MPI (search Google Scholar for "MPI plus MPI").
Hybrid MPI has its disadvantages as well, but you asked only about the advantages.
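To make the data-replication point concrete, here is a minimal Python sketch (threads standing in for OpenMP threads; the table and indices are made up): several threads read one shared table directly, whereas separate MPI ranks would each need their own copy or an explicit shared-memory window.

```python
import threading

# A large read-only lookup table every worker needs. With threads it exists
# once per node; with pure MPI, each rank would hold (or window-share) a copy.
table = list(range(1000))

results = []
lock = threading.Lock()

def worker(i):
    # Threads read `table` directly from shared memory: no copy, no message.
    with lock:
        results.append(table[i * 10])

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 10, 20, 30]
```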
This is in fact a much more complex question than it looks.

It depends on a lot of factors. From experience I would say: you are always happier if you can avoid hybrid OpenMP-MPI, which is a mess to optimize. But there are cases where you cannot avoid it, depending mainly on the problem you are solving and the cluster you have access to.

Let's say you are solving a highly parallelizable problem and you have a small cluster; then hybrid will probably be useless.

But if you have a problem which, let's say, scales well up to N processes but starts to have very bad efficiency at 4N, and you have access to a cluster with 10N cores, then hybridization will be a solution. You will use a small number of threads per MPI process, something like 4 (it is known that more than 8 is not efficient).
(It's fun to think that on KNL, most people I know use 4 to 8 threads per MPI process, even though one chip has 68 cores.)
Then what about hybrid accelerator/OpenMP/MPI?

You are wrong about accelerator + MPI. As soon as you start to use a cluster that has accelerators, you will need to use something like OpenMP/MPI, CUDA/MPI, or OpenACC/MPI, as you will need to communicate between devices. Nowadays you can bypass the CPU using GPUDirect (at least for NVIDIA; I have no clue about other vendors, but I expect it is the case there too). Then usually you will use 1 MPI process per GPU. Most clusters with GPUs will have 1 socket and N accelerators (N
I don't quite understand spark.task.cpus parameter. It seems to me that a “task” corresponds to a “thread” or a "process", if you will, within the executor. Suppose that I set "spark.task.cpus" to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1 then you will have #spark.cores.max number of concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means you never go above your initial contract (defined by spark.cores.max).
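The arithmetic of that contract can be sketched as follows (a trivial illustration, not Spark code):

```python
def concurrent_tasks(cores_max: int, task_cpus: int) -> int:
    # Spark launches a task only when task_cpus free cores are available,
    # so at most floor(cores_max / task_cpus) tasks run at once.
    return cores_max // task_cpus

print(concurrent_tasks(10, 2))  # 5 concurrent tasks, 10 threads total
print(concurrent_tasks(8, 3))   # 2 tasks: 3 cores each, 2 cores left idle
```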
I have been reading and trying to understand how the Spark framework uses its cores in standalone mode. According to the Spark documentation, the parameter spark.task.cpus is set to 1 by default, and it means the number of cores to allocate for each task.
Question 1:
For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-thread?
What will happen if I set "spark.task.cpus = 16", more than the number of available hardware threads on this machine?
Question 2:
How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Maybe I am missing something. Is this related to the Scala language?
To answer your title question, Spark by itself does not give you parallelism gains within a task. The main purpose of the spark.task.cpus parameter is to allow for tasks of multithreaded nature. If you call an external multithreaded routine within each task, or you want to encapsulate the finest level of parallelism yourself on the task level, you may want to set spark.task.cpus to more than 1.
Setting this parameter to more than 1 is not something you would do often, though.
The scheduler will not launch a task if the number of available cores is less than the cores required by the task, so if your executor has 8 cores, and you've set spark.task.cpus to 3, only 2 tasks will get launched.
If your task does not consume the full capacity of the cores all the time, you may find that using spark.task.cpus=1 and experiencing some contention within the task still gives you more performance.
Overhead from things like GC or I/O probably shouldn't be included in the spark.task.cpus setting, because it'd probably be a much more static cost, that doesn't scale linearly with your task count.
Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-thread?
The JVM will almost always rely on the OS to provide it with info and mechanisms to work with CPUs, and AFAIK Spark doesn't do anything special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() return 4 for your dual-core HT-enabled Intel® processor, Spark will also see 4 cores.
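You can check what your OS reports from any language. As an illustration (Python standing in for the JVM call, which I'm assuming behaves the same way on your platform):

```python
import os

# Like the JVM's Runtime.getRuntime().availableProcessors(), this reports
# *logical* processors: a dual-core CPU with hyper-threading shows 4.
logical = os.cpu_count()
print(logical)
```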
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Like mentioned above, Spark won't automatically parallelize a task according to the spark.task.cpus parameter. Spark is mostly a data parallelism engine and its parallelism is achieved mostly through representing your data as RDDs.
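As an illustration of what "data parallelism" means here (plain Python standing in for RDD partitions; the chunking is made up): a filter is parallelized by running the same predicate over different partitions on different workers, not by splitting one filter call across cores.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))

def keep(x):
    return x % 2 == 0

# Split the data into 4 "partitions" and filter each one on its own worker.
chunks = [data[i:i + 25] for i in range(0, 100, 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    filtered_chunks = list(pool.map(lambda chunk: [x for x in chunk if keep(x)],
                                    chunks))

# Reassemble the partitions; order is preserved because map preserves it.
result = [x for chunk in filtered_chunks for x in chunk]
print(result[:5])  # [0, 2, 4, 6, 8]
```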
In my SparkConf, I can set the number of cores to use. I have 4 physical and 8 logical cores on my laptop. What does Spark do if I specify a number that is not possible on the machine, say 100 cores?
The number of cores doesn't describe physical cores but the number of running threads. This means nothing really strange happens if the number is higher than the number of available cores.

Depending on your setup, it can actually be a preferred configuration, with a value of around twice the number of available cores being a commonly recommended setting. Obviously, if the number is too high, your application will spend more time switching between threads than on actual processing.
It heavily depends on your cluster manager. I assume that you're asking about local[n] run mode.
If so, the driver and the one and only one executor are the same JVM with n number of threads.
DAGScheduler - the Spark execution planner - will use those n threads to run as many tasks concurrently as you've allowed.
If you run more task threads than you have cores, your OS will have to deal with more threads than cores and schedule them appropriately.