How many cores does each Hadoop map task use? - multithreading

I'm running a Hadoop job on a YARN cluster with a maximum of 8 tasks and 16 cores.
When I run the job I see 8 tasks running on a node, yet all 16 cores are being used.
Is a map task multithreaded?
Does a map task use more than one core?
Can I find out which cores each map task used?
Thanks,
Assaf

You can configure the number of cores per map, as well as the maximum number of usable cores - see here.
The question sounds a bit confused, so here are some more details which may be relevant:
A task might do more than just run a map, and if you're running Hadoop, something else in the system might be using the cores (i.e., maybe some other process is using them).
A mapping task might use more than one mapper to do its job - that's part of the point of using Hadoop and an MR architecture - your work will get auto-magically distributed and split for you.
Also, beware: your number of tasks doesn't directly relate to the number of mappers, cores or other resources in use; if what you're looking to do is limit CPU usage, or in any other way control resource allocation, change the properties of your containers.
For a more detailed discussion of resource allocation (esp. when compared to MR1) see here.
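As a hedged illustration of "change the properties of your containers", here is a minimal Scala sketch using the standard Hadoop MRv2 API; the property names (mapreduce.map.cpu.vcores, mapreduce.map.memory.mb) are the usual YARN/MapReduce container settings, while the values and job name are made up:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// Ask YARN for 2 vcores per map container (MRv2 on YARN only; illustrative value,
// and only enforced if the scheduler accounts for CPU, e.g. DominantResourceCalculator).
conf.setInt("mapreduce.map.cpu.vcores", 2)
// Memory per map container, in MB.
conf.setInt("mapreduce.map.memory.mb", 2048)

val job = Job.getInstance(conf, "example-job") // hypothetical job name
// ... set mapper/reducer classes and input/output paths, then job.waitForCompletion(true)
```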

Related

Limit cores per Apache Spark job

I have a dataset on which I'd like to run multiple jobs in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at around 50 cores the gain from adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is set the number of partitions to 50 so the jobs could only spawn 50 tasks(?). But apparently there are some performance benefits to having more partitions than available cores, for better overall utilization.
But other than that I didn't spot anything useful in the docs to limit the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors).
Is there any good way to do this?
Perhaps asking the Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them; perhaps they give you some more flexibility on top of fair scheduling.
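A minimal sketch in Scala of what that could look like, assuming made-up paths, pool names and a column name; spark.scheduler.mode, spark.scheduler.allocation.file and spark.scheduler.pool are the standard Spark fair-scheduling settings:

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val spark = SparkSession.builder()
  .appName("parallel-jobs")
  .config("spark.scheduler.mode", "FAIR") // round-robin between concurrent jobs
  // optionally point at a pool definition file with weights/minShare per pool:
  // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

val df = spark.read.parquet("/data/input") // hypothetical input

// Each action runs in its own thread, so Spark schedules them as separate jobs.
val jobs = Seq("poolA", "poolB").map { pool =>
  Future {
    // setLocalProperty is per-thread, so each job lands in its own pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    df.groupBy("someKey").count().write.parquet(s"/data/out/$pool") // hypothetical column/paths
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```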
Seems like conflicting requirements with no silver bullet.
1. parallelize as much as possible.
2. prevent any one job from hogging resources IF (and only if) another job is running as well.
So:
if you increase the number of partitions then you'll address #1 but not #2.
if you specify spark.cores.max then you'll address #2 but not #1.
if you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase the number of partitions, then the only thing you're risking is that a long-running big job will delay the completion/execution of some smaller jobs, though overall it'll take the same amount of time to run two jobs on the given hardware in any order, as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs
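To make the bottom line concrete, a rough sketch (the input path and numbers are invented; spark.cores.max is shown commented out only to mark the knob the advice says to leave alone):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitions-over-core-caps")
  .config("spark.scheduler.mode", "FAIR")  // let concurrent jobs share resources fairly
  // .config("spark.cores.max", "50")      // application-wide cap discussed above; left untouched per the advice
  .getOrCreate()

// Prefer enough partitions to keep all cores busy and let the scheduler interleave jobs.
val df = spark.read.parquet("/data/input") // hypothetical path
val ready = df.repartition(200)            // e.g. roughly 2-3x the total core count
```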

Can a Spark executor be enabled for more threads than CPU cores?

I understand that if executor-cores is set to more than 1, then the executor will run tasks in parallel. However, from my experience, the number of parallel processes in the executor is always equal to the number of CPUs in the executor.
For example, suppose I have a machine with 48 cores and set executor-cores to 4, and then there will be 12 executors.
What we need is to run 8 threads or more for each executor (so 2 or more threads per CPU). The reason is that the task is quite lightweight and CPU usage is quite low, around 10%, so we want to boost CPU usage through multiple threads per CPU.
So asking if we could possibly achieve this in the Spark configuration. Thanks a lot!
Spark executors process tasks, which are derived from the execution plan/code and the partitions of the dataframe. Each core on an executor processes only one task at a time, so an executor gets at most as many concurrent tasks as it has cores. Having more tasks in one executor, as you are asking for, is not possible.
You should look for code changes instead: minimize the number of shuffles (no inner joins; use windows instead, as in the sketch below) and check for skew in your data leading to non-uniformly distributed partition sizes (dataframe partitions, not storage partitions).
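As a hedged illustration of the "use windows instead of joins" suggestion, here is a toy Scala sketch (made-up column names) that attaches a per-key count to every row, first with an aggregate-plus-join and then with a window function computed in place:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[*]").appName("window-vs-join").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value") // toy data

// Join-based version: aggregate, then join the result back onto every row.
val counts = df.groupBy("key").agg(count("*").as("cnt"))
val joined = df.join(counts, Seq("key"))

// Window-based version: compute the same per-key count in place, without the join.
val w = Window.partitionBy("key")
val windowed = df.withColumn("cnt", count("*").over(w))
```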
WARNING:
If, however, you are alone on your cluster and you do not want to change your code, you can change the YARN settings for the server so that it reports more than 48 cores, even though there are just 48. This can lead to severe instability of the system, since executors are now sharing CPUs. (And your OS also needs CPU power.)
This answer is meant as a complement to @Telijas' answer, because in general I agree with it. It's just to give that tiny bit of extra information.
There are some configuration parameters with which you can set the number of threads for certain parts of Spark. There is, for example, a section in the Spark docs that discusses some of them (for all of this I'm looking at the latest Spark version at the time of writing this post: version 3.3.1):
Depending on jobs and cluster configurations, we can set number of threads in several places in Spark to utilize available resources efficiently to get better performance. Prior to Spark 3.0, these thread configurations apply to all roles of Spark, such as driver, executor, worker and master. From Spark 3.0, we can configure threads in finer granularity starting from driver and executor. Take RPC module as example in below table. For other modules, like shuffle, just replace “rpc” with “shuffle” in the property names except spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
Property Name / Default / Meaning:
spark.{driver|executor}.rpc.io.serverThreads - Default: fall back on spark.rpc.io.serverThreads - Meaning: number of threads used in the server thread pool
spark.{driver|executor}.rpc.io.clientThreads - Default: fall back on spark.rpc.io.clientThreads - Meaning: number of threads used in the client thread pool
spark.{driver|executor}.rpc.netty.dispatcher.numThreads - Default: fall back on spark.rpc.netty.dispatcher.numThreads - Meaning: number of threads used in the RPC message dispatcher thread pool
Then here follows a (non-exhaustive, in no particular order; I've just been looking through the source code) list of some other number-of-thread-related configuration parameters:
spark.sql.streaming.fileSource.cleaner.numThreads
spark.storage.decommission.shuffleBlocks.maxThreads
spark.shuffle.mapOutput.dispatcher.numThreads
spark.shuffle.push.numPushThreads
spark.shuffle.push.merge.finalizeThreads
spark.rpc.connect.threads
spark.rpc.io.threads
spark.rpc.netty.dispatcher.numThreads (will be overridden by the driver/executor-specific ones from the table above)
spark.resultGetter.threads
spark.files.io.threads
I didn't add the meaning of these parameters to this answer because that's a different question and quite "Googleable". This is just meant as an extra bit of info.
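Purely as a sketch of where such properties would go (the names are taken from the table and list above; the values are arbitrary and not a recommendation):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.cores", "4")                           // concurrent tasks per executor
  .set("spark.rpc.io.threads", "8")                           // RPC I/O thread pool
  .set("spark.executor.rpc.netty.dispatcher.numThreads", "4") // executor-side RPC dispatcher

val spark = SparkSession.builder().config(conf).getOrCreate()
```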

Apache Spark: Understanding terminology of Driver and Executor Configuration

I have been seeing the following terms in many distributed computing open-source projects, particularly in Apache Spark, and I'm hoping to get an explanation with a simple example.
spark.driver.cores - Number of cores to use for the driver process, only in cluster mode.
spark.driver.memory - Amount of memory to use for the driver process
spark.executor.cores - The number of cores to use on each executor
spark.executor.memory - Amount of memory to use per executor process
spark.task.cpus - Number of cores to allocate for each task.
For example, suppose there are three computers C1, C2 and C3, each with an Intel i5 processor (CPU) which has 4 cores (shorturl.at/apsSW), and assume 16 GB RAM and 1 TB secondary storage (mounted/built-in).
So, where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
If we take C1 as the master computer/server/node (Uff.. too many terms) and C2, C3 as slaves/workers/executors...
At a high level, I am thinking that (assuming the file is not on C1, C2 or C3 but on some cloud storage) -
when we submit a Spark program (let's say a program that reads the file and displays the first row on the driver's console) on the driver (C1), it tells its executors (C2 and C3) to get the data from that CSV file over the internet/intranet in partitions (if partitions are defined in the program).
But,
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Oh, and coming to the term process: it has always been linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
And then there are tasks: do these tasks represent processes spawned from the Spark program?
And in the cloud there is the term vCPU to confuse things more; does a vCPU correspond to a single core?
So, where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
By default you would have the driver randomly assigned to one of those three nodes (with 1 GB of memory and 1 core), and 2 (if I remember correctly) executors assigned (with 1 GB of memory and 1 core each).
Of course all of this is configurable but it's best to stick to defaults until you run out of memory.
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Good; leave it alone until you run out of memory or need more performance. Don't pre-tune your config. Wait until there is a performance issue, then solve that. Far too often people assign 8 GB and 8 cores to a process and use 50k and 1 core. There are settings to tune performance later when it's an issue. Know that they're there, and when you run into an issue, start to tweak them.
Oh, and coming to the term process: it has always been linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
I don't really get the ask. Computation takes resources. You need space or cores to process information.
And then there are tasks: do these tasks represent processes spawned from the Spark program?
A task is a unit of work defined by the program you write. Look into the Spark UI and it will give you some intuition as to how things work.
And in the cloud there is the term vCPU to confuse things more; does a vCPU correspond to a single core?
It allows you to define the number of cores on a machine. You can under-prescribe or over-prescribe them to a node, depending on your needs. For now, imagine them as your way to define what the node has available, and Spark will interpret them as cores.
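A hedged sketch of how those five properties map onto code for the three-machine example above; the values are only illustrative, and note that in cluster mode the spark.driver.* settings normally have to be passed at submit time rather than set inside the program:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("terminology-example")
  .config("spark.driver.cores", "1")     // cores for the driver process (cluster mode only)
  .config("spark.driver.memory", "2g")   // memory for the driver process
  .config("spark.executor.cores", "4")   // concurrent tasks per executor
  .config("spark.executor.memory", "8g") // heap per executor process
  .config("spark.task.cpus", "1")        // cores reserved for each task
  .getOrCreate()

// Reading the hypothetical 1 GB CSV; each executor reads its own partitions of the file.
val df = spark.read.option("header", "true").csv("s3://some-bucket/input.csv")
df.show(1) // the first row is brought back to the driver's console
```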

Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark.
I am running spark in local mode on windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? And a Spark core is associated with a task, right?
If so, since you need a thread for a partition, if my sparksql dataframe has 4 partitions, it needs 4 threads right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions : how to choose the number of partitions of my dataframe, so that further transformations and actions run as fast as possible?
-Should it have 4 partitions since my laptop has 4 logical cores?
-Is the number of partitions related to physical cores or logical cores?
-In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be equal to 4 or 6?
(I know that number of partitions will not have much effect on local mode, but this is just to understand)
There's no such thing as a "Spark core". If you are referring to options like --executor-cores then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give an advantage.
The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use just below a multiple of your total cores; for example, if you have 16 cores, numbers like 47, 79, 127 and similar, just under a multiple of 16, are good to use. The reason for this is that you want to make sure all cores are working (with as little time as possible spent with resources idle, waiting for others to finish), but you leave a little extra to allow for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it will go faster on a second try).
Picking the number is a bit of trial and error though. Take advantage of the Spark job server to monitor how your tasks are running. Having few tasks with many records each means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and you should try to reduce the partitioning in those cases.
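A rough illustration of that advice in local mode, with toy data; the "multiple of total cores minus one" target is just one example of "just below a multiple":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("partitioning").getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("n")                          // toy data
val totalCores = spark.sparkContext.defaultParallelism // 4 with local[4]
val target = totalCores * 3 - 1                         // just under a multiple of the core count
val repartitioned = df.repartition(target)
println(repartitioned.rdd.getNumPartitions)             // 11
```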

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information".
I'm running the cluster using m3.2xlarge instances for the worker nodes. I'm using a single m3.xlarge for the YARN master - the smallest m3 instance I can get it to run on, since it doesn't do much.
The situation is this: When I run a Spark job, the number of requested cores for each executor is 8. (I only got this after configuring "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" which isn't actually in the documentation, but I digress). This seems to make sense, because according to these docs an m3.2xlarge has 8 "vCPUs". However, on the actual instances themselves, in /etc/hadoop/conf/yarn-site.xml, each node is configured to have yarn.nodemanager.resource.cpu-vcores set to 16. I would (at a guess) think that must be due to hyperthreading or perhaps some other hardware fanciness.
So the problem is this: when I use maximizeResourceAllocation, I get the number of "vCPUs" that the Amazon Instance type has, which seems to be only half of the number of configured "VCores" that YARN has running on the node; as a result, the executor is using only half of the actual compute resources on the instance.
Is this a bug in Amazon EMR? Are other people experiencing the same problem? Is there some other magic undocumented configuration that I am missing?
Okay, after a lot of experimentation, I was able to track down the problem. I'm going to report my findings here to help people avoid frustration in the future.
While there is a discrepancy between the 8 cores asked for and the 16 VCores that YARN knows about, this doesn't seem to make a difference. YARN isn't using cgroups or anything fancy to actually limit how many CPUs the executor can use.
"Cores" on the executor is actually a bit of a misnomer. It is really how many concurrent tasks the executor will willingly run at any one time; it essentially boils down to how many threads are doing "work" on each executor.
When maximizeResourceAllocation is set, when you run a Spark program, it sets the property spark.default.parallelism to be the number of instance cores (or "vCPUs") for all the non-master instances that were in the cluster at the time of creation. This is probably too small even in normal cases; I've heard that it is recommended to set this at 4x the number of cores you will have to run your jobs. This will help make sure that there are enough tasks available during any given stage to keep the CPUs busy on all executors.
When you have data that comes from different runs of different spark programs, your data (in RDD or Parquet format or whatever) is quite likely to be saved with varying number of partitions. When running a Spark program, make sure you repartition data either at load time or before a particularly CPU intensive task. Since you have access to the spark.default.parallelism setting at runtime, this can be a convenient number to repartition to.
TL;DR
maximizeResourceAllocation will do almost everything for you correctly except...
You probably want to explicitly set spark.default.parallelism to 4x the number of instance cores you want the job to run on, on a per-"step" (in EMR speak) / per-"application" (in YARN speak) basis, i.e. set it every time (see the sketch after this list), and...
Make sure within your program that your data is appropriately partitioned (i.e. you want many partitions) to allow Spark to parallelize it properly.
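A sketch of what that might look like in code, with made-up numbers (say 10 core instances with 8 vCPUs each, so 4 x 80 = 320) and a hypothetical S3 path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-step")
  .config("spark.default.parallelism", (10 * 8 * 4).toString) // 4x the instance cores for this step
  .getOrCreate()

// Repartition at load time using the runtime value, as suggested above.
val parallelism = spark.sparkContext.defaultParallelism
val df = spark.read.parquet("s3://my-bucket/input/") // hypothetical path
  .repartition(parallelism)
```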
With this setting you should get 1 executor on each instance (except the master), each with 8 cores and about 30GB of RAM.
Is the Spark UI at http://:8088/ not showing that allocation?
I'm not sure that setting really adds a lot of value compared to the other one mentioned on that page, "Enabling Dynamic Allocation of Executors". That'll let Spark manage its own number of instances for a job, and if you launch a task with 2 CPU cores and 3G of RAM per executor you'll get a pretty good ratio of CPU to memory for EMR's instance sizes.
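For reference, a minimal sketch of that alternative; spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are the standard Spark-on-YARN dynamic-allocation settings, and the 2-core / 3 GB values are the ones mentioned above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // external shuffle service is needed for dynamic allocation on YARN
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "3g")
  .getOrCreate()
```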
In EMR version 3.x, this maximizeResourceAllocation was implemented with a reference table: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/vcorereference.tsv
It is used by a shell script, maximize-spark-default-config, in the same repo; you can take a look at how they implemented this.
Maybe in the new EMR version 4 this reference table is somehow wrong... I believe you can find all of this AWS script on your EC2 instance of EMR; it should be located in /usr/lib/spark or /opt/aws or something like this.
Anyway, at least you can write your own bootstrap action scripts for this in EMR 4, with a correct reference table, similar to the implementation in EMR 3.x.
Moreover, since we are going to use the STUPS infrastructure, it is worth taking a look at the STUPS appliance for Spark: https://github.com/zalando/spark-appliance
You can explicitly specify the number of cores by setting the senza parameter DefaultCores when you deploy your Spark cluster.
Some highlights of this appliance compared to EMR are:
It can be used even with the t2 instance type; it is auto-scalable based on roles like other STUPS appliances, etc.
And you can easily deploy your cluster in HA mode with ZooKeeper, so there is no SPOF on the master node. HA mode in EMR is currently still not possible, and I believe EMR is mainly designed for "large clusters temporarily for ad-hoc analysis jobs", not for a "dedicated cluster that is permanently on", so HA mode will not be possible in the near future with EMR.
