Spark Streaming on Mesos - coarse-grained - apache-spark

I have 2 cores on my vagrant development machine, and want to run 2 streaming applications.
If:
both of them take both available cores (I didn't specify "spark.cores.max"),
they have a streaming interval of 15 seconds,
and 5 seconds is enough to perform the computation,
is it expected behaviour for Mesos to shift the 2 available cores between the 2 applications? I would expect that behaviour, because "Mesos locks the resources until the job is executed", and in Spark Streaming one job is what is executed within a batch interval.
Otherwise, if resources are locked for the life of the application (which, for Spark Streaming, is forever), what is the benefit of using Mesos instead of the Standalone cluster manager?

Spark Streaming locks each stream Reader to a core, plus you'll need at least one other core for the rest of the processing. So you can't run two jobs simultaneously on a 2-core machine.
Mesos gives you much better resource utilization in a cluster. Standalone is more static. It might be fine, though, for a fixed number of long-running streams, as long as you have enough resources and you follow the recommendations for capping the resources each job is allowed to grab (the default is to grab everything).
If you're really just running on a single machine, use local[*] to avoid the overhead of master and slave daemons, etc.
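As a minimal sketch of that single-machine setup (the app name is a placeholder, the 15-second batch interval is the one from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("stream-app-1")      // placeholder name
  // On a 2-core dev box, run in-process and skip the cluster daemons entirely.
  .setMaster("local[*]")
  // On a real standalone/Mesos cluster you would instead cap this app, e.g.:
  // .set("spark.cores.max", "2")

val ssc = new StreamingContext(conf, Seconds(15))
```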

Related

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory:24576, vCores:6
I have two spark-streaming jobs to submit, one after another.
At first, I tried to submit with the default configuration, i.e. spark.dynamicAllocation.enabled=true.
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued and waited for resources for ages. (This is a streaming job which took a small portion of resources every batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configuration, I got a totally different result:
5 containers with 59g Memory allocation for each of the two jobs, and 71g Reserved Memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1) If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but it differs in memory.)
2) If dynamicAllocation=false, why doesn't YARN create containers matching my exact requirements - 6 containers (Spark executors) for each job? And why do two attempts with the same configuration lead to different results?
3) If dynamicAllocation=true, how can a Spark job that consumes little memory take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
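In practice that means pinning resources for the streaming jobs explicitly. A minimal sketch, reusing the sizes from the question (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: fixed resources for a streaming job instead of dynamic allocation.
// The sizes mirror the question's second attempt and are only illustrative.
val spark = SparkSession.builder()
  .appName("streaming-job-1")                          // placeholder name
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "6")
  .config("spark.executor.memory", "12g")
  .config("spark.executor.cores", "3")
  .config("spark.driver.memory", "8g")
  .getOrCreate()
```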
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
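A minimal sketch of that approach, assuming Structured Streaming (the rate source and console sink are just placeholders for your real sources and sinks):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("two-streams-one-app")   // placeholder name
  .getOrCreate()

// start() is non-blocking, so both queries run concurrently
// and share this single application's executors.
val queryA = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("streamA").start()

val queryB = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("streamB").start()

// Block the driver until one of the queries stops or fails.
spark.streams.awaitAnyTermination()
```

Because start() returns immediately, you don't even need explicit driver threads with Structured Streaming; both queries are scheduled inside one YARN application.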
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. First, the resource allocation by YARN depends on the resources available on the cluster when a job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between them. When you throw dynamic allocation into the mix, things get a little more complex. In your case:
7 Running Containers with 84g Memory allocation for the first job.
-- You got 7 containers because you requested 6 executors: one container for each executor, and the extra container is for the Application Master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
-- Since the second job was submitted some time later, YARN allocated the remaining resources: 2 containers, one for each executor, plus the extra one for your Application Master.
The number of containers will never match the number of executors you requested; it will always be one more, because one extra container is needed to run your Application Master.
Hope that answers part of your question.

How SPARK_WORKER_CORES setting impacts concurrency in Spark Standalone

I am using a Spark 2.2.0 cluster configured in Standalone mode. The cluster has 2 octa-core machines. This cluster is exclusively for Spark jobs, and no other process uses them. I have around 8 Spark Streaming apps which run on this cluster. I explicitly set SPARK_WORKER_CORES (in spark-env.sh) to 8 and allocate one core to each app using total-executor-cores setting.
This config reduces the capability to work in parallel on multiple tasks: if a stage works on a partitioned RDD with 200 partitions, only one task executes at a time. What I wanted Spark to do was to start a separate thread for each job and process in parallel, but I couldn't find a separate Spark setting to control the number of threads.
So, I decided to play around and bloated the number of cores (i.e. SPARK_WORKER_CORES in spark-env.sh) to 1000 on each machine, and then gave 100 cores to each Spark application. I found that Spark started processing 100 partitions in parallel this time, indicating that 100 threads were being used. I am not sure if this is the correct method of impacting the number of threads used by a Spark job.
You mixed up two things:
Cluster manager properties - SPARK_WORKER_CORES - the total number of cores that a worker can offer. Use it to control the fraction of resources that should be used by Spark in total.
Application properties - --total-executor-cores / spark.cores.max - the number of cores that an application requests from the cluster manager. Use it to control in-app parallelism.
Only the second one is directly responsible for app parallelism, as long as the first one is not limiting.
Also, a core in Spark is a synonym for a thread. If you:
allocate one core to each app using total-executor-cores setting.
then you specifically assign a single data processing thread.
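As a rough illustration of the two knobs (the values here are arbitrary, not a recommendation): SPARK_WORKER_CORES in spark-env.sh caps what each worker offers, while spark.cores.max is what a single application asks for, and therefore how many tasks it can run concurrently.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: with 2 workers x SPARK_WORKER_CORES=8, the cluster offers 16 cores.
// Asking for all 16 lets this one app run up to 16 tasks (threads) at once,
// but leaves nothing for the other applications.
val conf = new SparkConf()
  .setAppName("parallel-app")      // placeholder name
  .set("spark.cores.max", "16")

val sc = new SparkContext(conf)
```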

Spark on Mesos is much slower than local

I'm running a Spark Streaming process on a host with 16 CPUs and 64 GB RAM, with Mesos.
When I run it using Mesos as the cluster manager (by setting --master mesos://leader.mesos:5050), it runs much slower than in local mode (--master local[4]).
I can't find the reason for that and I have no clue. One of the things I've noticed is that there is one specific task that takes significantly more time on Mesos than in local mode.
The weird thing (maybe that should be the question's title) is that the task itself takes 6s, while its stage (it has only one stage) takes less than a second. See the attached pictures (Mesos (1) and (2)). How come? Isn't a job equal to the sum of its parts?
[Spark UI screenshots omitted: Local, Mesos (1), Mesos (2)]
Another note: I did manage to run this exact same Spark Streaming process on another Mesos cluster, and it runs in a sensible amount of time, pretty much like in the local mode described above. The only difference I can think of is that this cluster has more than one host, and that Spark is running with 2 executors rather than 1. (I couldn't find a way to run more than 1 executor on the same host on Mesos.) Could this be the reason?
Any clues would be much appreciated.
Spark can run over Mesos in two modes: coarse-grained (default) and fine-grained (see documentation).
In coarse-grained mode, Spark launches exactly one executor on each machine it was assigned by Mesos. Inside this task Spark launches its own mini-tasks. This mode has the benefit of lower startup overhead (in your case you don't want to change this mode).
Could you be more specific about your streaming job? Is it CPU-, disk-, or network-bound? You can easily compare performance if you run some of the Spark examples.
If your task is CPU-intensive, you might consider setting spark.mesos.extra.cores. By default Spark tries to acquire all cores that are being offered by Mesos, so if there's no other task running on that cluster it shouldn't be a problem.
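For reference, a sketch of how that setting might be passed (the master URL is the one from the question; the extra-core count is arbitrary):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://leader.mesos:5050")
  .setAppName("streaming-on-mesos")     // placeholder name
  // Advertise extra cores per executor so Spark schedules more concurrent
  // tasks on it than the number of cores acquired from Mesos.
  .set("spark.mesos.extra.cores", "4")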

Spark on Mesos - running multiple Streaming jobs

I have 2 spark streaming jobs that I want to run, as well as keeping some available resources for batch jobs and other operations.
I evaluated the Spark Standalone cluster manager, but I realized that I would have to fix the resources for the two jobs, which would leave almost no computing power for batch jobs.
I started evaluating Mesos, because it has a "fine-grained" execution model, where resources are shifted between Spark applications.
1) Does it mean that a single core can be shifted between the 2 streaming applications?
2) Although I have Spark & Cassandra, in order to exploit data locality, do I need a dedicated core on each of the slave machines to avoid shuffling?
3) Would you recommend running streaming jobs in "fine-grained" or "coarse-grained" mode? I know the logical answer is coarse-grained (to minimize the latency of streaming apps), but what about when total cluster resources are limited (a cluster of 3 nodes, 4 cores each, with 2 streaming applications to run and occasional batch jobs)?
4) In Mesos, when I run a Spark Streaming job in cluster mode, will it occupy 1 core permanently (like the Standalone cluster manager does), or will that core execute the driver process and sometimes act as an executor?
Thank you
Fine-grained mode is actually deprecated now. Even with it, each core is allocated to a task until completion, but in Spark Streaming, each processing interval is a new job, so tasks only last as long as the time it takes to process each interval's data. Hopefully that time is less than the interval time, or your processing will back up, eventually running out of memory to store all those RDDs waiting for processing.
Note also that you'll need to have one core dedicated to each stream's Reader. Each will be pinned for the life of the stream! You'll need extra cores in case the stream ingestion needs to be restarted; Spark will try to use a different core. Plus you'll have a core tied up by your driver, if it's also running on the cluster (as opposed to on your laptop or something).
Still, Mesos is a good choice, because it will allocate the tasks to nodes that have capacity to run them. Your cluster sounds pretty small for what you're trying to do, unless the data streams are small themselves.
If you use the Datastax connector for Spark, it will try to keep input partitions local to the Spark tasks. However, I believe that connector assumes it will manage Spark itself, using Standalone mode. So, before you adopt Mesos, check to see if that's really all you need.
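If you do go with Mesos in coarse-grained mode, one way to budget the 12 cores from the question (3 nodes x 4 cores) is to cap each streaming app; the cap of 4 cores below is only an example and leaves cores free for the batch jobs and the drivers:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch for one of the two streaming apps; repeat with a different app name
// for the second one. The master URL and batch interval are placeholders.
val conf = new SparkConf()
  .setMaster("mesos://leader.mesos:5050")
  .setAppName("streaming-app-1")
  // Cap this app at 4 cores; remember one of them stays pinned
  // to the stream's receiver (Reader).
  .set("spark.cores.max", "4")

val ssc = new StreamingContext(conf, Seconds(10))
```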

Multiple spark streaming contexts on one worker

I have a single-node cluster with 2 CPUs, where I want to run 2 Spark Streaming jobs.
I also want to use the "cluster" deploy mode. I am using the Standalone cluster manager.
When I submit one application, I see that the driver consumes 1 core, and the worker 1 core.
Does it mean that there are no cores available for the other streaming job? Can 2 streaming jobs reuse executors?
It is totally confusing me, and I don't find the documentation really clear on this.
Srdjan
Does it mean that there are no cores available for the other streaming job?
If you have a single worker with 2 CPUs and you're deploying in cluster mode, then you'll have no available cores, as the worker has to use a dedicated core for the driver process to run on your worker machine.
Can 2 streaming jobs reuse executors?
No, each job needs to allocate dedicated resources given by the cluster manager. If one job is running with all available resources, the next scheduled job will be in WAITING state until the first completes. You can see it in the Spark UI.
