As I understand it, Spark decides the number of jobs based on the actions performed. I have 6 actions in my Spark application, which are further divided into stages, but I see more than 6 jobs being spawned.
Is my understanding correct, or am I missing something?
Thanks
Related
I am new to Spark. I have a couple of questions regarding the Spark Web UI:
I have seen that Spark can create multiple Jobs for the same application. On what basis does it create the Jobs?
I understand Spark creates multiple Stages for a single Job around shuffle boundaries, and that there is one task per partition. However, I have seen a particular Stage (e.g. Stage 1) of a Job create fewer tasks than the default shuffle partitions value (e.g. only 2/2 completed), and I have also seen the next Stage (Stage 2) of the same Job create 1500 tasks (e.g. 1500/1500 completed), which is more than the default shuffle partitions value.
So, how does Spark determine how many tasks it should create for any particular Stage?
Can anyone please help me understand the above.
The number of tasks in a stage is the number of partitions that stage processes; how many of them run at any one moment depends on your core and executor counts. Different stages therefore have different task counts.
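As a rough sketch of why the counts differ (the app name and numbers below are placeholders chosen to mirror the question): the stage that reads an RDD with 2 partitions runs 2 tasks, while the post-shuffle stage runs as many tasks as partitions were requested for the shuffle.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch; the app name and numbers are placeholders.
val spark = SparkSession.builder.appName("task-count-demo").getOrCreate()
val sc = spark.sparkContext

// Stage 1: this RDD has 2 partitions, so its stage runs 2 tasks,
// regardless of the default shuffle-partitions setting.
val rdd = sc.parallelize(1 to 1000, numSlices = 2)

// Stage 2: reduceByKey is asked for 1500 partitions, so the
// post-shuffle stage runs 1500 tasks.
val counts = rdd.map(x => (x % 10, 1)).reduceByKey(_ + _, numPartitions = 1500)

counts.count()  // action: one job made up of the two stages above
```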
I have one question regarding Spark execution.
We all know that each Spark application (or the driver program) may contain one or many actions.
My question is which one is correct: does a collection of jobs correspond to one action, or does each job correspond to one action? Here, a job means the one that can be seen in the Spark execution UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.
Your understanding is correct.
Each action in Spark corresponds to a Spark job, and these actions are called by the driver program in the application.
An action can involve many transformations on the dataset (or RDD), and those transformations are what create the stages in the job.
A stage can be thought of as the set of calculations (tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network traffic between workers is required, for example in a shuffle. The dependencies that create stage boundaries are called ShuffleDependencies.
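A minimal sketch of how that plays out in the UI (the input path is a placeholder): each action below shows up as its own job, and the shuffle introduced by reduceByKey splits the work into two stages.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only; the input path is a placeholder.
val spark = SparkSession.builder.appName("job-stage-demo").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))   // narrow transformations stay in the same stage
  .reduceByKey(_ + _)       // ShuffleDependency: marks a stage boundary

counts.count()    // action 1 -> one job with two stages (read/map, then reduce)
counts.collect()  // action 2 -> a second job (its pre-shuffle stage may show as "skipped")
```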
Do tasks in the same stage run simultaneously? If so, what do the lines between partitions in a stage refer to? example of a DAG
Here is a good link for your reading that explains the DAG in detail and a few other things that may be of interest: databricks blog on DAG
I can try to explain. As each stage is created, it gets a set of tasks that are divided up. When an action is encountered, the driver sends the tasks to the executors; based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. So the arrows that you are seeing are the execution plan: for example, the map function cannot run before the file is read. Each node that holds some of the data will execute those tasks in the order given by the DAG.
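To make those arrows concrete, you can print the lineage that Spark turns into the DAG; the indentation in the output marks where a shuffle (and therefore a new stage) begins. In this sketch the path and partition count are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Sketch; the path and partition count are illustrative.
val spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
val sc = spark.sparkContext

val result = sc.textFile("hdfs:///tmp/input.txt", minPartitions = 4)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(result.getNumPartitions)  // number of tasks in the final stage
println(result.toDebugString)     // textual view of the lineage and stage boundaries
```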
I use Spark 2.1 and Kafka 0.9.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.
According to this, if I have multiple jobs from multiple threads in Spark Streaming (one topic per thread), is it possible that multiple topics run simultaneously if I have enough cores in my cluster, or would it just do a round robin across pools but run only one job at a time?
Context:
I have two topics, T1 and T2, both with one partition. I have configured a pool with schedulingMode set to FAIR. I have 4 cores registered with Spark. Each topic has two actions (hence two jobs, so four jobs in total across the topics). Let's say J1 and J2 are the jobs for T1, and J3 and J4 are the jobs for T2. What Spark does in FAIR mode is execute J1, J3, J2, J4, but at any time only one job is executing. Since each topic has only one partition, only one core is being used and three are just idle. This is something I don't want.
Is there any way I can avoid this?
If I have multiple jobs from multiple threads...is it possible that multiple topics can run simultaneously
Yes. That's the purpose of FAIR scheduling mode.
As you may have noticed, I removed "Spark Streaming" from your question since it does not contribute in any way to how Spark schedules Spark jobs. It does not really matter whether you start your Spark jobs from a "regular" application or Spark Streaming one.
Quoting Scheduling Within an Application (highlighting mine):
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
And then there is the quote you used to ask your question, which should now be clearer:
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources.
So, speaking about Spark Streaming, you'd have to configure FAIR scheduling mode, and Spark Streaming's JobScheduler should then submit Spark jobs per topic in parallel (I haven't tested it out myself, so it's more theory than practice).
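As a rough, untested sketch of that idea (the pool names and the per-topic work below are stand-ins, not Spark Streaming's JobScheduler itself): jobs submitted from separate threads under FAIR mode can run in parallel, each in its own pool.

```scala
import org.apache.spark.sql.SparkSession

// Untested sketch; pool names and per-topic work are stand-ins.
val spark = SparkSession.builder
  .appName("fair-demo")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
val sc = spark.sparkContext

def processTopic(topic: String): Unit = {
  // Local properties are per-thread, so each topic's jobs land in their own pool.
  sc.setLocalProperty("spark.scheduler.pool", topic)
  sc.parallelize(1 to 100).count()  // stand-in action: one Spark job per call
}

val threads = Seq("T1", "T2").map(topic => new Thread(() => processTopic(topic)))
threads.foreach(_.start())
threads.foreach(_.join())
```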
I think the fair scheduler alone will not help, as it's the Spark Streaming engine that takes care of submitting the Spark jobs, and it normally does so sequentially.
There's an undocumented configuration parameter in Spark Streaming: spark.streaming.concurrentJobs[1], which is set to 1 by default. It controls the parallelism level of the jobs submitted to Spark.
By increasing this value, you may see parallel processing of the different Spark stages of your streaming job.
I would think that by combining this configuration with the fair scheduler in Spark, you will be able to achieve controlled parallel processing of the independent topic consumers. This is mostly uncharted territory.
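For reference, a sketch of wiring the two settings together in a streaming application (untested; the batch interval and the value of the knob are arbitrary, and spark.streaming.concurrentJobs remains undocumented):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Untested sketch: combine FAIR scheduling with the undocumented concurrency knob.
val conf = new SparkConf()
  .setAppName("concurrent-streaming-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.streaming.concurrentJobs", "2")  // arbitrary value for this sketch

val ssc = new StreamingContext(conf, Seconds(10))
// ... define one input DStream per topic, plus at least one output operation, before start() ...
ssc.start()
ssc.awaitTermination()
```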
I have been using Spark + Python to finish some work. It's great, but I have a question in my mind:
Where are the transformations and actions of a Spark job executed?
Are transformations done on the Spark Master (or Driver) while actions are done on the Workers (Executors), or are both of them done on the Workers (Executors)?
Workers (aka slaves) are running Spark instances where executors live to execute tasks.
Transformations are performed on the workers; when an action method is called, the computed data is brought back to the driver.
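A small sketch of that split (the data and computation are arbitrary): the map is a transformation and runs on the executors only when an action forces it, and only the action's result travels back to the driver.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only; data and computation are arbitrary.
val spark = SparkSession.builder.appName("where-work-runs").getOrCreate()
val sc = spark.sparkContext

val squares = sc.parallelize(1L to 1000000L).map(x => x * x)  // transformation: nothing executes yet

val total = squares.reduce(_ + _)  // action: executors compute partial sums, the driver combines them
val sample = squares.take(5)       // action: only a handful of values travel back to the driver
```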
An application in Spark is executed in three steps:
1. Create the RDD graph, i.e. a DAG (directed acyclic graph) of RDDs that represents the entire computation.
2. Create the stage graph, i.e. a DAG of stages that forms a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
3. Based on the plan, schedule and execute tasks on the workers.
Transformations run on the executors.
Actions run on the executors and the driver: most of the work still happens on the executors, but final steps such as reducing the outputs are executed on the driver.
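To illustrate step 2 and the driver-side finish of an action, here is a tiny sketch (names and numbers are arbitrary): a narrow dependency keeps operators in one stage, a ShuffleDependency forces a new one, and reduce merges the executors' partial results on the driver.

```scala
import org.apache.spark.sql.SparkSession

// Sketch; numbers are arbitrary.
val spark = SparkSession.builder.appName("dependency-demo").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(1 to 100).map(x => (x % 5, x))
val grouped = pairs.reduceByKey(_ + _)

println(pairs.dependencies)    // OneToOneDependency: same stage as its parent
println(grouped.dependencies)  // ShuffleDependency: the RDD graph is broken here into a new stage

grouped.values.reduce(_ + _)   // action: executors do the heavy lifting, the driver merges the partial results
```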
When any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides the operators into stages of tasks. A stage comprises tasks based on the partitions of the input data. The DAG scheduler pipelines operators together.
The stages are passed on to the task scheduler, which launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.
The tasks (transformations) execute on the workers (executors), and when an action (take/collect) is called, the data is brought back to the driver.