SparkSession getOrCreate method stuck while another Spark job is running - apache-spark

I'm trying to run a Spark job from Oozie. My Oozie workflow has two actions that run in parallel. However, once Oozie starts one of them, the other is stuck at the SparkSession.getOrCreate() method until the first one completes.
self.spark_session = SparkSession.builder.master(master).appName(appName).config(conf=conf).enableHiveSupport().getOrCreate()

If you run on YARN, open the Resource Manager UI and check whether there are enough resources (e.g. vCores/memory) for all the jobs.
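As a rough way to automate that check, here is a minimal sketch that reads the ResourceManager's cluster metrics REST endpoint (/ws/v1/cluster/metrics) and compares the free memory and vCores with what a job would request. The base URL, the helper names and the numbers are assumptions for illustration; they are not taken from the question.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch cluster metrics from a YARN ResourceManager.

    rm_url is an assumed base URL such as "http://rm-host:8088".
    """
    with urlopen(rm_url.rstrip("/") + "/ws/v1/cluster/metrics") as resp:
        return json.load(resp)


def has_capacity(metrics, need_mb, need_vcores):
    """Return True if free memory and free vCores both cover the request."""
    m = metrics["clusterMetrics"]
    return (m["availableMB"] >= need_mb
            and m["availableVirtualCores"] >= need_vcores)


# Hypothetical usage:
#   metrics = fetch_cluster_metrics("http://rm-host:8088")
#   print(has_capacity(metrics, need_mb=8192, need_vcores=4))
```

Keep in mind that per-queue limits can hold a job back even when the cluster as a whole has free resources, so this is only a first check.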

Related

Why is there just 1 job id in dataproc when there are multiple actions in the pyspark script?

The definition of a Spark job is:
Job- A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
So why does each spark-submit create just one job ID that I can see in the Dataproc console?
Example: the following application should have 2 Spark jobs:
sc.parallelize(range(1000),10).collect()
sc.parallelize(range(1000),10).collect()
There is a difference between a Dataproc job and a Spark job. When you submit the script through the Dataproc API/CLI, it creates a Dataproc job, which in turn calls spark-submit to submit the script to Spark. Inside Spark, however, the code above does create 2 Spark jobs. You can see them in the Spark UI.

Job inside spark application is complete but I still see the status as running, why?

I am running a Spark application that has completed all its jobs, but its status in the YARN cluster portal is still RUNNING (for more than 30 minutes). Please let me know why this is happening.
Spark UI showing my jobs are completed
Spark application status is still running
I had the same problem with Spark 2.4.8 running on K8s. I didn't understand why, but I solved it by stopping the context manually:
spark.sparkContext.stop()
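To make sure the stop call also happens when the job code raises, one option is a small context manager. This is only a sketch: the helper name stopping is made up here, and it works with any object that has a stop() method, such as a SparkSession.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield session and always call its stop() method afterwards."""
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions alike.
        session.stop()


# Hypothetical usage with PySpark (assumes pyspark is installed):
#   from pyspark.sql import SparkSession
#   with stopping(SparkSession.builder.getOrCreate()) as spark:
#       spark.range(10).count()
```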

Application job submission without duplication

We are using DataStax Spark 6.0.
We submit jobs using crontab to run every 5 minutes. We wrote a script that checks whether the application is already running, to avoid submitting the same application twice. Is there a way, at the Spark level, to stop the submission or keep the job in a queue, to avoid duplicate jobs for the same application?
Thanks
Rakesh
I tried using Crontab only
You can use Oozie to schedule your Spark job.
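If you stay with crontab, a lock file is a common way to skip a run while the previous one is still active. The sketch below is an assumption about how such a wrapper could look, not something DataStax or Spark provides; the lock path and script name are hypothetical, and fcntl is available on Unix-like systems only.

```python
import fcntl


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns the open file object on success (keep it open for the whole
    run), or None if another holder already has the lock.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical cron wrapper:
#   import subprocess, sys
#   lock = try_lock("/tmp/my_spark_app.lock")
#   if lock is None:
#       sys.exit(0)  # previous submission is still running
#   subprocess.run(["spark-submit", "my_app.py"])
```

The lock is released when the wrapper process exits, so this only helps if the wrapper stays alive for as long as the application runs.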

What is 'Active Jobs' in Spark History Server Spark UI Jobs section

I'm trying to understand the Spark History Server components.
I know that the History Server shows completed Spark applications.
Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what 'Active Jobs' means in the Jobs section.
Also, the application completed within 30 minutes, but when I opened the History Server 8 hours later, 'Duration' showed 8.0h.
Please see the screenshot.
Could you please help me understand the 'Active Jobs', 'Duration' and 'Stages: Succeeded/Total' items in the image above?
Finally, after some research, I found the answer to my question.
A Spark application consists of a driver and one or more executors. The driver program instantiates the SparkContext, which coordinates the executors to run the Spark application. This information is displayed in the 'Active Jobs' section of the Spark History Server web UI.
The executors run tasks assigned by the driver.
When a Spark application runs on YARN, Spark provides its own implementation of the YARN client and the YARN application master.
A YARN application has a YARN client, a YARN application master and a list of containers running on the node managers.
In my case Spark runs in YARN standalone mode (the older name for yarn-cluster mode), so the driver program runs as a thread of the YARN application master. The YARN client pulls status from the application master, and the application master coordinates the containers that run the tasks.
While it is running, the job can be monitored on the YARN applications page in the Cloudera Manager Admin Console.
If the application succeeds, the History Server shows the list of 'Completed Jobs' and the 'Active Jobs' section is removed.
If the application fails at the container level and YARN communicates this to the driver, the History Server shows the list of 'Failed Jobs' and the 'Active Jobs' section is removed.
However, if the application fails at the container level and YARN cannot communicate that to the driver, the job started by the driver is left in limbo: the driver thinks the job is still running and keeps waiting to hear its status from the YARN application master. Hence the History Server still shows it under 'Active Jobs' as running.
So my takeaway from this is:
To check the status of a running job, go to the YARN applications page in the Cloudera Manager Admin Console or use the YARN CLI.
After the job completes or fails, open the Spark History Server to get more details on resource usage, the DAG and the execution timeline.
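For the first point, the same information is also exposed by the ResourceManager REST API at /ws/v1/cluster/apps/<application id>. The sketch below interprets such a report; the helper name is made up for illustration, and fetching the JSON is left out.

```python
# YARN application states after which nothing more will happen.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def summarize_app(report):
    """Summarize a YARN application report as (is_done, outcome).

    report is the parsed JSON returned for a single application.
    """
    app = report["app"]
    done = app["state"] in TERMINAL_STATES
    # finalStatus stays UNDEFINED until the application reports a result,
    # so it is only meaningful once the state is terminal.
    outcome = app["finalStatus"] if done else None
    return done, outcome
```

An application that YARN still reports as RUNNING long after its last Spark job finished is a hint that the driver never shut down.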
Invoking an action (count is the action in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
A stage is a physical unit of the execution plan. In short, a stage is a set of parallel tasks, one task per partition. Each job is divided into smaller sets of tasks, and each such set is a stage. Stages depend on one another, somewhat like the map and reduce stages in MapReduce.
Each type of Spark stage in detail:
a. ShuffleMapStage in Spark
A ShuffleMapStage is an intermediate stage in the physical execution of the DAG.
It produces data for one or more other stages.
In other words, the output of a ShuffleMapStage is the input of the stages that follow it in the DAG of stages.
A ShuffleMapStage can contain any number of pipelined operations, such as map and filter, before the shuffle operation. Furthermore, a single ShuffleMapStage can be shared among different jobs.
b. ResultStage in Spark
A ResultStage is the stage that executes a Spark action in a user program by running a function on an RDD. It is the final stage of a job: it applies a function to one or more partitions of the target RDD to compute the result of the action.
Coming back to the question of active jobs in the History Server: there are some notes on this in the official docs for the History Server. There is also a JIRA issue, [SPARK-7889], about the same behavior.
For more details, follow the link:
source-1

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in a property or in the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can schedule a GenerateFlowFile processor (on the primary node, so it won't be scheduled twice unless you want it to be) and connect it to the ExecuteProcess processor, which runs the spark-submit command.
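Whatever tool launches spark-submit, the success/failure branching usually comes down to the exit code of the command. Here is a minimal sketch of that idea outside NiFi; the function name and the example command are hypothetical, and whether the exit code reflects the outcome of the Spark application itself depends on the deploy mode and cluster manager.

```python
import subprocess


def run_and_route(command):
    """Run command and return "success" or "failure" from its exit code."""
    result = subprocess.run(command)
    return "success" if result.returncode == 0 else "failure"


# Hypothetical usage:
#   route = run_and_route(["spark-submit", "--master", "yarn", "my_job.py"])
#   if route == "failure":
#       ...  # e.g. send an alert or retry
```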
For a slightly more complex workflow, I've written an article about it :)
Hope it will help.
