How to submit spark batch jobs to HDInsight from Jupyter PySpark/Livy? - apache-spark

I'm using an Azure HDInsight cluster, and through the Jupyter interface included with HDI I can run interactive Spark queries, but I was wondering how to run long-running jobs. For example, if during my interactive querying I realize I want to run a job that will take a few hours, is there a way to run a command from PySpark itself, e.g. read data from path x, do some transformation, and save it in path y?
Currently, if I just run that job inside the PySpark session itself and leave it running, Livy eventually times out and kills the job. Is there some command to submit the batch job and get an ID I can query later to get the job status?

Related

Get information about the current dataproc cluster created after workflow submission

Suppose I run a PySpark job using a Dataproc workflow template and an ephemeral cluster. How can I get the name of the created cluster from inside my PySpark job?
One way would be to fork out and run this command:
/usr/share/google/get_metadata_value attributes/dataproc-cluster-name
The only output will be the cluster name, without any newline characters or anything else to clean up. See "Running shell command and capturing the output".
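A minimal sketch of "forking out" from the PySpark driver with Python's subprocess module. The command path is the one quoted above and exists only on Dataproc nodes; the function name get_cluster_name is a hypothetical helper, not part of any Dataproc API.

```python
import subprocess

# Command quoted in the answer above; only present on Dataproc cluster nodes.
METADATA_COMMAND = [
    "/usr/share/google/get_metadata_value",
    "attributes/dataproc-cluster-name",
]


def get_cluster_name(command=METADATA_COMMAND):
    """Run the metadata command and return its standard output as text.

    check=True raises CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Inside the job, cluster_name = get_cluster_name() would then hold the name. The command argument is overridable only so the helper can be exercised off-cluster.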

Using airflow to run spark streaming jobs?

In our Hadoop cluster we have Spark batch jobs and Spark Streaming jobs.
We would like to schedule and manage them both on the same platform.
We came across Airflow, which fits our need for a
"platform to author, schedule, and monitor workflows".
I just want to be able to stop and start a Spark Streaming job. Using Airflow's graphs and profiling is less of an issue.
My question is:
Besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark Streaming jobs?
I came across this question:
Can airflow be used to run a never ending task?
which says that it's possible, but not why you shouldn't.
@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, then a second DAG as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
Alternatively you can run the stream with a trigger interval of once[1].
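# Load your streaming DataFrame (assumes an existing SparkSession `spark` and a schema `my_schema`)
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)
# Perform transformations and then write…
# The file sink needs a checkpoint location unless spark.sql.streaming.checkpointLocation is set; "/out/checkpoint" is a placeholder path
sdf.writeStream.trigger(once=True).start(path="/out/path", format="parquet", checkpointLocation="/out/checkpoint")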
This gives you all the same benefits of spark streaming, with the flexibility of batch processing.
You can simply point the stream at your data and this job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, then terminate. You could trigger your airflow DAG's schedule to suit whatever lag you'd like to process data at (every minute, hour, etc.).
I wouldn't recommend this for low-latency requirements, but it's well suited to being run every minute.
[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
Using Airflow's branching functionality, we can have one DAG that both schedules and monitors our streaming job. The DAG checks the status of the application and, if the application is not running, submits the streaming job. Otherwise the DAG run can simply finish, or you can add a sensor that checks the streaming job's status after some time, with alerts and whatever else you need.
There are two main problems:
Submitting the streaming application without waiting for it to finish. Otherwise our operator will run until it reaches execution_timeout.
That problem can be solved by submitting our streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false.
Checking the status of our streaming application.
We can check the streaming application's status using YARN. For example, we can use the command yarn application -list -appStates RUNNING. If our application is in the list of running applications, we should not trigger our streaming job. The only requirement is that the streaming job's name be unique.
There are no strict reasons why you shouldn't use Airflow to run Spark Streaming job. In fact you can monitor your process by periodically logging some metrics with
LOG.info(query.lastProgress)
LOG.info(query.status)
and see them in the task log.
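The status check and the non-blocking submit described above could be sketched roughly as follows. This assumes that yarn application -list prints one tab-separated row per application, with the application ID in the first column and the name in the second. All function names are hypothetical, and the job name and file passed in are placeholders.

```python
import subprocess


def list_running_output():
    """Return the raw output of `yarn application -list -appStates RUNNING`."""
    result = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def running_app_names(yarn_list_output):
    """Collect application names from rows whose first column is an application ID."""
    names = set()
    for line in yarn_list_output.splitlines():
        columns = [column.strip() for column in line.split("\t")]
        # Assumed layout: Application-Id <tab> Application-Name <tab> ...
        if len(columns) > 1 and columns[0].startswith("application_"):
            names.add(columns[1])
    return names


def should_submit(app_name, yarn_list_output):
    """Submit only when no running application already has this (unique) name."""
    return app_name not in running_app_names(yarn_list_output)


def build_submit_command(app_name, application_file):
    """Build a spark-submit command that returns without waiting for completion."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        application_file,
    ]
```

In a branching DAG, the result of should_submit(name, list_running_output()) would decide whether the task that runs build_submit_command is executed or skipped.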

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it will finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of spark / yarn / emr setting I could use.
Note: I've tried using spark speculation to unblock the stuck spark job, but that doesn't help.
EMR has a Bootstrap Actions feature where you can run scripts that start up when initializing the cluster. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates itself after a certain time.
I use a script based off this one for the bootstrap action. https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically make a script that checks /proc/uptime to see how long the EC2 machine has been online and after uptime surpasses your time limit you can send a shutdown command to the cluster.
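A rough Python sketch of the uptime check (the linked script is a shell script; this is not a copy of it). It assumes the standard Linux /proc/uptime format, where the first field is the number of seconds since boot. The function names and the 5-hour limit in the usage note are illustrative assumptions, and the actual termination step is left out because it depends on your setup.

```python
def read_uptime_seconds(uptime_text):
    """Parse the first field of /proc/uptime content: seconds since boot."""
    return float(uptime_text.split()[0])


def over_time_limit(uptime_text, limit_seconds):
    """Return True once the machine has been up longer than the limit."""
    return read_uptime_seconds(uptime_text) > limit_seconds


def node_over_limit(limit_seconds, uptime_path="/proc/uptime"):
    """Read the uptime file on this node and compare it with the limit."""
    with open(uptime_path) as uptime_file:
        return over_time_limit(uptime_file.read(), limit_seconds)
```

A bootstrap script could call node_over_limit(5 * 3600) in a loop with a sleep between checks and issue the shutdown command once it returns True.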

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs, I want to use Apache Livy, but I'm not sure whether it is a good idea to use it or to run spark-submit instead.
What is the best way to track a Spark job using Airflow once I have submitted it?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the web UI (View Logs)
Other considerations
Livy doesn't support reusing a SparkSession for POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
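The submit-and-poll approach described above might look roughly like this, using only the standard library. It relies on Livy's POST /batches and GET /batches/{id}/state endpoints. The set of terminal states, the helper names, and the URL, JAR path and class name in the usage note are assumptions for illustration.

```python
import json
import time
import urllib.request

# Assumed set of states after which polling can stop.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def send_json(method, url, body=None):
    """Send one JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def submit_batch(livy_url, payload, send=send_json):
    """POST /batches and return the batch ID assigned by Livy."""
    return send("POST", f"{livy_url}/batches", payload)["id"]


def wait_for_batch(livy_url, batch_id, send=send_json,
                   poll_seconds=30, sleep=time.sleep):
    """Poll GET /batches/{id}/state until a terminal state is reported."""
    while True:
        state = send("GET", f"{livy_url}/batches/{batch_id}/state")["state"]
        print(f"Livy batch {batch_id}: {state}")  # printed output lands in the task log
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

For example, batch_id = submit_batch("http://livy-host:8998", {"file": "/path/to/app.jar", "className": "com.example.Main"}) followed by wait_for_batch("http://livy-host:8998", batch_id), where the host, path and class name are placeholders. An Airflow task could fail itself whenever the returned state is anything other than "success".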
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in the property or in the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can schedule a GenerateFlowFile processor (on the primary node only, so it won't be scheduled twice, unless you want that), connect it to the ExecuteProcess processor, and make that run the spark-submit command.
For a slightly more complex workflow, I've written an article about it :)
Hope it will help.
