How to Launch a Spark Job in EMR creation with terraform - apache-spark

My use case is the following: via Terraform, I want to create an EMR cluster, start a Spark job, and terminate the cluster when the job is finished.
I found this step mechanism in Terraform documentation (https://www.terraform.io/docs/providers/aws/r/emr_cluster.html#step-1) but I didn't find any example for a Spark Job on Google (an
Maybe I'm doing something wrong, because my use case seems pretty simple, but I can't find another way to do it.
Thanks for your help

I finally found it.
With the step block it is possible to launch a Spark job from a JAR stored in S3:
step {
  action_on_failure = "TERMINATE_CLUSTER"
  name              = "Launch Spark Job"

  hadoop_jar_step {
    jar  = "command-runner.jar"
    args = ["spark-submit", "--class", "com.mycompany.App", "--master", "yarn", "s3://my_bucket/my_jar_with_dependencies.jar"]
  }
}
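For comparison, the same step can be written as the dictionary that boto3's EMR client accepts in the `Steps` list of `run_job_flow` or `add_job_flow_steps`. This is only a sketch: the helper name `spark_submit_step` is made up for illustration, and the class name and JAR path are the placeholders from the answer above.

```python
def spark_submit_step(main_class, jar_uri, name="Launch Spark Job"):
    """Build an EMR step dict that runs spark-submit through command-runner.jar.

    `spark_submit_step` is a hypothetical helper, not part of boto3.
    """
    return {
        "Name": name,
        # Same failure policy as the Terraform step block.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", main_class,
                "--master", "yarn",
                jar_uri,
            ],
        },
    }


# Placeholder values taken from the Terraform example.
step = spark_submit_step(
    "com.mycompany.App",
    "s3://my_bucket/my_jar_with_dependencies.jar",
)
print(step["HadoopJarStep"]["Args"])
```

To have the cluster shut down once the last step completes, the `aws_emr_cluster` resource also exposes `keep_job_flow_alive_when_no_steps`, which would be set to `false` for this use case.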

Related

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the spark history server for DAG visualizations, and job errors, etc.
For example, if the job failed due to heap error, or Fetchfailed, etc, I can see it clearly specified in the spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use PySpark and set the log level as follows:
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but since it's a bit long to fit in a comment, I'm posting it here as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that EMR saves to S3.
To get the Spark History Server logs written to S3, you'll have to add some configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
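Since the event log directory and the history server directory must point at the same place, it can help to generate the fragment from a single value. A minimal sketch, assuming the result is passed as the `Configurations` argument of boto3's `run_job_flow`; the helper name `event_log_configurations` is hypothetical and the bucket is the placeholder from above.

```python
def event_log_configurations(log_dir):
    """Return a spark-defaults classification that writes event logs to log_dir.

    Hypothetical helper: it only builds the dict, it does not talk to AWS.
    """
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.eventLog.enabled": "true",
                # Both properties must reference the same location so the
                # history server can read what the applications write.
                "spark.eventLog.dir": log_dir,
                "spark.history.fs.logDirectory": log_dir,
            },
        }
    ]


configurations = event_log_configurations("s3a://your_bucket/spark_logs")
print(configurations[0]["Properties"])
```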
I found this interesting post, which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.

Unable to gracefully finish an Airflow DAG

I have a spark-streaming job that runs on EMR, scheduled by Airflow. We want to gracefully terminate this EMR cluster every week.
But when I send a kill or SIGTERM signal to the running spark-streaming application, the task is reported as "failed" in the Airflow DAG. This prevents the DAG from moving further and the next run from triggering.
Is there any way either to kill the running spark-streaming app to mark success or to let the DAG complete even though it sees the task as failed?
For the first part, can you share your code that kills the Spark app? I think you should be able to have this task return success and have everything downstream "just work".
I'm not too familiar with EMR, but looking at the docs, it seems that "job flow" is their name for the Spark cluster. In that case, are you using the built-in EmrTerminateJobFlowOperator?
I wonder whether the task fails because the terminating cluster propagates an error code back. Also, is it possible that the cluster is failing to terminate and your code is raising an exception, leading to a failed task?
To answer the second part, if you have multiple upstream tasks, you can use an alternate trigger rule on the operator to determine which downstream tasks run.
class TriggerRule(object):
    ALL_SUCCESS = 'all_success'
    ALL_FAILED = 'all_failed'
    ALL_DONE = 'all_done'
    ONE_SUCCESS = 'one_success'
    ONE_FAILED = 'one_failed'
    DUMMY = 'dummy'
https://github.com/apache/incubator-airflow/blob/master/airflow/utils/trigger_rule.py
https://github.com/apache/incubator-airflow/blob/master/docs/concepts.rst#trigger-rules
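To make the effect of these rules concrete, here is a deliberately simplified model of how a downstream task decides whether to run. It is not Airflow's implementation: the function `should_run` and the reduced set of states ("success", "failed", "running") are assumptions made for illustration only.

```python
def should_run(trigger_rule, upstream_states):
    """Toy model of trigger rules; upstream_states is a list of state strings."""
    succeeded = [state == "success" for state in upstream_states]
    failed = [state == "failed" for state in upstream_states]
    finished = [state in ("success", "failed") for state in upstream_states]

    decisions = {
        "all_success": all(succeeded),
        "all_failed": all(failed),
        "all_done": all(finished),      # runs whatever the outcome was
        "one_success": any(succeeded),
        "one_failed": any(failed),
        "dummy": True,                  # dependencies are ignored
    }
    return decisions[trigger_rule]


# The streaming task was killed, so it is reported as failed.
states = ["failed"]
print(should_run("all_success", states))
print(should_run("all_done", states))
```

In a real DAG the rule is set through the `trigger_rule` argument of the downstream operator, for example `trigger_rule="all_done"`, so that the task runs even when the killed streaming task is marked as failed.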

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but after several days I still have no clear approach.
My use case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need the application ID printed to the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it is printed.)
Points I need to consider while creating a solution: I am not supposed to change the application code (it would take a long time, because there are many applications running); I can only provide log4j properties or some custom code.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener, and I am able to get the Spark application ID after the application finishes, but not on the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified an HDFS directory for the event logs in the spark-submit command.
If anyone could help me find an approach to the solution, it would be really helpful. Or, if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to get a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted by a Client thread on the machine from which the user is submitting. I was passing the log4j configuration to the driver and the executors, but I had missed that the log4j configuration for this "Client" was not set.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
Client mode means the Spark driver runs on the same machine you ran spark-submit from.
Cluster mode means the Spark driver runs somewhere out on the cluster.
You mentioned that the ID is logged when you run the app in client mode and that you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
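Whichever place the client or driver logs end up in, pulling the ID out of them is a small parsing job, because YARN application IDs have the form `application_<cluster timestamp>_<sequence number>`. A minimal sketch; the function name and the sample log line are made up for illustration.

```python
import re

# YARN application IDs look like application_1500000000000_0042.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def extract_application_id(text):
    """Return the first YARN application ID found in text, or None."""
    match = APP_ID_PATTERN.search(text)
    return match.group(0) if match else None


# Illustrative log line, not copied from a real run.
sample = "INFO impl.YarnClientImpl: Submitted application application_1500000000000_0042"
print(extract_application_id(sample))
```

The same function could be applied to the captured output of spark-submit once the client logging is routed to the console, or to aggregated log files.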

pass custom exitcode from yarn-cluster mode spark to CLI

I started a yarn cluster mode spark job through spark-submit.
To indicate partial failure and similar conditions, I want to pass an exit code from the driver to the script that calls spark-submit.
I tried both System.exit and throwing SparkUserAppException in the driver, but in both cases the CLI only got 1, not the exit code I passed.
I think it is impossible to pass a custom exit code, since any exit code passed by the driver is converted to a YARN status, and YARN converts any failed exit code to 1 or "failed".
By looking at spark code, I can conclude this:
It is possible in client mode. Look at the runMain() method of the SparkSubmit class.
In cluster mode, however, it is not possible to get the exit status of the driver, because your driver class runs on the cluster (inside the YARN ApplicationMaster container), not on the machine that ran spark-submit.
There is an alternative solution that may or may not be suitable for your use case:
Host a REST API with an endpoint to receive the status update from your driver code. In the case of any exceptions, let your driver code use this endpoint to update the status.
You can save the exit code in an output file (on HDFS or the local FS) and make your script wait for this file to appear, then read it and proceed. This is definitely not an elegant way, but it may help you proceed.
When saving the file, pay attention to the permissions on this location. Your Spark process must have read/write access.
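The calling side of the file-based approach can be sketched as a polling loop. This version assumes the driver writes the exit code as plain text to a path on a file system the script can read directly; the function name, path handling, and timeout values are illustrative. For HDFS, the read would have to go through an HDFS client instead of `open`.

```python
import os
import time


def wait_for_exit_code(path, timeout_s=600.0, poll_s=1.0):
    """Poll until the file at path contains an integer exit code, then return it."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as status_file:
                content = status_file.read().strip()
            # An empty file may mean the driver is still writing; keep waiting.
            if content:
                return int(content)
        time.sleep(poll_s)
    raise TimeoutError("no exit code written to %s within %.0f s" % (path, timeout_s))
```

The wrapper script would call spark-submit first, then `sys.exit(wait_for_exit_code(...))` to hand the driver's code back to its own caller.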

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer apart from links and references, and most of those are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a spark-sql query. My understanding is that the submit-job script is asynchronous, hence this won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app like I do in a self-contained Spark application, i.e. create a context, set the master URL, and do what I need to do? Will this work in a web app? If yes, then when would I need the job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
Through a REST API like Livy (Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or a Spark server (REST APIs). See how they connect to Spark interactively using a kernel:
https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s
https://developer.ibm.com/open/apache-toree/
Through JDBC (running via the Thrift JDBC/ODBC server).
Through SSH: submit a job and wait for the YARN status (that is, SSH to the cluster and do a spark-submit through YARN; YARN gives you an application ID, and you can keep track of the application status with the yarn application -status command).
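For the Livy option, a web app submits a batch job by POSTing a JSON document to the server's `/batches` endpoint. The sketch below only builds the request and does not send it; the host name is a placeholder, the helper name is hypothetical, and the JAR path and class name are borrowed from the SparkLauncher example above.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, class_name, args=()):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {
        "file": app_file,          # JAR visible to the cluster
        "className": class_name,   # main class of the Spark application
        "args": list(args),
    }
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# livy.example.com is a placeholder host; 8998 is Livy's default port.
request = build_livy_batch_request(
    "http://livy.example.com:8998",
    "/my/app.jar",
    "my.spark.app.Main",
)
print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request)` returns a JSON description of the batch, including its `id` and `state`, which the web app can then poll at `/batches/{id}` instead of blocking on the job.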

Resources