I want to call multiple Spark jobs using spark-submit within a single EMR cluster. Does EMR support this?
How can I achieve this?
At the moment I use AWS Lambda to invoke an EMR job for my Spark job, but we would like to extend this to multiple Spark jobs within a single EMR cluster.
You can run multiple Spark jobs on one EMR cluster sequentially - that is, the next job is launched after the previous job completes. This is done using EMR steps.
I used the Java SDK to run this, but you can see in this documentation how to add a step using the CLI only.
My code below uses spark-submit, but it is not run directly the way you would run it in the CLI. Instead I ran it as a shell script and included an environment variable for HADOOP_USER_NAME, so the Spark job runs under the username I specify. You can skip this if you want to run the job under the username you logged into your EMR cluster with (hadoop, by default).
In the code excerpt below the object emr is of type AmazonElasticMapReduce, provided by the SDK. If you're using the CLI approach you will not need it.
Some assisting methods like uploadConfFile are self-explanatory. I used an extensive configuration for the Spark application, and unlike the files and jars, which can be local or in S3/HDFS, the configuration file must be a local file on the EMR cluster itself.
When you finish, you will have created a step on your EMR cluster that will launch a new spark application. You can specify many steps on your EMR which will run one after the other.
//Upload the spark configuration you wish to use to a local file
uploadConfFile(clusterId, sparkConf, confFileName);
//create a list of arguments - which is the complete command for spark-submit
List<String> stepargs = new ArrayList<String>();
//start with an envelope to specify the hadoop user name
stepargs.add("/bin/sh");
stepargs.add("-c");
//call spark-submit via the shell, passing the remaining list entries as its arguments ("$@")
stepargs.add("HADOOP_USER_NAME="+task.getUserName()+" spark-submit \"$@\"");
//this "sh" becomes $0 for the shell command above; everything after it is passed to spark-submit
stepargs.add("sh");
//add the spark-submit arguments
stepargs.add("--class");
stepargs.add(mainClass);
stepargs.add("--deploy-mode");
stepargs.add("cluster");
stepargs.add("--master");
stepargs.add("yarn");
stepargs.add("--files");
//a comma-separated list of file paths in s3
stepargs.add(files);
stepargs.add("--jars");
//a comma-separated list of file paths in s3
stepargs.add(jars);
stepargs.add("--properties-file");
//the file we uploaded to the EMR, with its full path
stepargs.add(confFileName);
stepargs.add(jar);
//add the arguments for your application jar here
AddJobFlowStepsResult result = emr.addJobFlowSteps(new AddJobFlowStepsRequest()
.withJobFlowId(clusterId)
.withSteps(new StepConfig()
.withName(name)
.withActionOnFailure(ActionOnFailure.CONTINUE)
.withHadoopJarStep(new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(stepargs))));
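If you would rather submit the same kind of step from Python (for example from the AWS Lambda function mentioned in the question), here is a minimal sketch using boto3; the cluster id, S3 paths, class name and region are placeholder assumptions you would replace with your own values.
import boto3

# Sketch: add a spark-submit step to an existing EMR cluster via boto3.
# The cluster id, jar/file locations, class name and region below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

step_args = [
    "spark-submit",
    "--class", "com.example.MainClass",
    "--deploy-mode", "cluster",
    "--master", "yarn",
    "--files", "s3://my-bucket/conf/app.conf",
    "--jars", "s3://my-bucket/libs/dep1.jar,s3://my-bucket/libs/dep2.jar",
    "s3://my-bucket/jars/my-spark-app.jar",
    # application-specific arguments go here
]

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your cluster id
    Steps=[{
        "Name": "my-spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": step_args,
        },
    }],
)
print(response["StepIds"])
You can call add_job_flow_steps repeatedly (or pass several entries in Steps) to queue multiple Spark jobs that will run one after the other on the same cluster.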
I am trying to read data from a Kafka topic using Spark Structured Streaming. The Kafka brokers are SSL enabled, so I need to install/import the private CA certificate into the TrustStore file present on the Spark driver and executors.
I cannot use a separate step to import the certificate before the main spark-submit command, because the Spark script is dynamically submitted (downloaded from S3). This Spark script from S3 contains the information about where the private CA certificate file (.pem) is located (in a separate S3 location).
I looked up ways to do this. Most of the solutions require an RDD or DataFrame to be created and either a map or a mapPartitions function to be called on it (essentially defining the partitions). But that is a circular dependency for me: I can neither create a DataFrame or RDD without first importing the private CA certificate, nor import the CA certificate without creating a DataFrame or RDD.
I can create a dummy DataFrame and try to distribute it to all executors, but this solution will not always work (e.g. what if an executor node crashes and then recovers, or what if the DataFrame is not properly distributed to all executor nodes because of partitioning algorithm limitations).
Can anyone suggest a better way to execute a small function on Driver and all the executors without creating DataFrame or RDD?
If you are running your Spark application on AWS EMR, then your problem can be handled by a bootstrap action on the EMR cluster.
From the official documentation on bootstrap actions, you will find this:
You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on clusters after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data. If you add nodes to a running cluster, bootstrap actions also run on those nodes in the same way. You can create custom bootstrap actions and specify them when you create your cluster.
You can make these scripts run on the driver node, the executor nodes, or both, depending on the use case. By default, they run on all the instances in the EMR cluster.
You can either place the bootstrap script on S3 or paste the entire script while creating the cluster from the AWS console. I personally prefer placing the script in S3 and specifying its file path in the bootstrap action while launching the EMR cluster.
Now, to fulfill your use case, you can put the logic of downloading the CA certificate into a script, along with any other custom logic that you want executed on all the nodes in the cluster.
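As a rough sketch of what such a bootstrap script could look like, here is a Python version (a plain shell script with the AWS CLI and keytool would work just as well). The bucket/key of the .pem file, the truststore path, the alias and the password are assumptions to replace with your own values, and it assumes boto3 and Java's keytool are available on the node.
#!/usr/bin/env python
# Bootstrap-action sketch: download a private CA certificate from S3 and
# import it into a truststore on every node of the cluster.
# Bucket, key, truststore path, alias and password are placeholders.
import subprocess
import boto3

s3 = boto3.client("s3")
s3.download_file("my-bucket", "certs/private-ca.pem", "/tmp/private-ca.pem")

subprocess.check_call([
    "keytool", "-importcert", "-noprompt", "-trustcacerts",
    "-alias", "private-ca",
    "-file", "/tmp/private-ca.pem",
    "-keystore", "/home/hadoop/kafka.client.truststore.jks",
    "-storepass", "changeit",
])
Because the script runs on every node (driver and executors alike), the truststore then exists everywhere at the same local path, and the Kafka SSL options of your Spark job can simply point to it.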
Suppose I run a PySpark job using a Dataproc workflow template and an ephemeral cluster... How can I get the name of the cluster that was created from inside my PySpark job?
One way would be to fork out and run this command:
/usr/share/google/get_metadata_value attributes/dataproc-cluster-name
The only output will be the cluster name, without any newline characters or anything else to clean up. See Running shell command and capturing the output
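From a PySpark job that could look like the small sketch below (it simply shells out to the command above and captures its output):
import subprocess

def get_cluster_name():
    # Fork out to the Dataproc metadata helper and capture its stdout.
    output = subprocess.check_output(
        ["/usr/share/google/get_metadata_value", "attributes/dataproc-cluster-name"]
    )
    return output.decode("utf-8").strip()

cluster_name = get_cluster_name()
print(cluster_name)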
I want to use Airflow for orchestration of jobs that include running some Pig scripts, shell scripts and Spark jobs.
Mainly for the Spark jobs, I want to use Apache Livy, but I am not sure whether it is a good idea to use it or to run spark-submit directly.
What is the best way to track a Spark job using Airflow once I have submitted it?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the Airflow web UI (View Logs)
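A minimal sketch of that polling loop with the requests library; the Livy host, the application JAR location, the class name and the poll interval are assumptions:
import time
import requests

# Sketch: submit a batch to Livy, then poll its state and print it until the job ends.
# The Livy host, jar location and class name are placeholders.
LIVY_URL = "http://livy-host:8998"

resp = requests.post(
    LIVY_URL + "/batches",
    json={"file": "s3://my-bucket/jars/my-spark-app.jar",
          "className": "com.example.MainClass"},
)
batch_id = resp.json()["id"]

while True:
    state = requests.get(LIVY_URL + "/batches/{}/state".format(batch_id)).json()["state"]
    print("Livy batch {} state: {}".format(batch_id, state))  # shows up in the Airflow task log
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)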
Other considerations
Livy doesn't support reusing a SparkSession across POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests instead
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR
I am trying to execute Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS, and I am creating Dataproc cluster on the fly and then executing this jar on the newly created Dataproc cluster.
I am able to execute this with DataProcSparkOperator of Airflow with default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory etc.).
The Airflow documentation didn't help. I also tried many things, but nothing worked out.
Help is appreciated.
To configure the Spark job through DataProcSparkOperator you need to use the dataproc_spark_properties parameter.
For example, you can set deployMode like this:
DataProcSparkOperator(
    dataproc_spark_properties={'spark.submit.deployMode': 'cluster'})
In this answer you can find more details.
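For completeness, a slightly fuller sketch of the same idea (assuming the Airflow 1.x contrib operator); the cluster name, jar URI and property values are placeholders, and driver/executor sizing is passed as Spark properties instead of spark-submit flags:
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

# Sketch: pass Spark job properties to a Dataproc Spark job.
# Cluster name, jar URI and property values are placeholders; dag is your existing DAG object.
submit_spark_job = DataProcSparkOperator(
    task_id="submit_spark_job",
    cluster_name="my-ephemeral-cluster",
    main_jar="gs://my-bucket/jars/my-spark-app.jar",
    dataproc_spark_properties={
        "spark.submit.deployMode": "cluster",
        "spark.driver.memory": "4g",
        "spark.executor.memory": "4g",
    },
    dag=dag,
)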
We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to schedule Spark jobs on Airflow, or an option to run them on a different server running the master.
Answer to this will be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and YARN client configuration set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is that it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, Spark runs local to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath.
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using the SSH protocol via the paramiko library), such as spark-submit. The benefit of this approach is that you don't need to copy the hdfs-site.xml or maintain any files.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
I personally prefer SSHOperator :)
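For instance, a minimal sketch of option (2) with the SSHOperator; the connection id, the schedule and the spark-submit command are placeholders you would adapt:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

# Sketch of option (2): run spark-submit on a remote edge node over SSH.
# The ssh_conn_id, schedule and command below are placeholders.
dag = DAG(
    "spark_ssh_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

submit_job = SSHOperator(
    task_id="spark_submit_over_ssh",
    ssh_conn_id="ssh_spark_edge_node",  # defined under Admin -> Connections in Airflow
    command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "--class com.example.MainClass /home/hadoop/my-spark-app.jar"
    ),
    dag=dag,
)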