How to pass Spark job properties to DataProcSparkOperator in Airflow? - apache-spark

I am trying to execute Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS, and I am creating Dataproc cluster on the fly and then executing this jar on the newly created Dataproc cluster.
I am able to execute this with DataProcSparkOperator of Airflow with default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory etc.).
From documentation of airflow didn't got any help. Also tried many things but didn't worked out.
Help is appreciated.

To configure Spark job through DataProcSparkOperator you need to use dataproc_spark_properties parameter.
For example, you can set deployMode like this:
DataProcSparkOperator(
dataproc_spark_properties={ 'spark.submit.deployMode': 'cluster' })
In this answer you can find more details.

Related

Google cloud dataproc --files is not working

I want to copy some property files to master and workers while submitting spark job,
so as stated in the doc I am using --files to copy the files on executors working directory.
but below command is not copying anything in executors working directory. So anybody have idea please share.
gcloud dataproc jobs submit spark --cluster=cluster-name --class=dataproc.codelab.word_count.WordCount --jars=gs://my.jar --region=us-central1 --files=gs://my.properties -- gs://my/input/ gs://my/output3/
According to official Spark documentation, when Spark is running on Yarn, the Spark executor will use local directory configured for Yarn as working directory, which is by default - /hadoop/yarn/nm-local-dir/usercache/{userName}/appcache/{applicationId}.
So based on you description if it dose show up there then it's working as expected.

Is there a way to submit spark job on different server running master

We have a requirement to schedule spark jobs, since we are familiar with apache-airflow we want to go ahead with it to create different workflows. I searched web but did not find a step by step guide to schedule spark job on airflow and option to run them on different server running master.
Answer to this will be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you have a spark-submit binary and YARN client config setup on our Airflow server. It invokes the spark-submit command with given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using SSH protocol via paramiko library) like spark-submit. The benefit of this approach is you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to have REST calls.
I personally prefer SSHOperator :)

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark Jobs using SSHOperator and running spark-submit command which would save you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task that I perform is running Spark jobs, a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

How to submit job(jar) to the Azure Spark cluster through commandline interface?

I am new to HDInsight Spark, I am trying to run a use-case to learn how things work in Azure Spark cluster. This is what I have done so far.
Able to create azure spark cluster.
Create jar by following steps as described in the link: create standalone scala application to run on HDInsight Spark cluster. I have used the same scala code as given in the link.
ssh into head node
upload jar to the blob storage using link: using azure CLI with azure storage
copy zip to machine
hadoop fs -copyToLocal
I have checked that the jar gets uploaded to the headnode(machine).
I want to run that jar and get the results as stated in the link given in
point 2 above.
What will be the next step? How can I submit spark job and get results using command line interface?
For example considering you are created jar for program submit.jar. In order to submit this to your cluster with dependency you can use below syntax.
spark-submit --master yarn --deploy-mode cluster --packages "com.microsoft.azure:azure-eventhubs-spark_2.11:2.2.5" --class com.ex.abc.MainMethod "wasb://space-hdfs#yourblob.blob.core.windows.net/xx/xx/submit.jar" "param1.json" "param2"
Here --packages :is to include dependency to you program, you can use --jars option and then followed by jar path. --jars "path/to/dependency/abc.jar"
--class : Main method of your program
after that specify path for your program jar.
you can pass parameters with if you needed as shown above
A couple of options on submitting spark jars:
1) If you want to submit the job on the headnode already, you can use spark-submit
See Apache submit jar documentation
2) An easier alternative is to submit spark jar via livy after uploading the jar to wasb storage.
See submit via livy doc. You can skip step 5 if you do it this way.

Submitting Spark Job On Scheduler Pool

I am running a spark streaming job on cluster mode , i have created a pool with memory of 200GB(CDH). I wanted to run my spark streaming job on that pool, i tried setting
sc.setLocalProperty("spark.scheduler.pool", "pool")
in code but its not working and i also tried the
spark.scheduler.pool seems not working in spark streaming, whenever i run the job it goes in the default pool. What would be the possible issue? Is there any configuration i can add while submitting the job?
In yarn we can add the
--conf spark.yarn.queue="que_name"
to the spark-submit command . Then it will use that particular queue and its resources only.
I ran into this same issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.

Resources