How to run a Spark job in cluster mode in GCP?

In GCP, we want to run a Spark job in cluster mode on a Dataproc cluster. Currently we are using the following command:
gcloud dataproc jobs submit spark --cluster xxxx-xxxx-dataproc-cluster01 --region us-west2 --xxx.xxxx.xxx.xxx.xxx.xxx.xxxx.xxxx --jars gs://xxx-xxxx-poc/cluster-compute/lib/xxxxxxxx-cluster-computation-jar-0.0.1-SNAPSHOT-allinone.jar --properties=spark:spark.submit.deployMode=cluster --properties=spark.driver.extraClassPath=/xxxx/xxxx/xxxx/ -- -c xxxxxxxx -a
However, with the above, the job is being submitted in local mode. We need it to run in cluster mode.

You can run it in cluster mode by specifying the following: --properties spark.submit.deployMode=cluster
In your example the deployMode doesn't look correct:
--properties=spark:spark.submit.deployMode=cluster
The spark: prefix looks extra.
Here is the entire command for the job submission:
gcloud dataproc jobs submit pyspark --cluster XXXXX --region us-central1 --properties="spark.submit.deployMode=cluster" gs://dataproc-examples/pyspark/hello-world/hello-world.py
Below is the screenshot of the job running in cluster mode
Update
To pass multiple properties, the dataproc jobs submit command looks like this:
gcloud dataproc jobs submit pyspark --cluster cluster-e0a0 --region us-central1 --properties="spark.submit.deployMode=cluster","spark.driver.extraClassPath=/xxxxxx/configuration/cluster-mode/" gs://dataproc-examples/pyspark/hello-world/hello-world.py
On running the job, the screenshot below shows that deployMode is Cluster and the extra classpath is also set.
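If you want to confirm from inside the job which deploy mode actually took effect, a minimal Scala sketch (the object name VerifyDeployMode is made up for illustration) is to read the property back from the driver's SparkConf:

import org.apache.spark.sql.SparkSession

object VerifyDeployMode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("verify-deploy-mode").getOrCreate()
    // spark.submit.deployMode is "cluster" when the driver runs on a worker node,
    // and "client" (the default) when it runs on the submitting machine.
    val mode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
    println(s"Deploy mode: $mode")
    spark.stop()
  }
}

In cluster mode this line ends up in the driver's YARN container log rather than in the gcloud output.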

If you want to run the Spark job through Cloud Shell, use the command below:
gcloud dataproc jobs submit spark --cluster cluster-test \
--class org.apache.spark.examples.xxxx \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

Related

hadoop multi node with spark sample job

I have just configured Spark on my Hadoop cluster and I want to run the Spark sample job.
Before that, I want to understand what the job command below stands for.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
You can see all possible parameters for submitting a Spark job here. I have summarized the ones in your submit script below:
spark-submit
--deploy-mode client # client or cluster; the default is client. Whether to run your driver on the worker nodes (cluster) or locally on the submitting machine (client)
--class org.apache.spark.examples.SparkPi # the entry point (main class) of your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 # the jar file path and the application arguments (here, 10)
--master is another parameter usually defined in submit scripts. For my HDP cluster, the default value of master is yarn. You can see all possible values for --master in the Spark documentation.
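To make --class more concrete, below is a minimal Scala sketch of what such an entry point looks like; the object name MySparkPi and the estimation logic are illustrative, not the actual SparkPi source:

import org.apache.spark.sql.SparkSession

object MySparkPi {
  def main(args: Array[String]): Unit = {
    // The trailing "10" in the submit command arrives here as args(0).
    val slices = if (args.nonEmpty) args(0).toInt else 2
    val spark = SparkSession.builder().appName("MySparkPi").getOrCreate()
    val n = 100000 * slices
    // Monte Carlo estimate of Pi, spread over `slices` partitions.
    val inside = spark.sparkContext.parallelize(1 to n, slices).filter { _ =>
      val x = scala.util.Random.nextDouble() * 2 - 1
      val y = scala.util.Random.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }.count()
    println(s"Pi is roughly ${4.0 * inside / n}")
    spark.stop()
  }
}

spark-submit starts the JVM, loads the jar, and calls MySparkPi.main with everything after the jar path as arguments.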

Spark job not showing up on standalone cluster GUI

I am playing with running Spark jobs in my lab and have a three-node standalone cluster. When I execute a new job on the master node via the CLI
spark-submit sparktest.py --master spark://myip:7077
while the job completes as expected, it does not show up at all on the cluster GUI. After some investigation, I added --master to the submit command, but to no avail. During job execution, as well as after completion, when I navigate to http://mymasternodeip:8080/
none of these jobs appear under Running Jobs or Completed Jobs. Any thoughts as to why the jobs don't show up would be appreciated.
You should specify the --master flag (and the remaining flags/options) before the application file: anything after sparktest.py is treated as an argument to your script rather than to spark-submit, so the master falls back to local.
spark-submit --master spark://myip:7077 sparktest.py
Also make sure that you don't override the master config in your code while creating the SparkSession object: either provide the same master URL in code, or don't set it there at all, as in the sketch below.
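For example, a minimal Scala sketch of a SparkSession that leaves the master to spark-submit (the same idea applies to the PySpark builder in sparktest.py; the appName is arbitrary):

import org.apache.spark.sql.SparkSession

// No .master(...) here: a hard-coded .master("local[*]") would silently
// override the --master spark://myip:7077 passed to spark-submit.
val spark = SparkSession.builder()
  .appName("sparktest")
  .getOrCreate()

spark.range(1000).count()  // a trivial job so the application shows up in the master UI
spark.stop()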

Submitting Job Arguments to Spark Job in Dataproc

Trying to run Spark-Wiki-Parser on a GCP Dataproc cluster. The code takes two arguments, "dumpfile" and "destloc". When I submit the following, I get [scallop] Error: Excess arguments provided: 'gs://enwiki-latest-pages-articles.xml.bz2 gs://output_dir/'.
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- 'gs://enwiki-latest-pages-articles.xml.bz2' 'gs://output_dir/'
How do I get the code to recognize the input arguments?
I spent probably 8 hours figuring this out, but figured I'd dump the solution here since it had not been shared yet.
The gcloud CLI separates the Dataproc parameters from the class arguments with --, as noted by another user. However, Scallop also requires a -- prefix before each named argument. Your CLI call should look something like this:
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- --dumpfile 'gs://enwiki-latest-pages-articles.xml.bz2' --destloc 'gs://output_dir/'
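For context, this is roughly what a Scallop configuration declaring those two options looks like in Scala; it is only a sketch, and the actual class in Spark-Wiki-Parser may differ. Each opt becomes a named --dumpfile / --destloc flag, which is why the bare positional values were rejected as excess arguments:

import org.rogach.scallop.{ScallopConf, ScallopOption}

class BuildConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  // Declared as named options, so they must be passed as --dumpfile <value> --destloc <value>.
  val dumpfile: ScallopOption[String] = opt[String](required = true)
  val destloc: ScallopOption[String] = opt[String](required = true)
  verify()
}

// In main(args: Array[String]):
// val conf = new BuildConf(args)
// conf.dumpfile()  // "gs://enwiki-latest-pages-articles.xml.bz2"
// conf.destloc()   // "gs://output_dir/"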
It seems like the Scala class needs dumpfile and destloc as args.
Could you run the following command instead and see if it works?
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- dumpfile gs://enwiki-latest-pages-articles.xml.bz2 destloc gs://output_dir/

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
Inside my query is a simple Spark SQL job where I read Parquet files, run simple queries, and write out Parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply prints, repeatedly:
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I need some feedback on when the process will finish. I have tried the following:
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query, that progress bar is useless.
I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete, and even then there is no progress bar in the logs.
Is there a way to launch a Spark query in a jar on a cluster and have a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
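As a sketch of those two conditions in Scala code (assuming you submit with --deploy-mode client so the bar is printed on your own terminal; Spark 2.x with its bundled log4j 1.x API is assumed here):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyQuery")
  .config("spark.ui.showConsoleProgress", "true")  // the property that enables ConsoleProgressBar
  .getOrCreate()

// Keep the SparkContext logger at WARN so INFO lines don't crowd out the progress bar.
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)

Alternatively, pass --conf spark.ui.showConsoleProgress=true to spark-submit instead of setting it in code.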

add file to spark driver classpath file on dataproc

I need to add a config file to the Spark driver classpath on Google Dataproc.
I have tried to use the --files option of gcloud dataproc jobs submit spark, but this does not work.
Is there a way to do it on Google Dataproc?
In Dataproc, anything listed as a --jar will be added to the classpath, and anything listed as a --file will be made available in each Spark executor's working directory. Even though the flag is --jars, it should be safe to put non-jar entries in this list if you require the file to be on the classpath.
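As an illustration of the --files part, here is a minimal Scala sketch that reads a file shipped with --files from inside the job, resolving its staged location with SparkFiles (the file name config.properties is hypothetical):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

// Submitted with: gcloud dataproc jobs submit spark ... --files gs://<BUCKET>/config.properties
val props = new Properties()
val in = new FileInputStream(SparkFiles.get("config.properties"))
try props.load(in) finally in.close()

If the file really has to be on the driver's classpath (for example, a config that a library loads via the classloader), that is when the --jars approach above applies.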
I know I am answering too late; posting for new visitors.
One can execute this using Cloud Shell; I have tested this:
gcloud dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster=<cluster_name> --class com.test.PropertiesFileAccess --region=<CLUSTER_REGION> --files gs://<BUCKET>/prod.predleads.properties --jars gs://<BUCKET>/snowflake-common-3.1.34.jar
