Submitting Job Arguments to Spark Job in Dataproc - apache-spark

Trying to run Spark-Wiki-Parser on a GCP Dataproc cluster. The code takes in two arguments "dumpfile" and "destloc". When I submit the following I get a [scallop] Error: Excess arguments provided: 'gs://enwiki-latest-pages-articles.xml.bz2 gs://output_dir/'.
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- 'gs://enwiki-latest-pages-articles.xml.bz2' 'gs://output_dir/'
How do I get the code to recognize the input arguments?

I spent probably 8 hours figuring this out, but figured I'd dump the solution here since it had not been shared yet.
The gcloud CLI separates the Dataproc parameters from the class arguments with --, as noted by another user. However, Scallop also requires a -- prefix before each named argument. Your CLI should look something like this:
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- --dumpfile 'gs://enwiki-latest-pages-articles.xml.bz2' --destloc 'gs://output_dir/'
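If you want to confirm that Scallop now sees both flags, you can stream the driver output for the submitted job; a minimal sketch, assuming the job ID that gcloud prints on submission (placeholder below):
gcloud dataproc jobs wait <job-id> --project $CLUSTER_PROJECT --region $CLUSTER_REGION
The Scallop "Excess arguments" error should no longer appear in the driver log once both named flags are passed.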

It seems like the Scala class needs dumpfile and destloc as args.
Could you run the following command instead and see if it works?
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- dumpfile gs://enwiki-latest-pages-articles.xml.bz2 destloc gs://output_dir/

Related

Spark in AKS. Error: Could not find or load main class org.apache.spark.launcher.Main

Update 1: After adding the missing pieces and environment variables from Spark installation - Error: Could not find or load main class org.apache.spark.launcher.Main, the command no longer throws an error, but it just prints itself and does nothing else. This is the new result of running the command:
"C:\Program Files\Java\jdk1.8.0_271\bin\java" -cp "C:\Users\xxx\repos\spark/conf\;C:\Users\xxx\repos\spark\assembly\target\scala-2.12\jars\*" org.apache.spark.deploy.SparkSubmit --master k8s://http://127.0.0.1:8001 --deploy-mode cluster --conf "spark.kubernetes.container.image=xxx.azurecr.io/spark:spark2.4.5_scala2.12.12" --conf "spark.kubernetes.authenticate.driver.serviceAccountName=spark" --conf "spark.executor.instances=3" --class com.xxx.bigdata.xxx.XMain --name xxx_app https://storage.blob.core.windows.net/jars/xxx.jar
I have been following this guide for setting up Spark in AKS: https://learn.microsoft.com/en-us/azure/aks/spark-job. I am using Spark tag 2.4.5 with Scala 2.12.12. I have done all of the following steps:
created AKS with ACR and Azure Storage, a service account and a role
built Spark from source
built the Docker image and pushed it to ACR
built the sample SparkPi jar and pushed it to storage
proxied the api-server (kubectl proxy) and executed spark-submit:
./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name xxx_app \
--class com.xxx.bigdata.xxx.XMain \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=xxx.azurecr.io/spark:spark2.4.5_scala2.12.12 \
"https://storage.blob.core.windows.net/jars/xxx.jar"
All I am getting is Error: Could not find or load main class org.apache.spark.launcher.Main
Now, the funny thing is that it doesn't matter at all what I change in the command. I can mess up ACR address, spark image name, jar location, api-server address, anything, and I still get the same error.
I guess I must be making some silly mistake as it seems nothing can break the command more than it already is, but I can't really nail it down.
Does someone have some ideas what might be wrong?
Looks like it might be a problem on the machine where you are executing spark-submit: some jars may be missing from its classpath. Worth checking out Spark installation - Error: Could not find or load main class org.apache.spark.launcher.Main
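One quick sanity check (just a sketch, assuming the repo layout shown in the submit output above) is to confirm that the launcher jar actually exists in the assembly directory that spark-submit puts on the classpath:
ls assembly/target/scala-2.12/jars/ | grep spark-launcher
If nothing is listed, the Spark build did not produce the assembly jars, and spark-submit cannot load org.apache.spark.launcher.Main.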
Alright, so I managed to submit jobs with spark-submit.cmd instead. It works without any additional setup.
I didn't manage to get the bash script to work in the end, and I don't have the time to investigate it further at the moment. So, sorry for providing only a partial answer that doesn't fully resolve the original problem, but it is a solution nonetheless.
The below command works fine
bin\spark-submit.cmd --master k8s://http://127.0.0.1:8001 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace=dev --conf spark.kubernetes.container.image=xxx.azurecr.io/spark:spark-2.4.5_scala-2.12_hadoop-2.7.7 https://xxx.blob.core.windows.net/jars/SparkPi-assembly-0.1.0-SNAPSHOT.jar

How to Run a spark job in cluster mode in GCP?

In GCP, we want to run a Spark job in cluster mode on a Dataproc cluster. Currently we are using the following command:
gcloud dataproc jobs submit spark --cluster xxxx-xxxx-dataproc-cluster01 --region us-west2 --xxx.xxxx.xxx.xxx.xxx.xxx.xxxx.xxxx --jars gs://xxx-xxxx-poc/cluster-compute/lib/xxxxxxxx-cluster-computation-jar-0.0.1-SNAPSHOT-allinone.jar --properties=spark:spark.submit.deployMode=cluster --properties=spark.driver.extraClassPath=/xxxx/xxxx/xxxx/ -- -c xxxxxxxx -a
However, using the above, the job is being submitted in local mode. We need it to run in cluster mode.
You can run it in cluster mode by specifying the following: --properties spark.submit.deployMode=cluster
In your example the deployMode doesn't look correct.
--properties=spark:spark.submit.deployMode=cluster
Looks like spark: is extra.
Here is the entire command for the job submission:
gcloud dataproc jobs submit pyspark --cluster XXXXX --region us-central1 --properties="spark.submit.deployMode=cluster" gs://dataproc-examples/pyspark/hello-world/hello-world.py
Below is the screenshot of the job running in cluster mode
Update
To pass multiple properties, below is the Dataproc job submit command:
gcloud dataproc jobs submit pyspark --cluster cluster-e0a0 --region us-central1 --properties="spark.submit.deployMode=cluster","spark.driver.extraClassPath=/xxxxxx/configuration/cluster-mode/" gs://dataproc-examples/pyspark/hello-world/hello-world.py
On running the job, below is the screenshot showing that the deployMode is Cluster and the extra class path is also set.
If you want to run the Spark job through Cloud Shell, use the command below:
gcloud dataproc jobs submit spark --cluster cluster-test \
--class org.apache.spark.examples.xxxx --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

PySpark Job fails with workflow template

To follow up on this question, I decided to try the workflow template API.
Here's what it looks like:
gcloud beta dataproc workflow-templates create lifestage-workflow --region europe-west2
gcloud beta dataproc workflow-templates set-managed-cluster lifestage-workflow \
--master-machine-type n1-standard-8 \
--worker-machine-type n1-standard-16 \
--num-workers 6 \
--cluster-name lifestage-workflow-cluster \
--initialization-actions gs://..../init.sh \
--zone europe-west2-b \
--region europe-west2
gcloud beta dataproc workflow-templates add-job pyspark gs://.../main.py \
--step-id prediction \
--region europe-west2 \
--workflow-template lifestage-workflow \
--jars gs://.../custom.jar \
--py-files gs://.../jobs.zip,gs://.../config.ini \
-- --job predict --conf config.ini
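For completeness, the workflow is then run by instantiating the template; a sketch using the same template name and region as above:
gcloud beta dataproc workflow-templates instantiate lifestage-workflow --region europe-west2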
The template is correctly created.
The job works when I run it manually from one of my already existing clusters. It also runs when I use an existing cluster instead of asking the workflow to create one.
The thing is I want the cluster to be created before running the job and deleted just after, that's why I'm using a managed cluster.
But with the managed cluster I just can't make it run. I tried to use the same configuration as my existing clusters but it doesn't change anything.
I always get the same error.
Any idea why my job runs perfectly except when it is run from a generated cluster?
The problem came from the version of the managed cluster.
By default the image version was 1.2.31 and my existing cluster was using image 1.2.28. When I changed the config to add --image-version=1.2.28, it worked.
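For reference, a sketch of where the flag goes, added to the set-managed-cluster call from the question (the other flags are omitted here for brevity):
gcloud beta dataproc workflow-templates set-managed-cluster lifestage-workflow \
--cluster-name lifestage-workflow-cluster \
--image-version 1.2.28 \
--region europe-west2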
Dataproc image 1.2.31 upgraded Spark to 2.2.1, which introduced [SPARK-22472]:
SPARK-22472: added null check for top-level primitive types. Before this release, for datasets having top-level primitive types with null values, it might return some unexpected results. For example, let's say we have a parquet file with schema <a: Int>, and we read it into Scala Int. If column a has null values, some unexpected value can be returned when a transformation is applied.
This likely added just enough generated code to push classes over the 64k limit.

How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

I am following the instructions for starting a Google DataProc cluster with an initialization script to start a jupyter notebook.
https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud
How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly pyspark)?
The answer depends slightly on which jars you're looking to load. For example, you can use spark-xml with the following when creating a cluster:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1
To specify multiple Maven coordinates, you will need to swap the gcloud dictionary separator character from ',' to something else (as we need to use that to separate the packages to install):
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3
Details on how escape characters are changed can be found in gcloud:
$ gcloud help topic escaping
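As a concrete illustration of the escaped form, here is what two coordinates might look like (the spark-avro coordinate is only an example, not something from the original question):
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties=^#^spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-avro_2.11:4.0.0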

Why does Spark (on Google Dataproc) not use all vcores?

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below.
Based on some other questions like this and this, I have set up the cluster to use the DominantResourceCalculator to consider both vCPUs and memory for resource allocation:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
But when I submit my job with custom Spark flags, it looks like YARN doesn't respect these custom parameters and defaults to using memory as the yardstick for resource calculation:
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.sql.broadcastTimeout=900,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
src/my_python_file.py
Can somebody help figure out what's going on here?
What I did wrong was to add the configuration yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator to YARN instead of to capacity-scheduler.xml (where it rightly belongs) during cluster creation.
Second, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was initially set to 1.
I'm not sure whether one or both of these changes led to the solution (I will update soon). My new cluster creation command looks like this:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
First, as you have dynamic allocation enabled, you should set the properties spark.dynamicAllocation.maxExecutors and spark.dynamicAllocation.minExecutors (see https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation)
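A sketch of how those could be added to the submit command from the question (the executor counts below are placeholders, not recommendations):
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.dynamicAllocation.enabled=true,spark.dynamicAllocation.minExecutors=2,spark.dynamicAllocation.maxExecutors=20 \
src/my_python_file.py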
Second, make sure you have enough partitions in your Spark job. As you are using dynamic allocation, YARN only allocates just enough executors to match the number of tasks (partitions). So check in the Spark UI whether your jobs (more specifically: stages) have more tasks than you have vCores available.
