Rest to equivalent unix command for dataproc job submit spark - apache-spark

I have configuration and cluster set in GCP and i can submit a spark job, but I am trying to run cloud dataproc job submit spark from my CLI for the same configuration.
I've set the service account in my local, I am just unable to build the equivalent command for the console configuration.
console config:
"sparkJob": {
"mainClass": "main.class",
"properties": {
"spark.executor.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties",
"spark.driver.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties"
},
"jarFileUris": [
"gs://my_jar.jar"
],
"args": [
"arg1",
"arg2",
"arg3"
]
}
And the equivalent command that I built is-
cloud dataproc job submit spark
-t spark
-p spark.executor.extraJavaOptions:-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions-DARGO_ENV_FILE=gs://file.properties
-m main.class
-c my_cluster
-f gs://my_jar.jar
-a ‘arg1’,‘arg2’,‘arg3’
It's not reading the file.properties files and giving this error-
error while opening file spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: error: open spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: no such file or directory
And when I run command without mentioning the -p (properties) flag and those files, it runs but eventually fails because of those missing properties files.
Where I am doing something wrong, I can't figure it out.
ps: I'm trying to run dataproc command from CLI something like a spark-submit command-
spark-submit --conf "spark.driver.extraJavaOptions=-Dkafka.security.config.filename=file.properties"
--conf "spark.executor.extraJavaOptions=-Dkafka.security.config.filename=file.properties"
--class main.class my_jar.jar
--arg1
--arg2
--arg3

Related

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes but I'm facing difficulties. I used the helm chart here:
https://github.com/bitnami/charts/tree/main/bitnami/spark
I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')
Here's the part that's not entirely clear. Where should the file people.json actually live? I have it locally where I'm running the python code and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.
When I run the 3rd line as simply 'people.json' then it starts running but errors out with:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If I run it as '/sparkdata/people.json' then I get
pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json
Not sure where I go from here. To be clear I want it to read files from the PVC. It's an NFS share that has the data files on it.
Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:
having some kind of network/cloud drive that each pod can access
mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
The latter option might be the simpler to set up. This page discusses in more detail how you could do this, but we can shortly go to the point. If you add the following arguments to your spark-submit you should be able to get your people.json on your driver + executors (you just have to choose sensible values for the $VAR variables in there):
--files people.json \
--conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
You can always verify the existence of your data by going inside of the pods themselves like so:
kubectl exec -it <driver/executor pod name> bash
(now you should be inside of a bash process in the pod)
cd <mount-path-you-chose>
ls -al
That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
Hope this helps!

AWS EMR - ModuleNotFoundError: No module named 'arrow'

i'm running into this issue when trying to upgrade to Python3.9 for our EMR jobs using Pyspark 3.0.1/ EMR release 6.2.1. I've created the EMR Cluster using a bootstrap script and here are spark environment variables that were set:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip
I've installed all the application dependency libs using a shell script and are located in /home/ec2-user. However, when I try to spark submit a job with following command by user hadoop, i'm seeing the "ModuleNotFoundError".
Spark-submit cmd:
/bin/sh -c "MYAPP_ENV=dev PYSPARK_PYTHON=/usr/local/bin/python3 PYTHONHASHSEED=0 SETUPTOOLS_USE_DISTUTILS=stdlib spark-submit --master yarn --deploy-mode client --jars /home/hadoop/ext_lib/*.jar --py-files /home/hadoop/myapp.zip --conf spark.sql.parquet.compression.codec=gzip --conf spark.executorEnv.MYAPP_ENV=dev /home/hadoop/myapp/oasis/etl/spark/daily/run_daily_etl.py '--lookback_days' '1' '--s3_file_system' 's3'"
Error: ModuleNotFoundError: No module named 'arrow'
However, the same works when we use the EMR cluster settings with "EMR Release label:emr-5.28.0 and Spark 2.4.4.
Can someone provide help on identifying the cause as I'm fully stuck with this. I suspect it may be due to the access of ec2-user home folder from hadoop user.
Thanks

How to Run a spark job in cluster mode in GCP?

In GCP, we want to run a spark job in cluster mode on a data[proc cluster. Currently we are using the following command:-
gcloud dataproc jobs submit spark --cluster xxxx-xxxx-dataproc-cluster01 --region us-west2 --xxx.xxxx.xxx.xxx.xxx.xxx.xxxx.xxxx --jars gs://xxx-xxxx-poc/cluster-compute/lib/xxxxxxxx-cluster-computation-jar-0.0.1-SNAPSHOT-allinone.jar --properties=spark:spark.submit.deployMode=cluster --properties=spark.driver.extraClassPath=/xxxx/xxxx/xxxx/ -- -c xxxxxxxx -a
However using above the job is being submitted in local mode. We need to run in cluster mode.
You can run it in cluster mode by specifying the following --properties spark.submit.deployMode=cluster
In your example the deployMode doesn't look correct.
--properties=spark:spark.submit.deployMode=cluster
Looks like spark: is extra.
Here is the entire command for the job submission
gcloud dataproc jobs submit pyspark --cluster XXXXX --region us-central1 --properties="spark.submit.deployMode=cluster" gs://dataproc-examples/pyspark/hello-world/hello-world.py
Below is the screenshot of the job running in cluster mode
Update
To pass multiple properties below is the dataproc job submit
gcloud dataproc jobs submit pyspark --cluster cluster-e0a0 --region us-central1 --properties="spark.submit.deployMode=cluster","spark.driver.extraClassPath=/xxxxxx/configuration/cluster-mode/" gs://dataproc-examples/pyspark/hello-world/hello-world.py
On running the job below is the screenshot which shows the deployMode is Cluster and the extra class path is also set
If want to run the spark job through cloud shell use below command
gcloud dataproc jobs submit spark --cluster cluster-test
-- class org.apache.spark.examples.xxxx --jars file:///usr/lib/spark/exampleas/jars/spark-examples.jar --1000

Apache Toree to connect to a remote spark cluster

Is there a way to connect Apache Toree to a remote spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I go about using spark on a remote server without having to install locally?
There is indeed a way of getting Toree to connect to a remote Spark cluster.
The easiest way I've discovered is to clone the existing Toree Scala/Python kernel, and create a new Toree Scala/Python Remote kernel. That way you can have the choice of running locally or remotely.
Steps:
Make a copy of the existing kernel. On my particular Toree install, the path to the Kernels was located at: /usr/local/share/jupyter/kernels/, so I performed the following command:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/
Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <path> is required, but you can also add --num-executors, --executor-memory, etc to the variable as well.
Restart Jupyter.
My kernel.json file looks like this:
{
"display_name": "Toree - Scala Remote",
"argv": [
"/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
"--profile",
"{connection_file}"
],
"language": "scala",
"env": {
"PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
"SPARK_HOME": "/opt/spark",
"DEFAULT_INTERPRETER": "Scala",
"PYTHON_EXEC": "python",
"__TOREE_OPTS__": "",
"__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
}
}
This is a possible example with some intuitive details for ANY remote cluster install. For my remote cluster, which is a Cloudera 5.9.2 these are specific steps. (You can also use this example to install with non-Cloudera clusters with some smart edits.)
With OS/X to build CDH version (skip if using a distribution):
Goto https://github.com/Myllyenko/incubator-toree and clone this repo
Download Docker
Setup 'signing' - It's been a some time since I set this up - you'll need to sign the build above. TBD
'new branch git', edit the .travis.xml, README.md, and build.sbt files to change 5.10.x to 5.9.2
Start Docker, cd within the make release dir, build with make release, wait, wait, sign 3 builds
Copy the file ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to your spark-shell machine that can reach your YARN-controlled Spark cluster
Merge, commit, etc your repo to your master repo if this will be mission critical
Spark Machine Installs:
Warning: Some steps may need to be done as root as a last resort
Install pip / anaconda (see other docs)
Install Jupyter sudo pip install jupyter
Install toree sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2 or use the apache-toree distribution
Configure Toree to run with Jupyter (example):
Edit & add to ~/.bash_profile
echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH
export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH
export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)
export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"
export SPARK_OPTS="--master yarn-client --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels --conf spark.scheduler.mode=FAIR --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 --conf spark.default.parallelism=20 --conf spark.driver.maxResultSize=1g --conf spark.driver.memory=1g --conf spark.executor.cores=4 --conf spark.executor.instances=5 --conf spark.executor.memory=1g --packages $SPARK_PKGS --jars $SPARK_JARS"
function jti() {
jupyter toree install \
--replace \
--user \
--kernel_name="CDH 5.9.2 Toree" \
--debug \
--spark_home=${SPARK_HOME} \
--spark_opts="$SPARK_OPTS" \
--log-level=0
}
function jn() {
jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
If you want a different port to hit Toree - now is your chance to edit 8888
Log out of your Toree / spark-shell machine
ssh back to that machine ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming that 8888 is the port in the bash file)
I expect as a user (not root) you can type jti to install Toree into Jupyter (Note: understanding this step may help to install other kernels into Jupyter - sidebar: #jamcom mentioned
the produced file, but this step automatically produces this part. The file is buried in your home dir's tree as a user rather than root.
As user, type jn to start a Jupyter Notebook. Wait a few seconds until the browser url is available and paste that URL into your browser.
You now have Jupyter running and so pick a new CDH 5.9.2 Toree or the version you installed. This launches a new browser window. Since you have some Toree experience, pick something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) in order to get the lazily instantiated spark context going. Be really patient as your jobs is submitted to the cluster and your may have to wait a long time if your cluster is busy or a little while for your job to process in the cluster.
Tips and Tricks:
I ran into an issue on the first run and the subsequent runs never saw that issue. (The issue issue might be fixed in the github)
Sometimes, I have to kill the old 'Apache Toree' app on YARN to start a new Toree.
Sometimes, my VM can has an orphaned JVM. If you get memory errors starting a Jupyter Notebook/ Toree or have unexpectedly disconnected, check your process list with top. And ... kill the extra JVM (be careful ID-ing your lost process).

Spark streaming on dataproc throws FileNotFoundException

When I try to submit a spark streaming job to google dataproc cluster, I get this exception:
16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
...
16/12/13 00:44:20 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#d7bffbc{HTTP/1.1}{0.0.0.0:4040}
16/12/13 00:44:20 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
16/12/13 00:44:20 ERROR org.apache.spark.util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1360)
...
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
Full output here.
It seems this error happens when hadoop configuration is not correctly defined in spark-env.sh - link1, link2
Is it configurable somewhere? Any pointers on how to resolve it?
Running the same code in local mode works fine:
sparkConf.setMaster("local[4]")
For additional context: the job was invoked like this:
gcloud dataproc jobs submit spark \
--cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars gs://my-bucket/resources/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
This is the boilerplate setup code:
lazy val conf = {
val c = new SparkConf().setAppName(this.getClass.getName)
c.set("spark.ui.port", (4040 + scala.util.Random.nextInt(1000)).toString)
if (isLocal) c.setMaster("local[4]")
c.set("spark.streaming.receiver.writeAheadLog.enable", "true")
c.set("spark.streaming.blockInterval", "1s")
}
lazy val ssc = if (checkPointingEnabled) {
StreamingContext.getOrCreate(getCheckPointDirectory, createStreamingContext)
} else {
createStreamingContext()
}
private def getCheckPointDirectory: String = {
if (isLocal) localCheckPointPath else checkPointPath
}
private def createStreamingContext(): StreamingContext = {
val s = new StreamingContext(conf, Seconds(batchDurationSeconds))
s.checkpoint(getCheckPointDirectory)
s
}
Thanks in advance
Is it possible that this wasn't the first time you ran the job with the given checkpoint directory, as in the checkpoint directory already contains a checkpoint?
This happens because the checkpoint hard-codes the exact jarfile arguments used to submit the YARN application, and when running on Dataproc with a --jars flag pointing to GCS, this is actually syntactic sugar for Dataproc automatically staging your jarfile from GCS into a local file path /tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar that's only used temporarily for the duration of a single job-run, since Spark isn't able to invoke the jarfile directly out of GCS without staging it locally.
However, in a subsequent job, the previous tmp jarfile will already be deleted, but the new job tries to refer to that old location hard-coded into the checkpoint data.
There are also additional issues caused by hard-coding in the checkpoint data; for example, Dataproc also uses YARN "tags" to track jobs, and will conflict with YARN if an old Dataproc job's "tag" is reused in a new YARN application. To run your streaming application, you'll need to first clear out your checkpoint directory if possible to start from a clean slate, and then:
You must place the job jarfile somewhere on the master node before starting the job, and then your "--jar" flag must specify "file:///path/on/master/node/to/jarfile.jar".
When you specify a "file:///" path dataproc knows its already on the master node so it doesn't re-stage into a /tmp directory, so in that case it's safe for the checkpoint to point to some fixed local directory on the master.
You can do this either with an init action or you can submit a quick pig job (or just ssh into the master and download that jarfile):
# Use a quick pig job to download the jarfile to a local directory (for example /usr/lib/spark in this case)
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "fs -cp gs://my-bucket/resources/skyfall-assembly-0.0.1.jar file:///usr/lib/spark/skyfall-assembly-0.0.1.jar"
# Submit the first attempt of the job
gcloud dataproc jobs submit spark --cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars file:///usr/lib/spark/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
Dataproc relies on spark.yarn.tags under the hood to track YARN applications associated with jobs. However, the checkpoint holds a stale spark.yarn.tags which causes Dataproc to get confused with new applications that seem to be associated with old jobs.
For now, it only "cleans up" suspicious YARN applications as long as the recent killed jobid is held in memory, so rebooting the dataproc agent will fix this.
# Kill the job through the UI or something before the next step.
# Now use "pig sh" to restart the dataproc agent
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "sh systemctl restart google-dataproc-agent.service"
# Re-run your job without needing to change anything else,
# it'll be fine now if you ever need to resubmit it and it
# needs to recover from the checkpoint again.
Keep in mind though that by nature of checkpoints this means you won't be able to change the arguments you pass on subsequent runs, because the checkpoint recovery is used to clobber your command-line settings.
You can also run the job in yarn cluster mode to avoid adding jar to your master machine. The potential trade off is the spark driver will run in worker node instead of the master.

Resources