PySpark virtual environment archive on S3 - apache-spark

I'm trying to deploy PySpark applications to an EMR cluster that have various, differing, third-party dependencies, and I am following this blog post, which describes a few approaches to packaging a virtual environment and distributing that across the cluster.
So, I've made a virtual environment with virtualenv, used venv-pack to create a tarball of the virtual environment, and I'm trying to pass that as an --archives argument to spark-submit:
spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.python=./venv/bin/python \
--archives s3://path/to/my/venv.tar.gz#venv \
s3://path/to/my/main.py
This fails with Cannot run program "./venv/bin/python": error=2, No such file or directory. Without the spark.pyspark.python option, the job fails with import errors, so my question is mainly about the syntax of this command when the archive is a remote object.
I can run a job that has no extra dependencies and a main method that's located remote to the cluster in S3, so I know at least something on S3 can be referenced (much like a Spark application JAR, which I'm much more familiar with). The problem is the virtual environment. I've found much literature about this, but it's all where the virtual environment archive is physically on the cluster. For many reasons, I would like to avoid having to copy virtual environments to the cluster every time a developer makes a new application.
Can I reference a remote archive? If so, what is the syntax for this and what other configuration options might I need?
I don't think it should matter, but just in case, I'm using a Livy client to submit this job remotely (the thing analogous to the above spark-submit).
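Since the job goes through Livy, the spark-submit above corresponds roughly to the body of a POST /batches request. The sketch below is untested against EMR and is an assumption, not a confirmed fix. It keeps the "#venv" alias on the archive and, because the driver runs inside a YARN container in cluster mode, sets the interpreter through spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON instead of spark.pyspark.python. The helper name build_livy_batch is hypothetical.

```python
import json


def build_livy_batch(main_py, venv_archive, alias="venv"):
    """Build a JSON-serialisable body for Livy's POST /batches endpoint.

    Assumption: the cluster runs on YARN, so the archive is unpacked into a
    directory named `alias` inside each container's working directory.
    """
    python_path = f"./{alias}/bin/python"
    return {
        "file": main_py,
        # The "#alias" suffix names the directory the archive is unpacked into.
        "archives": [f"{venv_archive}#{alias}"],
        "conf": {
            # In cluster mode the driver lives in the application master.
            "spark.yarn.appMasterEnv.PYSPARK_PYTHON": python_path,
            "spark.executorEnv.PYSPARK_PYTHON": python_path,
        },
    }


payload = build_livy_batch(
    "s3://path/to/my/main.py",
    "s3://path/to/my/venv.tar.gz",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the request body by whichever Livy client is in use; only the "file", "archives" and "conf" fields are filled in here.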

Related

Spark on kubernetes with zeppelin

I am following this guide to run up a zeppelin container in a local kubernetes cluster set up using minikube.
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up zeppelin and run some sample code there. I have downloaded spark 2.4.5 & 2.4.0 source code and built it for kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once spark is built I created a docker container as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured zeppelin to use the spark image which was built with kubernetes support. The article above explains that the spark interpreter will auto configure spark on kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with Spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to configure Spark to run with the spark.jars.ivy /tmp/.ivy setting. I also tried including it in spark-defaults.conf when building Spark, but that does not seem to work either.
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem. The workaround I used for setting spark.jars.ivy=/tmp/.ivy was to pass it through an environment variable instead.
In your spark interpreter settings, add the following property: SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This should pass additional options to spark submit and your job should continue.

how to add third party library to spark running on local machine

I am listening to an Event Hub stream. I followed an article on attaching the library to a cluster (Databricks), and my code runs fine there.
For debugging I am running the code on my local machine/cluster, but it fails because the library is missing. How can I add a library when running on a local machine?
I tried sparkContext.addFile(fullPathToJar), but I still get the same error.
You can use spark-submit --packages
Example: spark-submit --packages org.postgresql:postgresql:42.1.1
You would need to find the package that you are using and check the compatibility with spark.
With a single jar file you'd use spark-submit --jars instead.
I used spark-submit --packages {package} and it works.
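To make the difference between the two flags concrete, here is a small sketch that assembles a spark-submit command line for a local run. The helper build_spark_submit and the script name my_app.py are hypothetical. It relies only on the fact that --packages takes a comma-separated list of Maven coordinates and --jars a comma-separated list of jar paths.

```python
import shlex


def build_spark_submit(app, packages=(), jars=(), master="local[*]"):
    """Return the spark-submit argument list for a local debugging run."""
    cmd = ["spark-submit", "--master", master]
    if packages:
        # Maven coordinates (group:artifact:version), resolved at launch.
        cmd += ["--packages", ",".join(packages)]
    if jars:
        # Paths to jar files that already exist on disk.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(app)
    return cmd


cmd = build_spark_submit(
    "my_app.py",  # hypothetical application script
    packages=["org.postgresql:postgresql:42.1.1"],
)
# shlex.join quotes arguments such as local[*] so the line can be pasted
# into a shell.
print(shlex.join(cmd))
```

The returned list can also be handed directly to subprocess.run if the job should be launched from Python.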

Getting Started with Mobius SparkClr (on Linux)

I am looking to try the C# driver with an existing (stand alone) spark cluster (on Ubuntu Linux) which I interact happily with via python or scala.
I am unclear as to how to run a simple C# example, having downloaded the latest Mobius release to the Linux box. What I am unclear about are the two extra parameters required for the CLR spark submit (over and above the ones that are normally required). I am encountering various errors when I try to follow the submit args as documented (or I have misunderstood the instructions).
Firstly, for --exe, does one simply point to the .exe file, or is it required to pass --exe [mono] [my_app.exe] [params]?
Secondly, remote-spark-clr seems to insist on an HDFS path, but I am running Spark without HDFS. Is HDFS actually necessary?
Thirdly, and related to the second question, if distributing exe/packages for workers, must these also be in an HDFS path or can I put them somewhere sensible on the "regular" file system?
In short, I am looking for confirmation that HDFS is not required and a simple one-liner submit example that can run an exe in some location. The combinations I have tried are not working for me sadly.
Running Mobius on Linux requires a small trick:
Create shell scripts that launch your executables using mono.
Add the extension .exe to your shell scripts so that they are accepted by sparkclr-submit.
Make sure your shell scripts use Unix (LF) line endings; we had issues when they had CRLF line endings.
If your application is called Driver.exe, I recommend creating a file driver.sh.exe with the following content:
#!/bin/sh
exec mono ./Driver.exe "$@"
Similarly, create a file CSharpWorker.sh.exe with the following content:
#!/bin/sh
exec mono ./CSharpWorker.exe "$@"
In your App.config set the following value in appSettings:
<add key="CSharpWorkerPath" value="CSharpWorker.sh.exe"/>
Finally, when submitting your application, use the following arguments:
$SPARKCLR_HOME/scripts/sparkclr-submit.sh \
--master yarn \
--deploy-mode client \
--exe driver.sh.exe \
/path/to/driver
Note that the --exe argument only takes the name of the file; the path is passed as the next argument.
You can place your applications on the regular file system (you don't need to use HDFS), but in my experience Mobius will internally use HDFS to distribute the application to the workers. I don't know if you can avoid it.
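The wrapper scripts described above can also be generated programmatically, which avoids the CRLF problem when they are prepared on a Windows machine. This is a minimal sketch under that assumption. The helper write_mono_wrapper is hypothetical, and it uses "$@" so that every argument is forwarded to mono.

```python
import os
import stat
import tempfile


def write_mono_wrapper(directory, wrapper_name, target_exe):
    """Write a /bin/sh wrapper that runs target_exe under mono."""
    path = os.path.join(directory, wrapper_name)
    script = f'#!/bin/sh\nexec mono ./{target_exe} "$@"\n'
    # newline="\n" keeps LF line endings even when run on Windows.
    with open(path, "w", newline="\n") as handle:
        handle.write(script)
    # Add the executable bits so the wrapper can be launched directly.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


# A temporary directory stands in for the real application directory.
out_dir = tempfile.mkdtemp()
driver = write_mono_wrapper(out_dir, "driver.sh.exe", "Driver.exe")
worker = write_mono_wrapper(out_dir, "CSharpWorker.sh.exe", "CSharpWorker.exe")
print(driver)
print(worker)
```

In practice the directory passed in would be the one that is later given to sparkclr-submit as the path argument.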

Error: Unrecognized option: --packages

I'm porting an existing script from BigInsights to Spark on Bluemix. I'm trying to run the following against Spark on Bluemix:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 --jars ./truststore.jar \
--packages org.elasticsearch:elasticsearch-spark_2.10:2.3.0 \
./export_to_elasticsearch.py ...
However, I get the following error:
Error: Unrecognized option: --packages
How can I pass the --packages parameter?
Bluemix uses a customized Spark version, with a customized spark-submit.sh script that only supports a subset of the original script's parameters. You can see all the configuration properties and parameters you can use in its documentation.
Additionally, you can download the Bluemix version of the script from this link, and there you can see that there is no argument --packages.
Therefore, the problem with your approach is that the Bluemix version of spark-submit does not accept the --packages parameter, probably due to security reasons. However, alternatively, you can download the jar for the package you want (and maybe a fat jar for the dependencies) and upload them using the --jars parameter. Note: To avoid the necessity of uploading the jar files each time you call spark-submit, you can pre-upload them using curl. The details of this procedure can be found on this link.
Adding to Daniel's post, while using the method to pre-upload your package, you might want to upload your package to "${cluster_master_url}/tenant/data/libs", since Spark service sets these four spark properties "spark.driver.extraClassPath", "spark.driver.extraLibraryPath", "spark.executor.extraClassPath", and "spark.executor.extraLibraryPath" to ./data/libs/*
Reference: https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic3.html#spark-submit_properties

How to launch programs in Apache Spark?

I have a “myprogram.py” and a “myprogram.scala” that I need to run on my Spark machine. How can I upload and launch them?
I have been using the shell to do my transformations and call actions, but now I want to launch a complete program on the Spark machine instead of entering single commands every time. I also believe that will make it easier for me to make changes to my program than re-entering commands in the shell.
I did a standalone installation of Spark 1.4.1 on Ubuntu 14.04, on a single machine, not a cluster.
I went through the Spark docs online, but I could only find instructions on how to do that on a cluster. Please help me with that.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html)
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the scala file is compiled, you'll send the resulting jar using the above submit command.
Put simply:
In a Linux terminal, cd to the directory where Spark is unpacked/installed.
Note that this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, run the command:
./bin/spark-submit --master local[*] myprogram.py
For more complete descriptions, see the links that zero323 and ApolloFortyNine provided.
