Add additional jars using PYSPARK_SUBMIT_ARGS - apache-spark

I have the following code to start a Spark session:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(app_name)
spark_session = spark_session.getOrCreate()
sc = spark_session.sparkContext
Now I want to be able to add jars and packages dynamically using PYSPARK_SUBMIT_ARGS, so I set that environment variable to the following value before the code above runs:
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4
But I get the following error:
Error: Missing application resource.
From looking online I know it's because I am explicitly passing jars and packages, so I need to provide the path to my main jar file. But I am confused as to what that should be, since I am just trying to run some code by starting a pyspark shell. I know how to pass these options when starting the shell, but my use case requires doing it through the environment variable, and I have not been able to find an answer to this online.
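There is no answer in this excerpt, but a commonly reported cause of this error (an assumption to verify, not something confirmed above) is that PYSPARK_SUBMIT_ARGS must end with the token pyspark-shell when no application jar is given, so spark-submit knows the "application resource" is the PySpark shell itself. A minimal sketch, with an illustrative app name, set in the same process before the session is created:
import os

# Assumption: the trailing "pyspark-shell" token supplies the missing
# "application resource"; the variable must be set before the first
# SparkContext/SparkSession is created in this Python process.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar "
    "--packages com.databricks:spark-redshift_2.10:2.0.0,"
    "org.apache.spark:spark-avro_2.11:2.4.0,"
    "com.eclipsesource.minimal-json:minimal-json:0.9.4 "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("my_app").getOrCreate()
sc = spark_session.sparkContext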

How to get all pyspark session properties (even the default values)?

My real problem is that I have a SQL query that runs successfully inside a Spark session in a Jupyter notebook, but fails when I submit it using Livy. I need to compare what's different between the sessions, but the values returned from spark.sparkContext.getConf().getAll() are the same.
In a pyspark shell I can get all the properties that were explicitly set with the command:
spark.sparkContext.getConf().getAll()
I can also get a lot of the cluster configurations with this code:
hadoopConf = {prop.getKey(): prop.getValue()
              for prop in spark.sparkContext._jsc.hadoopConfiguration().iterator()}
for i, j in sorted(hadoopConf.items()):
    print(i, '=', j)
But if I try to get the value of a property that wasn't explicitly set:
spark.conf.get("spark.memory.offHeap.size")
I get a java.util.NoSuchElementException: spark.memory.offHeap.size, even though the property has a default value configured in the Spark environment.
Even weirder, for some properties I can get a value even though they aren't listed above:
In [30]: spark.conf.get('spark.sql.shuffle.partitions')
Out[30]: '200'
There are other questions about this, but the answers there don't cover the properties above.
How can I get the default values of these properties from inside a Spark shell?
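No complete answer is included in this excerpt, but two partial workarounds are worth sketching (these are my assumptions, not from the thread): spark.conf.get accepts a fallback value, and SQL configurations expose their own defaults via SET -v, which is also why spark.sql.shuffle.partitions is readable even when it was never set explicitly.
# 1) Avoid the NoSuchElementException with an explicit fallback; note that
#    this returns the fallback you supply, not Spark's internal default.
print(spark.conf.get("spark.memory.offHeap.size", "<not explicitly set>"))

# 2) SQL configurations carry their own defaults; "SET -v" lists each key
#    with its current/default value and a description.
spark.sql("SET -v").show(truncate=False)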

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession called spark. I also have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to reuse the session created in my_python.py. How do I write a spark-submit command that takes my Python file, my jar, and my SparkSession spark as an argument to my jar?
Is that possible?
If not, please share an alternative way to do it.
So I feel there are two questions here:
Q1. How can you reuse an already created Spark session inside your Scala code?
Ans: Inside your Scala code, use the builder to get the existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
Q2. How do you run spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should be something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
If you notice the method name, it is getOrCreate(): if a Spark session has already been created, no new session is created and the existing one is reused.
Check this link for full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
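For a rough idea of what that pattern looks like from the Python side, here is a sketch (the object com.example.MyLogic and its run method are hypothetical, and my_jar.jar is assumed to be passed via --jars as in the command above):
from pyspark.sql import SparkSession

# The session created here is the one the Scala code will pick up via
# SparkSession.builder().getOrCreate().
spark = SparkSession.builder.appName("shared_session").getOrCreate()
df = spark.range(10)

# Call into the jar through the Py4J gateway, handing over the underlying
# Java DataFrame; no second SparkSession is created on the Scala side.
spark._jvm.com.example.MyLogic.run(df._jdf)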

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what's the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set the configuration for all of these variables at once?
EDIT
I need to set the configuration programmatically. How do I change it after I have run spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, so I am going through multiple iterations, changing the Spark configuration and recording the runtimes.
You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set values in spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and essentially static; you want different parameters for different jobs, although you might still want to set some defaults this way.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
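For several of the properties listed in the question, the same idea extends with repeated --conf flags; a sketch (the values and the script name my_job.py are illustrative):
spark-submit \
  --executor-memory 16G \
  --driver-memory 8G \
  --conf spark.executor.instances=4 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  my_job.py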
The problem with defining variables programmatically is that some of them need to be defined at startup time; otherwise precedence rules take over and changes made after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In PySpark, suppose I want to increase the driver and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])
To view the updated settings:
spark.sparkContext._conf.getAll()
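Finally, a minimal PySpark sketch of the stop-and-recreate loop suggested above, for timing runs under different settings (the job body, app name, and memory values are placeholders):
import time
from pyspark.sql import SparkSession

def timed_run(executor_memory):
    # Startup-time properties must be supplied before the context is (re)created.
    spark = (SparkSession.builder
             .appName("tuning-trial")
             .config("spark.executor.memory", executor_memory)
             .getOrCreate())
    start = time.time()
    try:
        spark.range(10 ** 7).selectExpr("sum(id)").collect()  # placeholder job
    finally:
        spark.stop()  # stop so the next trial starts with fresh settings
    return time.time() - start

for mem in ["4g", "8g", "16g"]:
    print(mem, timed_run(mem))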

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub, but when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the following error in my code:
cannot resolve symbol phoenix
I have included below jars in my sbt
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find an example of the suggested method. I cannot find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load, as shown below:
val df = sparkSession.sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("zkUrl", "localhost:2181")
  .option("table", "TABLE1")
  .load()
It's late to answer, but here's what I did to solve a similar problem (a different method not found, plus the deprecation warning):
1) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to my project, along with the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark; Spark 2.0+ is compatible with phoenix-spark 4.10+ (see the phoenix-spark artifact on Maven Central). This resolved the NoSuchMethodError.
2) About load: the load method has long been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see "Load as a DataFrame directly using a Configuration object".

Reading csv files in zeppelin using spark-csv

I want to read CSV files in Zeppelin and would like to use Databricks' spark-csv package: https://github.com/databricks/spark-csv
In the spark-shell, I can use spark-csv with
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
But how do I tell Zeppelin to use that package?
Thanks in advance!
You need to add the Spark Packages repository to Zeppelin before you can use %dep on spark packages.
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
Alternatively, if this is something you want available in all your notebooks, you can add the --packages option to the spark-submit command setting in the interpreters config in Zeppelin, and then restart the interpreter. This should start a context with the package already loaded as per the spark-shell method.
Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
Scroll down to the spark interpreter paragraph and click edit, scroll down a bit to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" or a newer version. Then restart the interpreter when asked.
In the notebook, use something like:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("my_data.txt")
Update:
In the Zeppelin user mailing list, Moon Soo Lee (creator of Apache Zeppelin) stated (Nov. 2016) that users prefer to keep %dep as it allows for:
self-documenting library requirements in the notebook;
per Note (and possible per User) library loading.
The tendency is now to keep %dep, so it should not be considered deprecated at this time.
BEGIN-EDIT
%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.
Read further in this answer only if you are using a Zeppelin version older than 0.6.0.
END-EDIT
You can load the spark-csv package using the %dep interpreter, like this:
%dep
z.reset()
// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")
See Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html
If you have already initialized the Spark context, a quick solution is to restart Zeppelin, execute a Zeppelin paragraph with the code above first, and then execute your Spark code to read the CSV file.
You can add jar files under Spark Interpreter dependencies:
Click 'Interpreter' menu in navigation bar.
Click 'edit' button for Spark interpreter.
Fill artifact and exclude fields.
Press 'Save'.
If you define the following in conf/zeppelin-env.sh:
export SPARK_HOME=<PATH_TO_SPARK_DIST>
Zeppelin will then look in $SPARK_HOME/conf/spark-defaults.conf and you can define jars there:
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
Then look at http://zepplin_url:4040/environment/ for the following:
spark.jars file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
For more reference: https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/interpreter/spark.html
Another solution:
In conf/zeppelin-env.sh (located in /etc/zeppelin for me) add the line:
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
Then start the service.
