How to start sparksession in pyspark

I want to change the default memory, executor and core settings of a Spark session.
The first cell in my PySpark notebook, running in Jupyter on an HDInsight cluster, looks like this:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("Juanita_Smith")\
.config("spark.executor.instances", "2")\
.config("spark.executor.cores", "2")\
.config("spark.executor.memory", "2g")\
.config("spark.driver.memory", "2g")\
.getOrCreate()
After it runs, I read the parameters back, and it looks as if the statement worked.
However, if I look in YARN, the settings have not taken effect.
Which settings or commands do I need to make the session configuration take effect?
Thank you in advance for your help.

By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

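As a rough illustration of the edit described above, here is a minimal Python sketch that appends the four --conf pairs to the PYSPARK_SUBMIT_ARGS entry of a kernel spec. The helper name with_spark_conf and the example spec (including its "--master yarn pyspark-shell" value) are made up for illustration; a real kernel.json may look different, so treat this as a sketch under those assumptions rather than the way HDInsight does it.

```python
import json
import shlex


def with_spark_conf(kernel_spec, conf):
    """Return a copy of a kernel spec whose PYSPARK_SUBMIT_ARGS carries the given --conf pairs."""
    spec = json.loads(json.dumps(kernel_spec))  # cheap deep copy of JSON-like data
    env = spec.setdefault("env", {})
    tokens = shlex.split(env.get("PYSPARK_SUBMIT_ARGS", ""))
    # Assumption: if the kernel already ends its args with "pyspark-shell", keep it last
    tail = []
    if tokens and tokens[-1] == "pyspark-shell":
        tail = [tokens.pop()]
    for key, value in conf.items():
        tokens += ["--conf", f"{key}={value}"]
    env["PYSPARK_SUBMIT_ARGS"] = " ".join(shlex.quote(t) for t in tokens + tail)
    return spec


# Hypothetical kernel spec; the real one comes from the file listed by `jupyter kernelspec list`
example = {
    "display_name": "PySpark",
    "env": {"PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"},
}
updated = with_spark_conf(example, {
    "spark.executor.instances": "2",
    "spark.executor.cores": "2",
    "spark.executor.memory": "2g",
    "spark.driver.memory": "2g",
})
print(json.dumps(updated, indent=2))
```

To apply it to a real file you would json.load the kernel.json, pass the result through the helper, json.dump it back, and then restart the kernel so that a new session is created with the new arguments.
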
Related

Not able to write data in Hive using sparksql

I am loading data from one Hive table to another using Spark SQL. I've created a SparkSession with enableHiveSupport and I'm able to create a table in Hive using Spark SQL, but when I load data from one Hive table to another using Spark SQL I get a permission issue:
Permission denied: user=anonymous,access=WRITE, path="hivepath".
I am running this as the spark user, but I can't understand why it's taking anonymous as the user instead of spark. Can anyone suggest how I should resolve this issue?
I'm using the code below.
sparksession.sql("insert overwrite table dbname.tablename select * from dbname.tablename")

If you're using spark, you need to set username in your spark context.
System.setProperty("HADOOP_USER_NAME","newUserName")
val spark = SparkSession
.builder()
.appName("SparkSessionApp")
.master("local[*]")
.getOrCreate()
println(spark.sparkContext.sparkUser)

First, you may try this for the anonymous user:
root@host:~# su - hdfs
hdfs@host:~$ hadoop fs -mkdir /user/anonymous
hdfs@host:~$ hadoop fs -chown anonymous /user/anonymous
In general, running
export HADOOP_USER_NAME=youruser
before spark-submit will work, along with a spark-submit configuration like the one below:
--conf "spark.yarn.appMasterEnv.HADOOP_USER_NAME=${HADOOP_USER_NAME}" \
Alternatively, you can try
sudo -su username spark-submit --class your class
Note: ideally this user name setting is part of your initial cluster setup; if it is, none of the above is needed and everything works seamlessly.
I personally prefer not to hard-code the user name in the code; it should come from outside the Spark job.

To check which user you are running as, run the command below:
sc.sparkUser
It will show you the current user. You can then set a new user; in Scala, set the user name with:
System.setProperty("HADOOP_USER_NAME","newUserName")

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession 'spark'. I have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to use the same session created in my_python.py. How do I write a spark-submit command that takes my Python file, my jar and my SparkSession 'spark' as arguments to my jar file?
Is it possible?
If not, please share an alternative.

I think there are two questions here.
Q1. How can Scala code reuse an already created Spark session?
Ans: Inside your Scala code, use the builder to get the existing session:
SparkSession.builder().getOrCreate()
As the method name getOrCreate() suggests, if a Spark session has already been created, no new session is created; the existing one is used.
Please check the Spark doc:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
Q2. How do you run spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should look something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
Check this link for a full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what's the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled"
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set the configurations for all of these variables?
EDIT
I need to set the configuration programmatically. How do I change it after I have done spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs for which I am going through multiple iterations changing the spark configuration and recording the runtimes.

You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set the spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, whereas you want different parameters for different jobs. They are still useful for setting up some defaults.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
The problem with defining variables programmatically is that some of them need to be defined at startup time; if they are not, precedence rules take over and changes made after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
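To make the precedence rule concrete, here is a toy Python model of it. This is not how Spark resolves configuration internally; it only mimics the documented order with plain dictionaries, and the function name effective_conf and the three example dictionaries are invented for illustration.

```python
from collections import ChainMap


def effective_conf(spark_conf, submit_flags, defaults_file):
    """Toy model of the precedence rule: the first mapping that defines a key wins."""
    # Highest precedence first: SparkConf, then spark-submit flags, then spark-defaults.conf
    return dict(ChainMap(spark_conf, submit_flags, defaults_file))


# Invented example values for each configuration source
defaults_file = {"spark.executor.memory": "4g", "spark.executor.cores": "2"}
submit_flags = {"spark.executor.memory": "16g"}
spark_conf = {"spark.executor.memory": "24g", "spark.task.cpus": "1"}

merged = effective_conf(spark_conf, submit_flags, defaults_file)
for key in sorted(merged):
    print(key, "=", merged[key])
```

Here spark.executor.memory resolves to the SparkConf value, while spark.executor.cores falls through to the defaults file because nothing above it sets that key. The model says nothing about timing: a value set on SparkConf after the context has started is still ignored.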

In PySpark, suppose I want to increase the driver memory and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])
To view the updated settings:
spark.sparkContext._conf.getAll()
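Since the question asks about iterating over configurations and recording runtimes, one possible shape for that loop in PySpark is sketched below. The names time_configs and run_with_fresh_session, the app name and the placeholder workload are all hypothetical. It assumes that stopping the session and building a new one picks up the new executor settings; driver-side settings such as spark.driver.memory will most likely not change this way, because the driver JVM is already running.

```python
import time


def time_configs(configs, run_job):
    """Call run_job once per configuration and record wall-clock seconds for each."""
    results = []
    for conf in configs:
        start = time.perf_counter()
        run_job(conf)
        results.append((conf, time.perf_counter() - start))
    return results


def run_with_fresh_session(conf):
    """Build a session with the given settings, run a workload, then stop the session."""
    # Imported lazily so the timing helper above can be used without pyspark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("config-sweep")  # hypothetical app name
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        spark.range(1_000_000).count()  # placeholder workload; replace with the real job
    finally:
        spark.stop()  # stop so the next iteration starts a new context


# Usage (needs a working Spark installation):
# timings = time_configs(
#     [{"spark.executor.memory": "16g"}, {"spark.executor.memory": "24g"}],
#     run_with_fresh_session,
# )
```

Keeping the timing loop separate from the session handling makes it possible to try the loop with a dummy job first.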

Change default stack size for spark driver running from jupyter?

I'm running a Python script on a Spark cluster using Jupyter. I want to change the driver's default stack size. I found in the documentation that I can use spark.driver.extraJavaOptions to send any options to the driver JVM, but there is a note in the documentation:
Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-java-options command line option or in your default properties file.
The question is: how do I change a default driver parameter when running from Jupyter?

You can customize the Java options used for the driver by passing spark.driver.extraJavaOptions as a configuration value into the SparkConf, eg:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
.setMaster("spark://spark-master:7077")
.setAppName("MyApp")
.set("spark.driver.extraJavaOptions", "-Xss4M"))
sc = SparkContext.getOrCreate(conf = conf)
Note that in http://spark.apache.org/docs/latest/configuration.html it states about spark.driver.extraJavaOptions:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-java-options command line option or in your default properties file.
However this is talking about the JVM SparkConf class. When it’s set in the PySpark Python SparkConf, that passes it as a command-line parameter to spark-submit, which then uses it when instantiating the JVM, so that comment in the Spark docs does not apply.
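If you control the Python process before any SparkContext exists, another option that follows the documentation's advice is to pass --driver-java-options through the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, assuming PySpark launches its JVM from this process and that the value should end with pyspark-shell; the helper name driver_java_submit_args is made up:

```python
import os
import shlex


def driver_java_submit_args(java_options, extra_args=()):
    """Build a PYSPARK_SUBMIT_ARGS value that passes JVM options to the driver."""
    tokens = ["--driver-java-options", java_options, *extra_args, "pyspark-shell"]
    # Quote each token so options containing spaces survive shell-style splitting
    return " ".join(shlex.quote(t) for t in tokens)


# Must be set before the first SparkContext is created in this Python process
os.environ["PYSPARK_SUBMIT_ARGS"] = driver_java_submit_args("-Xss4M")
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

In a notebook whose kernel has already created sc for you, setting the variable in a cell comes too late; it would have to go into whatever starts the kernel.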

Reading csv files in zeppelin using spark-csv

I wanna read csv files in Zeppelin and would like to use databricks'
spark-csv package: https://github.com/databricks/spark-csv
In the spark-shell, I can use spark-csv with
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
But how do I tell Zeppelin to use that package?
Thanks in advance!

You need to add the Spark Packages repository to Zeppelin before you can use %dep on spark packages.
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
Alternatively, if this is something you want available in all your notebooks, you can add the --packages option to the spark-submit command setting in the interpreters config in Zeppelin, and then restart the interpreter. This should start a context with the package already loaded as per the spark-shell method.

Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
Scroll down to the spark interpreter paragraph and click edit, scroll down a bit to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" or a newer version. Then restart the interpreter when asked.
In the notebook, use something like:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("my_data.txt")
Update:
In the Zeppelin user mailing list, it is now (Nov. 2016) stated by Moon Soo Lee (creator of Apache Zeppelin) that users prefer to keep %dep as it allows for:
self-documenting library requirements in the notebook;
per Note (and possibly per User) library loading.
The tendency is now to keep %dep, so it should not be considered deprecated at this time.

BEGIN-EDIT
%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.
Please read on if you are using a Zeppelin version older than 0.6.0.
END-EDIT
You can load the spark-csv package using the %dep interpreter, like this:
%dep
z.reset()
// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")
See Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html
If you've already initialized the Spark context, a quick solution is to restart Zeppelin, execute a Zeppelin paragraph with the above code first, and then execute your Spark code to read the CSV file.

You can add jar files under Spark Interpreter dependencies:
Click 'Interpreter' menu in navigation bar.
Click 'edit' button for Spark interpreter.
Fill artifact and exclude fields.
Press 'Save'

If you define the following in conf/zeppelin-env.sh:
export SPARK_HOME=<PATH_TO_SPARK_DIST>
Zeppelin will then look in $SPARK_HOME/conf/spark-defaults.conf and you can define jars there:
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
Then look at
http://zepplin_url:4040/environment/ for the following:
spark.jars file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41
For more reference: https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/interpreter/spark.html
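To check the environment page against what you configured, a small Python sketch like the one below can derive the expected jar file names from the spark.jars.packages value. The naming pattern (group_artifact-version.jar) is inferred only from the listing above, so treat it as an assumption; the helper names ivy_jar_name and missing_jars are made up, and coordinates with extra parts such as a classifier are not handled.

```python
def ivy_jar_name(coordinate):
    """Map 'group:artifact:version' to the jar file name seen in the listing above."""
    group, artifact, version = coordinate.split(":")
    return f"{group}_{artifact}-{version}.jar"


def missing_jars(spark_jars_value, expected_names):
    """Return the expected jar names that do not appear in a spark.jars value."""
    present = {path.rsplit("/", 1)[-1] for path in spark_jars_value.split(",")}
    return [name for name in expected_names if name not in present]


# Values copied from the answer above
packages = "com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41"
spark_jars = (
    "file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,"
    "file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar"
)

expected = [ivy_jar_name(c) for c in packages.split(",")]
print(missing_jars(spark_jars, expected))
```

An empty list means every configured package shows up in spark.jars; any name that is printed points at a package that was not resolved.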

Another solution:
In conf/zeppelin-env.sh (located in /etc/zeppelin for me) add the line:
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
Then start the service.
