Read application configuration from SparkContext object - apache-spark

I am developing a Spark application using pyspark shell.
I started the IPython notebook service with the command below (see here for how I created the profile):
IPYTHON_OPTS="notebook --port 8889 --profile pyspark" pyspark
Based on the documentation, there is an sc SparkContext object already created for me with some default configuration.
"In the PySpark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work."
I basically have two questions here:
(1) How can I get a summary of the configuration for the default sc object?
I want to know how much memory has been allocated, how many cores I can use, etc. However, I only found a getLocalProperty method on the sc object in the pyspark API, and I don't know what key argument I should pass to it.
(2) Is it possible to modify the SparkContext while working with the IPython notebook? If the configuration cannot be changed once the notebook has started, is there a file somewhere where I can configure sc?
I am fairly new to Spark; the more information (and resources) you can provide, the better. Thanks!

You are not required to use the pyspark shell: you can import the pyspark classes and then instantiate the SparkContext yourself.
from pyspark import SparkContext, SparkConf
Set up your custom config:
conf = SparkConf().setAppName(appName).setMaster(master)
# set values into conf here ..
sc = SparkContext(conf=conf)
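If you just want a summary of what the shell's pre-created sc was given (question 1), a minimal sketch, assuming the sc that the pyspark shell creates for you, is to read its SparkConf back:
# Inspect the configuration of an already-running SparkContext.
# Assumes `sc` is the context the pyspark shell created.
for key, value in sc.getConf().getAll():      # getAll() returns (key, value) pairs
    print(key, "=", value)
print("default parallelism:", sc.defaultParallelism)  # rough proxy for usable cores
Note that getAll() only lists explicitly set properties; values left at their library defaults may not appear at all, so the Environment tab of the Spark UI is another place to look.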
You may also want to look at the general conf/spark-env.sh file:
conf/spark-env.sh.template # copy to conf/spark-env.sh and then modify vals as useful to you
e.g. some of the values you may customize:
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

Related

In Databricks Spark, how is SparkContext shared across different processes?

In https://docs.databricks.com/workflows/jobs/jobs.html#use-the-shared-sparkcontext it says:
Because Databricks initializes the SparkContext, programs that invoke
new SparkContext() will fail. To get the SparkContext, use only the
shared SparkContext created by Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
In SparkContext#getOrCreate it says:
This function may be used to get or instantiate a SparkContext and
register it as a singleton object. Because we can only have one active
SparkContext per JVM, this is useful when applications may wish to
share a SparkContext. This method allows not passing a SparkConf
(useful if just retrieving).
In SparkSession.Builder#getOrCreate it says:
Gets an existing SparkSession or, if there is no existing one, creates
a new one based on the options set in this builder. This method first
checks whether there is a valid thread-local SparkSession, and if yes,
return that one. It then checks whether there is a valid global
default SparkSession, and if yes, return that one. If no valid global
default SparkSession exists, the method creates a new SparkSession and
assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the non-static config
options specified in this builder will be applied to the existing
SparkSession.
So my understanding is that Databricks somehow creates a SparkContext in some process, probably a JVM, and then executes the submitted JAR in a different JVM process. Is this understanding correct?
If it is, then how does the SparkContext sharing mechanism work across multiple processes?
If not, then what actually happens and how does SparkContext get shared?
Thanks
On Databricks, the SparkContext/SparkSession are created when the cluster starts, and your submitted jar is then executed in the same JVM where the SparkContext/SparkSession was created. The recommendation not to stop the SparkContext is especially important when you submit a job to an interactive cluster (which is itself not recommended, for multiple reasons).
When you're using Python or R, you get separate Python/R processes, but they will use the same SparkContext.
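In PySpark the same pattern looks like this; a minimal sketch, assuming you are in an environment (Databricks or otherwise) where a context may already exist:
from pyspark.sql import SparkSession
# Attach to the SparkSession/SparkContext the platform already created,
# instead of calling the constructors directly.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
print(sc.appName, sc.master)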

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured, or did somebody else do it for me?
That is done only in notebooks, to simplify the user's work and avoid having them specify parameters, many of which would have no effect anyway because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a jar or Python wheel as a job, then it's your responsibility to create the corresponding objects.
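A minimal sketch of such a job entry point in Python (the function and app name are illustrative, not anything Databricks prescribes):
from pyspark.sql import SparkSession

def main():
    # No notebook has created a session for us, so build (or attach to) one here.
    spark = SparkSession.builder.appName("example-wheel-job").getOrCreate()
    spark.range(10).show()

if __name__ == "__main__":
    main()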
In the Databricks environment, as in Spark 2.0, the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they are encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)

Why does the code for initializing Spark Context vary widely between different sources?

I know that I need to initialize a SparkContext to create resilient distributed datasets (RDDs) in PySpark. However, different sources give different code for how to do so. To resolve this once and for all, what is the right code?
1) Code from Tutorials Point:
https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm
from pyspark import SparkContext
sc = SparkContext("local", "First App")
2) Code from Apache:
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#resilient-distributed-datasets-rdds
from pyspark import SparkContext, SparkConf
Then, later down the page, there is:
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
These are just two examples. I can list more, but the main problem for me is the lack of uniformity for something so simple and basic. Please help and clarify.
1)
In local[N], N is the maximum number of cores that can be used on the node at any point in time. This will use your local host's resources.
In cluster mode (when you specify a master node IP/URL) you can set --executor-cores N, which means each executor can run at most N tasks at the same time.
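For example, a sketch of how those flags are typically passed (the master URL, resource sizes and file name are placeholders, not prescribed values):
spark-submit \
  --master spark://master-host:7077 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_app.py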
2)
And when you don't specify an app name, it could be left blank or Spark could generate a random name; I tried to check the source code for setAppName() but could not find anything conclusive.
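To see that the two snippets from the question end up configuring the same things, here is a minimal sketch (master and app name are just illustrative values) that reads both settings back from the resulting context:
from pyspark import SparkContext, SparkConf

# Form 1: positional master and app name, as in the Tutorials Point example
sc = SparkContext("local[2]", "First App")
print(sc.master, sc.appName)              # -> local[2] First App
sc.stop()                                 # only one SparkContext may be active

# Form 2: the same settings via an explicit SparkConf, as in the Apache docs
conf = SparkConf().setMaster("local[2]").setAppName("First App")
sc = SparkContext(conf=conf)
print(sc.master, sc.appName)              # -> local[2] First App
sc.stop()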

Spark Context is not automatically created in Scala Spark Shell

I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error: not found: value sc. Any idea why sc is not automatically created in my Scala Spark shell?
I tried to manually create an sc and it gave me an error saying there is already a SparkContext in the JVM. Please see the picture:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as you can see at the top of my cmd window indicating bin\spark-shell.
Please advise. Thanks
Hopefully you have found the answer to your question, because I am encountering the same issue as well.
In the meantime, use this workaround. In the Scala Spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
As properties in the conf/spark-defaults.conf
e.g., the line: spark.driver.memory 4g
As args to spark-shell or spark-submit
e.g., spark-shell --driver-memory 4g ...
In your source code, configuring a SparkConf instance before using it to create the SparkContext:
e.g., sparkConf.set( "spark.driver.memory", "4g" )
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
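In PySpark, for example, a small sketch of a runtime SQL setting (the property shown is just one of the standard SQL options):
# Assumes an existing SparkSession named `spark`, as in the shells.
spark.conf.set("spark.sql.shuffle.partitions", "50")
print(spark.conf.get("spark.sql.shuffle.partitions"))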
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
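For example, a sketch of passing those options at launch so the shell's default context picks them up (the class listed under classesToRegister is a placeholder for your own classes, not a required value):
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=com.example.MyVertexAttribute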
