Get instance of Azure Databricks Spark in Python code - apache-spark

I am developing a Python package which will be deployed onto a Databricks cluster. We often need references to the "spark" and "dbutils" objects within the Python code.
We can access these objects easily within a notebook using "spark" (like spark.sql()). How do we get the spark instance within the Python code in the package?

SparkSession.Builder.getOrCreate:
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid global default SparkSession, and if yes, returns that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
So whenever you need an instance of SparkSession and don't want to pass it as an argument:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
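The question also asks about dbutils. A minimal sketch, assuming the code runs on a Databricks cluster where the pyspark.dbutils module is available (the dbutils.fs.ls call is only an illustration):
from pyspark.dbutils import DBUtils
# DBUtils(spark) returns the same dbutils object a Databricks notebook exposes
dbutils = DBUtils(spark)
dbutils.fs.ls("/")  # example call: list the DBFS root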

Related

In Databricks Spark, how is SparkContext shared across different processes?

In https://docs.databricks.com/workflows/jobs/jobs.html#use-the-shared-sparkcontext it says:
Because Databricks initializes the SparkContext, programs that invoke
new SparkContext() will fail. To get the SparkContext, use only the
shared SparkContext created by Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
In SparkContext#getOrCreate it says:
This function may be used to get or instantiate a SparkContext and
register it as a singleton object. Because we can only have one active
SparkContext per JVM, this is useful when applications may wish to
share a SparkContext. This method allows not passing a SparkConf
(useful if just retrieving).
In SparkSession.Builder#getOrCreate it says:
Gets an existing SparkSession or, if there is no existing one, creates
a new one based on the options set in this builder. This method first
checks whether there is a valid thread-local SparkSession, and if yes,
return that one. It then checks whether there is a valid global
default SparkSession, and if yes, return that one. If no valid global
default SparkSession exists, the method creates a new SparkSession and
assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the non-static config
options specified in this builder will be applied to the existing
SparkSession.
So my understanding is that Databricks somehow creates a SparkContext in some process, probably a JVM, and then executes the submitted JAR in a different JVM process. Is this understanding correct?
If it is, then how does the SparkContext sharing mechanism work across multiple processes?
If not, then what actually happens and how does SparkContext get shared?
Thanks
On Databricks, the SparkContext/SparkSession are created when the cluster starts, and then your submitted jar is executed in the same JVM where the SparkContext/SparkSession was created. The recommendation about not stopping the SparkContext is especially important when you are submitting a job to an interactive cluster (which is itself not recommended, for multiple reasons).
When you're using Python or R, you get separate Python/R processes, but they will use the same SparkContext.
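For Python, a minimal sketch of the same idea, mirroring the Scala snippet above (an illustration, not a Databricks-specific API):
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Attach to the SparkContext/SparkSession that Databricks already created;
# calling SparkContext() directly would fail because one is already active.
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()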

Spark Custom Aggregator -- register and invoke through PySpark

According to various docs, a custom Aggregator in Spark must be written in Java/Scala.
https://medium.com/swlh/apache-spark-3-0-remarkable-improvements-in-custom-aggregation-41dbaf725903
I have built and compiled a test implementation of a custom aggregator, but would now like to register and invoke it through PySpark and SparkSQL.
I tried spark.udf.registerJavaUDAF ... but that seems to work only with the older-style UDAF functions, not the new Aggregators.
How can I register a new Aggregator function written in Java through PySpark, if at all possible? (I know how to pass the JAR to spark-submit etc.; the problem is the registration call.)
I'm not sure what the correct approach is, but I was able to get the following to work.
In your Java class that extends Aggregator:
// This is assumed to be part of: com.example.java.udaf
// MyUdaf is the class that extends Aggregator
// Required imports: org.apache.spark.sql.SparkSession, org.apache.spark.sql.functions, org.apache.spark.sql.Encoders
// I'm using Encoders.LONG() as an example; change this as needed
// Change the registered Spark SQL name, `myUdaf`, as needed
// (if you don't want to hardcode the "myUdaf" string, you can pass it in as a parameter)

// Expose UDAF registration -- this method is what makes use from Python possible
public static void register(SparkSession spark) {
    spark.udf().register("myUdaf", functions.udaf(new MyUdaf(), Encoders.LONG()));
}
Then in Python:
from pyspark.sql import SparkSession

udaf_jar_path = "..."

# Running in standalone mode
spark = SparkSession.builder \
    .appName("udaf_demo") \
    .config("spark.jars", udaf_jar_path) \
    .master("local[*]") \
    .getOrCreate()

# Register using the registration function provided by the Java class
spark.sparkContext._jvm.com.example.java.udaf.MyUdaf.register(spark._jsparkSession)
As a bonus, you can use this same registration function in Java:
// Running in standalone mode
SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("udaf_demo")
    .getOrCreate();

register(spark);  // or MyUdaf.register(spark) when calling from outside the class
Then you should be able to use this directly in Spark SQL:
SELECT
col0
, myUdaf(col1)
FROM some_table
GROUP BY 1
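Once registered, the same function can also be invoked from PySpark, e.g. through spark.sql (a sketch reusing the myUdaf name and the some_table placeholder from above):
# Assumes the registration call above has already run and some_table exists as a table/temp view
df = spark.sql("SELECT col0, myUdaf(col1) AS agg_col1 FROM some_table GROUP BY col0")
df.show()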
I tested this with a simple summation and it worked reasonably well. For summing 1M numbers, the Python version was ~150ms slower than the Java one (local testing using standalone mode, with both run directly within my IDEs). Compared to the built-in sum it was about half a second slower.
An alternative approach is to use Spark native functions. I haven't directly used this approach; however, I have used the spark-alchemy library which does. See their repo for more details.

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or did somebody else do it for me?
That is done only in the notebooks, to simplify users' work and avoid making them specify different parameters, many of which won't have any effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a jar or a Python wheel as a job, then it's your responsibility to create the corresponding objects, as in the sketch below.
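A minimal sketch of what that looks like for a Python wheel entry point (the app name and the query are only illustrative):
from pyspark.sql import SparkSession

def main():
    # In a jar/wheel job there is no notebook-provided "spark", so create (or attach to) one yourself
    spark = SparkSession.builder.appName("my-wheel-job").getOrCreate()
    spark.range(10).show()

if __name__ == "__main__":
    main()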
In the Databricks environment the SparkSession is created for you. In Spark 2.0, the same effects can be achieved through SparkSession without explicitly creating SparkConf, SparkContext, or SQLContext, as they're encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)

How to use getOrCreate() method in SparkContext class and what exactly is the functionality we achieve from this method

What is the use of the getOrCreate() method in the SparkContext class, and how can I use it? I did not find any suitable (coding-wise) example of this.
What I understand is that using the above method I can share a SparkContext between applications. What do we mean by applications here?
Is an application a different job submitted to a Spark cluster?
If so, then should we be able to use global (broadcast) variables and temp tables registered in one application in another application?
Please, could anyone elaborate and give a suitable example of this?
As given in the Javadoc for SparkContext, getOrCreate() is useful when applications may wish to share a SparkContext. So yes, you can use it to share a SparkContext object across applications. And yes, you can re-use broadcast variables and temp tables across them.
As for understanding Spark applications, please refer to this link. In short, an application is the highest-level unit of computation in Spark, and what you submit to a Spark cluster is not a job but an application. Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
getOrCreate
public SparkSession getOrCreate()
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local SparkSession and if yes, return that one. It then checks whether there is a valid global default SparkSession and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
Please check: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
An example can be:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
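Since the question asks about the SparkContext class specifically, here is a small sketch of the singleton behavior described above (the conf values are only placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("getOrCreate-demo")
sc1 = SparkContext.getOrCreate(conf)  # creates a context on the first call
sc2 = SparkContext.getOrCreate()      # later calls return the same shared instance
print(sc1 is sc2)                     # True: only one active SparkContext per JVM/process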

Read application configuration from Sparkcontext Object

I am developing a Spark application using pyspark shell.
I started the IPython notebook service using the command below (see here for how I created the profile):
IPYTHON_OPTS="notebook --port 8889 --profile pyspark" pyspark
Based on the documentation, there is an sc SparkContext object already created for me with some default configuration.
"In the PySpark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work."
I basically have two questions here:
(1) How can I get a summary of the configuration for the default sc object?
I want to know how much memory has been allocated, how many cores I can use, etc. However, I only found a method called getLocalProperty on the sc object in the PySpark API, without knowing what key argument I should pass.
(2) Is it possible to modify the SparkContext when working with the IPython notebook? If you cannot modify the configuration once you have started the IPython notebook, is there a file somewhere to configure sc?
I am fairly new to Spark; the more information (resources) you can provide, the better. Thanks!
It is not required to use the pyspark shell: you can import the pyspark classes and then instantiate the SparkContext yourself:
from pyspark import SparkContext, SparkConf
Set up your custom config:
conf = SparkConf().setAppName(appName).setMaster(master)
# set values into conf here ..
sc = SparkContext(conf=conf)
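For question (1), one way to get a summary of what the running context is using (a sketch; it works on the shell's sc as well) is to dump its SparkConf:
# Print every (key, value) pair the active SparkContext was started with,
# e.g. spark.master, spark.app.name, spark.executor.memory (if set)
for key, value in sc.getConf().getAll():
    print(key, value)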
You may also want to look at the general spark-env.sh:
conf/spark-env.sh.template  # copy to conf/spark-env.sh and then modify the values as needed
e.g. some of the values you may customize:
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
