I've encountered a problem with pyspark when I've made Import Pyspark from Sparkcontext but I found that it can be imported from sparkconf as well, I'm asking what's the difference between those two spark class libraries.
Sparkcontext is the entry point for spark environment. For every sparkapp you need to create the sparkcontext object. In spark 2 you can use sparksession instead of sparkcontext.
Sparkconf is the class which gives you the various option to provide configuration parameters.
Val Conf = new sparkConf().setMaster(“local[*]”).setAppName(“test”)
Val SC = new sparkContext(Conf)
The spark configuration is passed to spark context. You can also set different application configuration in sparkconf and pass to sparkcontex
SparkConf is a configuration class for setting config information in key value format
SparkContext is the main entry class for establishing connection to the cluster.
Implementation of SparkConf-->
class SparkConf(object):
def __init__(self, loadDefaults=True, _jvm=None, _jconf=None):
"""
Create a new Spark configuration.
"""
if _jconf:
self._jconf = _jconf
else:
from pyspark.context import SparkContext
_jvm = _jvm or SparkContext._jvm
In this SparkContext is imported in the constructor, so you can pass the sparkContext. Similarly in SparkContext we have sparkConf as parameter so that you can pass sparkConf to it.
Thus you are setting values of configuration in both the ways.
Related
The problem is that the pyspark script runs fine in one cluster, the error occurs when I run the same pyspark script in another yarn cluster. I guess the spark environment configurations are different between two cluster.
Here is the code to initialize the spark session.
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
hive_context = HiveContext(sc)
error
Error while instantiating * org. apache. spark. sql. hive. HiveACLSessi onStateBuilder’
Why I don't need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or somebodyelse did it for me?
That is done only in the notebooks, to simplify user's work & avoiding them to specify different parameters, many of them won't have any effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from jar or Python wheel as job, then it's your responsibility to create corresponding objects.
In Databricks environment, Whereas in Spark 2.0 the same effects can be achieved through SparkSession, without expliciting creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.ref: link
When we start pyspark (spark 2.4), it comes with a spark variable call underline functionality.
so when to call and use SparkSession and SparkContext methods if "spark" is already available.
Using spark 2.4, you probably see something like this in your log:
Spark context available as 'sc' (master = yarn, app id = application...).
Spark session available as 'spark'.
According to databricks blog:
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, as shown here:
//set up the spark configuration and create contexts
val sparkConf = new SparkConf().setAppName("SparkSessionZipsExample").setMaster("local")
// your handle to SparkContext to access other context like SQLContext
val sc = new SparkContext(sparkConf).set("spark.some.config.option", "some-value")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Whereas in Spark 2.0 the same effects can be achieved through
SparkSession, without expliciting creating SparkConf, SparkContext or
SQLContext, as they’re encapsulated within the SparkSession.
So:
In your case spark is just an alias for the SparkSession.
You not need to use SparkContext as it is encapsulated within the SparkSession.
I am using Apache Spark and running it on Ipython notebook.
I am trying to convert a regular dataframe to Spark DataFrame. For that I need sqlContext. When I use it i get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls='MyProfiler')
but when I create the SparkSession in 2.0 I don't explicitly have access to
the SparkContext. Can someone please advise how to do this for Spark 2.0+ ?
SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler
spark = SparkSession(SparkContext('local', 'test', profiler_cls=BasicProfiler))