spark variable in pyspark vs SparkSession - apache-spark

When we start pyspark (Spark 2.4), it comes with a spark variable that exposes the underlying functionality.
So when should we call and use SparkSession and SparkContext methods if "spark" is already available?

Using Spark 2.4, you probably see something like this in your log:
Spark context available as 'sc' (master = yarn, app id = application...).
Spark session available as 'spark'.
According to the Databricks blog:
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, as shown here:
// set up the Spark configuration and create contexts
val sparkConf = new SparkConf()
  .setAppName("SparkSessionZipsExample")
  .setMaster("local")
  .set("spark.some.config.option", "some-value")
// your handle to SparkContext to access other contexts like SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Whereas in Spark 2.0 the same effects can be achieved through
SparkSession, without explicitly creating SparkConf, SparkContext or
SQLContext, as they’re encapsulated within the SparkSession.
So:
In your case spark is just an alias for the SparkSession.
You do not need to create a SparkContext separately, as it is encapsulated within the SparkSession.
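For example, in the pyspark shell the pre-created spark object already covers both levels of the API; a minimal sketch (the DataFrame and RDD contents are made up for illustration):
# 'spark' is the SparkSession that the pyspark shell already created for you
df = spark.range(5)                   # DataFrame API via the SparkSession
spark.sql("SELECT 1 AS one").show()   # Spark SQL via the SparkSession

# the underlying SparkContext is still reachable when you need the RDD API
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
print(rdd.sum())                      # 6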

Related

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or did somebody else do it for me?
That is done only in the notebooks, to simplify the user's work and to avoid making them specify different parameters, many of which would not have any effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark: both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a jar or a Python wheel as a job, then it's your responsibility to create the corresponding objects.
In the Databricks environment, as in Spark 2.0, the same effects can be achieved through SparkSession without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)
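As a rough sketch of that builder pattern for the jar/wheel case, assuming PySpark (the app name and config key are placeholders):
from pyspark.sql import SparkSession

# getOrCreate() returns the SparkSession that is already running (notebook,
# shell) and only builds a new one, plus its SparkContext, otherwise.
spark = (SparkSession.builder
         .appName("my-wheel-job")                          # placeholder name
         .config("spark.some.config.option", "some-value") # placeholder key
         .getOrCreate())

sc = spark.sparkContext  # the underlying SparkContext comes along for free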

What's the difference between SparkConf and SparkContext?

I've encountered a problem with pyspark when importing SparkContext from pyspark, but I found that SparkConf can be imported from pyspark as well. I'm asking what the difference is between those two Spark classes.
SparkContext is the entry point to the Spark environment. For every Spark application you need to create a SparkContext object. In Spark 2 you can use SparkSession instead of SparkContext.
SparkConf is the class which gives you the various options to provide configuration parameters.
val conf = new SparkConf().setMaster("local[*]").setAppName("test")
val sc = new SparkContext(conf)
The Spark configuration is passed to the Spark context. You can also set different application configuration in SparkConf and pass it to SparkContext.
SparkConf is a configuration class for setting config information in key-value format.
SparkContext is the main entry class for establishing a connection to the cluster.
Implementation of SparkConf:
class SparkConf(object):

    def __init__(self, loadDefaults=True, _jvm=None, _jconf=None):
        """
        Create a new Spark configuration.
        """
        if _jconf:
            self._jconf = _jconf
        else:
            from pyspark.context import SparkContext
            _jvm = _jvm or SparkContext._jvm
Here SparkContext is imported in the constructor, so you can pass the SparkContext to it. Similarly, SparkContext takes a SparkConf as a parameter, so you can pass the SparkConf to it.
Thus you can set configuration values either way.
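A PySpark equivalent of the Scala snippet above, as a small sketch (the extra config key is a placeholder):
from pyspark import SparkConf, SparkContext

# SparkConf holds configuration as key-value pairs; SparkContext consumes it.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("test")
        .set("spark.some.config.option", "some-value"))  # placeholder key

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.app.name"))  # -> test
sc.stop()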

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe to a Spark DataFrame. For that I need sqlContext. When I use it I get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
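Once the SparkSession exists, converting a regular (pandas) dataframe is done with createDataFrame; a minimal sketch with made-up data:
import pandas as pd

# a plain pandas DataFrame (contents made up for illustration)
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# convert it to a Spark DataFrame via the SparkSession created above
sdf = spark.createDataFrame(pdf)
sdf.show()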

Is SparkEnv created after the creation of SparkSession in Spark 2?

In Spark 1.6, a SparkEnv is automatically created after creating a new SparkContext object.
In Spark 2.0, SparkSession was introduced as the entry point to Spark SQL.
Is SparkEnv created automatically after the creation of SparkSession in Spark 2?
Yes, SparkEnv, SparkConf and SparkContext are all automatically created when SparkSession is created (and that's why corresponding code in Spark SQL is more high-level and hopefully less error-prone).
SparkEnv is a part of Spark runtime infrastructure and is required to have all the Spark Core's low-level services up and running before you can use the high-level APIs in Spark SQL (or Spark MLlib). Nothing has changed here.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.sparkContext
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1e86506c
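The same relationship can be checked from PySpark (SparkEnv itself lives on the JVM side and is not exposed to Python, but the SparkContext and its SparkConf are); a small sketch assuming the shell-provided spark object:
# the SparkSession carries the SparkContext (and its SparkConf) with it
print(type(spark))         # <class 'pyspark.sql.session.SparkSession'>
print(spark.sparkContext)  # <SparkContext master=... appName=...>
print(spark.sparkContext.getConf().get("spark.app.name"))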

Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get it to work. I'm running Spark 1.6.0 using a Docker image.
I'm trying to do this in Scala:
import org.apache.spark.sql.functions._
df.groupBy("column1")
.agg(collect_set("column2"))
.show()
And receive the following error at runtime:
Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;
Also tried it using pyspark, but it also fails. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable these functions.
How to fix this? Thanx!
Spark 2.0+:
SPARK-10605 introduced native collect_list and collect_set implementation. SparkSession with Hive support or HiveContext are no longer required.
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
  .master("local")
  .appName("testing")
  .enableHiveSupport() // <- enable Hive support.
  .getOrCreate()
In Python:
spark = (SparkSession.builder
.enableHiveSupport()
.getOrCreate())
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use Spark built with Hive support (this is already covered when you use pre-built binaries, which seems to be the case here) and initialize the SQL context as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
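For completeness, a usage sketch in PySpark on Spark 2.0+, where no Hive support is needed (the data and column names are made up to mirror the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, collect_set

spark = SparkSession.builder.master("local[*]").appName("collect-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("b", 2)], ["column1", "column2"])

df.groupBy("column1").agg(
    collect_set("column2").alias("as_set"),    # duplicates removed
    collect_list("column2").alias("as_list"),  # duplicates kept
).show()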
