Is SparkEnv created after the creation of SparkSession in Spark 2? - apache-spark

In Spark 1.6, a SparkEnv is automatically created when a new SparkContext object is created.
In Spark 2.0, SparkSession was introduced as the entry point to Spark SQL.
Is SparkEnv created automatically after the creation of SparkSession in Spark 2?

Yes, SparkEnv, SparkConf and SparkContext are all automatically created when SparkSession is created (and that's why the corresponding code in Spark SQL is more high-level and hopefully less error-prone).
SparkEnv is a part of Spark runtime infrastructure and is required to have all the Spark Core's low-level services up and running before you can use the high-level APIs in Spark SQL (or Spark MLlib). Nothing has changed here.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.sparkContext
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1e86506c
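As a quick sanity check from the same shell, you can confirm that a SparkEnv already exists once the SparkSession is up. SparkEnv.get is a developer API, so treat this only as a sketch for poking around, not something to rely on in application code:
import org.apache.spark.SparkEnv
val env = SparkEnv.get              // the SparkEnv created alongside the SparkContext
env.conf.get("spark.app.name")      // backed by the same configuration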

Related

Why SparkSession class of pyspark is in pyspark.sql not pyspark?

As the official Spark documentation (Starting Point: SparkSession) puts it, "The entry point into all functionality in Spark is the SparkSession class."
So I'm wondering why, in PySpark, SparkSession is imported from pyspark.sql and not from pyspark itself. My logic is that since SparkSession is the entry point to all functionality in Spark (Spark SQL, Spark Streaming, MLlib, GraphX, etc.), doesn't it make more sense to import SparkSession from pyspark rather than pyspark.sql?
Primarily because pyspark is used for Spark Core, the RDD-based APIs that have existed in Spark from the beginning, while SparkSession (originally as SQLContext) was added as part of Spark SQL (original announcement).

spark variable in pyspark vs SparkSession

When we start pyspark (Spark 2.4), it comes with a spark variable that exposes the underlying functionality.
So when should we call and use SparkSession and SparkContext methods if spark is already available?
Using Spark 2.4, you probably see something like this in your log:
Spark context available as 'sc' (master = yarn, app id = application...).
Spark session available as 'spark'.
According to the Databricks blog:
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, as shown here:
//set up the spark configuration and create contexts
val sparkConf = new SparkConf()
  .setAppName("SparkSessionZipsExample")
  .setMaster("local")
  .set("spark.some.config.option", "some-value")
// your handle to SparkContext to access other contexts like SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Whereas in Spark 2.0 the same effects can be achieved through
SparkSession, without explicitly creating SparkConf, SparkContext or
SQLContext, as they’re encapsulated within the SparkSession.
So:
In your case, spark is just an alias for the SparkSession.
You do not need to create a SparkContext yourself, as it is encapsulated within the SparkSession.
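For comparison, here is a sketch of the same setup in Spark 2.x; the config key is the same placeholder used in the quote above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSessionZipsExample")
  .master("local")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// SparkContext and SQLContext are created for you:
val sc = spark.sparkContext
val sqlContext = spark.sqlContext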

Difference between various sparkcontexts in Spark 1.x and 2.x

Can anyone explain the difference between the SparkContext, SQLContext, HiveContext and SparkSession entry points, and each one's use cases?
SparkContext is used for the basic RDD API in both Spark 1.x and Spark 2.x.
SparkSession is used for the DataFrame API and the Structured Streaming API in Spark 2.x.
SQLContext and HiveContext are used for the DataFrame API in Spark 1.x and are deprecated as of Spark 2.x.
SparkContext is a class in the Spark API and the first step in building a Spark application. It sets up the driver memory and the allocation of executors and cores; in short, it is all about cluster management. A SparkContext can also be used to create RDDs and shared variables. Since SparkContext is a class, we need to create an instance of it:
val sc = new SparkContext(new SparkConf())
SparkSession is a new entry point added in Spark 2.x that replaces SQLContext and HiveContext.
Earlier we had two options: SQLContext, which was the way to run SQL operations on DataFrames, and HiveContext, which managed Hive connectivity and fetched/inserted data from/to Hive tables.
Since 2.x, we can create a SparkSession for SQL operations on DataFrames, and if you have any Hive-related work you just call enableHiveSupport() on the builder; then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This is how we can create a SparkSession for SQL operations on DataFrames:
val sparkSession = SparkSession.builder().getOrCreate()
The second way creates a SparkSession for SQL operations on DataFrames as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
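As a quick illustration (the file name people.json and the view name people are just placeholders), the same SparkSession then serves both the DataFrame API and SQL, including Hive tables once Hive support is enabled:
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val df = spark.read.json("people.json")     // DataFrame API
df.createOrReplaceTempView("people")        // register for SQL
spark.sql("SELECT * FROM people").show()    // SQL through the same session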

Why can't we create an RDD using Spark session

We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read that a SparkSession includes the SparkContext, StreamingContext, HiveContext, etc. If so, why are we not able to create an RDD by using a SparkSession instead of a SparkContext?
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.
In Spark 2+, the SparkContext is available via the SparkSession, so all you need to do is:
spark.sparkContext.textFile(yourFileOrURL)
Note that in PySpark the call looks exactly the same:
spark.sparkContext.textFile(yourFileOrURL)
See the SparkSession API documentation for the sparkContext accessor.
In earlier versions of Spark, the SparkContext was the entry point to Spark. As the RDD was the main API, RDDs were created and manipulated using the context's APIs.
For every other API we needed a different context: for streaming a StreamingContext, for SQL an SQLContext, and for Hive a HiveContext.
But as the Dataset and DataFrame APIs became the new standard APIs, Spark needed an entry point built for them. So in Spark 2.0, Spark introduced a new entry point for the Dataset and DataFrame APIs, called SparkSession.
SparkSession is essentially the combination of SQLContext, HiveContext and, in the future, StreamingContext.
All the APIs available on those contexts are available on SparkSession as well. A SparkSession internally has a SparkContext for the actual computation.
The sparkContext still contains the methods it had in previous versions.
The methods of SparkSession can be found in the SparkSession API documentation.
A DataFrame can be created in the following way:
val a = spark.read.text("wc.txt")
This creates a DataFrame; if you want to convert it to an RDD, use a.rdd. Please refer to the link below on the Dataset API:
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html
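Putting the two routes together, a minimal sketch using the Sample.txt and wc.txt file names from the question and answer above:
// Route 1: the RDD API through the session's SparkContext
val linesRdd = spark.sparkContext.textFile("Sample.txt")   // RDD[String]

// Route 2: read with the DataFrame API and convert
val df = spark.read.text("wc.txt")
val rowsRdd = df.rdd                                       // RDD[Row], not RDD[String]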

Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get it to work. I'm running Spark 1.6.0 using a Docker image.
I'm trying to do this in Scala:
import org.apache.spark.sql.functions._
df.groupBy("column1")
.agg(collect_set("column2"))
.show()
And receive the following error at runtime:
Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;
I also tried it using pyspark, but it fails as well. The docs state that these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.
How can I fix this? Thanks!
Spark 2.0+:
SPARK-10605 introduced native collect_list and collect_set implementations. SparkSession with Hive support or HiveContext is no longer required.
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
.master("local")
.appName("testing")
.enableHiveSupport() // <- enable Hive support.
.getOrCreate()
In Python:
spark = (SparkSession.builder
.enableHiveSupport()
.getOrCreate())
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use Spark built with Hive support (this is already covered when you use pre-built binaries, which seems to be the case here) and initialize the SQL context as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
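For completeness, here is a minimal Spark 2+ sketch of the original groupBy/collect_set query; the sample data is made up and the column names come from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

val spark = SparkSession.builder().master("local").appName("collect-example").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("column1", "column2")

df.groupBy("column1")
  .agg(collect_set("column2"), collect_list("column2"))
  .show()
// collect_set drops duplicates within each group; collect_list keeps them.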
