Specifying custom profilers for pyspark running Spark 2.0

I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls=MyProfiler)
but when I create the SparkSession in 2.0 I don't have explicit access to
the SparkContext. Can someone please advise how to do this for Spark 2.0+?

SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler
spark = SparkSession(SparkContext('local', 'test', profiler_cls=BasicProfiler))
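For completeness, here is a minimal sketch of a custom profiler, following the pattern shown in the pyspark.profiler documentation; MyCustomProfiler and its show override are purely illustrative, and profiling has to be switched on via spark.python.profile:
from pyspark import SparkConf, SparkContext
from pyspark.profiler import BasicProfiler

class MyCustomProfiler(BasicProfiler):
    """Illustrative profiler that only customizes how results are shown."""
    def show(self, id):
        print("My custom profiles for RDD: %s" % id)

# Profiling is off by default; enable it explicitly.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyCustomProfiler)

sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
sc.show_profiles()   # calls MyCustomProfiler.show for each profiled RDD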

Related

Why is the SparkSession class of pyspark in pyspark.sql rather than pyspark?

As the official Spark documentation (Starting Point: SparkSession) puts it, "The entry point into all functionality in Spark is the SparkSession class."
So I'm wondering why, in pyspark, SparkSession is imported from pyspark.sql and not from pyspark itself. My logic is: since SparkSession is the entry point to all functionality in Spark (Spark SQL, Spark Streaming, MLlib, GraphX, etc.), doesn't it make more sense to import SparkSession from pyspark rather than pyspark.sql?
Primarily because pyspark stands for Spark Core - the RDD-based APIs that have existed in Spark from the beginning - while SparkSession (originally SQLContext) was added as part of Spark SQL (original announcement).
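As a side note, the split is mostly historical; a quick sketch (names are illustrative) showing that the Core entry point is still reachable from a session created via pyspark.sql:
from pyspark.sql import SparkSession   # SparkSession lives in the SQL module

spark = SparkSession.builder.master("local").appName("demo").getOrCreate()
sc = spark.sparkContext                # the RDD-based Core entry point is still there
print(sc.parallelize([1, 2, 3]).count())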

spark variable in pyspark vs SparkSession

When we start pyspark (Spark 2.4), it comes with a spark variable that exposes the underlying functionality.
So when should we call and use SparkSession and SparkContext methods if "spark" is already available?
Using Spark 2.4, you probably see something like this in your log:
Spark context available as 'sc' (master = yarn, app id = application...).
Spark session available as 'spark'.
According to the Databricks blog:
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, as shown here:
// set up the spark configuration and create contexts
val sparkConf = new SparkConf()
  .setAppName("SparkSessionZipsExample")
  .setMaster("local")
  .set("spark.some.config.option", "some-value")
// your handle to SparkContext to access other contexts like SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Whereas in Spark 2.0 the same effects can be achieved through
SparkSession, without explicitly creating SparkConf, SparkContext or
SQLContext, as they’re encapsulated within the SparkSession.
So:
In your case, spark is just an alias for the SparkSession.
You do not need to create a SparkContext yourself, as it is encapsulated within the SparkSession (it is the sc variable in the shell, or spark.sparkContext).
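As an illustration (not part of the quoted blog), the Spark 2.x equivalent of the 1.x snippet above in pyspark, with a placeholder config option:
from pyspark.sql import SparkSession

# One builder call replaces SparkConf, SparkContext and SQLContext.
spark = (SparkSession.builder
         .master("local")
         .appName("SparkSessionZipsExample")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

sc = spark.sparkContext   # the encapsulated SparkContext, if you still need RDD APIs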

What's the difference between SparkConf and SparkContext?

I've encountered a problem with pyspark: I import SparkContext from pyspark, but I found that SparkConf can be imported from pyspark as well. I'm asking what the difference is between those two Spark classes.
SparkContext is the entry point to the Spark environment. For every Spark application you need to create a SparkContext object. In Spark 2 you can use SparkSession instead of SparkContext.
SparkConf is the class that gives you the various options to provide configuration parameters.
val conf = new SparkConf().setMaster("local[*]").setAppName("test")
val sc = new SparkContext(conf)
The Spark configuration is passed to the Spark context. You can also set different application configurations in SparkConf and pass them to SparkContext.
SparkConf is a configuration class for setting config information in key-value format.
SparkContext is the main entry class for establishing a connection to the cluster.
Implementation of SparkConf:
class SparkConf(object):
    def __init__(self, loadDefaults=True, _jvm=None, _jconf=None):
        """
        Create a new Spark configuration.
        """
        if _jconf:
            self._jconf = _jconf
        else:
            from pyspark.context import SparkContext
            _jvm = _jvm or SparkContext._jvm
Here SparkContext is imported inside the constructor, so the configuration can reach the running SparkContext's JVM; similarly, SparkContext takes a SparkConf as a parameter, so you can pass a SparkConf to it.
Thus you can set configuration values either way.
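A small sketch of the usual direction, building a SparkConf and handing it to SparkContext (the option values are placeholders):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("conf-vs-context-demo")
        .set("spark.executor.memory", "1g"))   # illustrative setting

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.executor.memory"))   # prints: 1g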

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe into a Spark DataFrame. For that I need sqlContext. When I use it I get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
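With the session in place, the original goal (turning a regular dataframe, e.g. a pandas one, into a Spark DataFrame) no longer needs a separate sqlContext; a sketch with made-up sample data:
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})   # illustrative data
sdf = spark.createDataFrame(pdf)   # uses the SparkSession created above
sdf.show()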

Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image.
I'm trying to do this in Scala:
import org.apache.spark.sql.functions._
df.groupBy("column1")
.agg(collect_set("column2"))
.show()
And receive the following error at runtime:
Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;
I also tried it using pyspark, but it fails as well. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.
How do I fix this? Thanks!
Spark 2.0+:
SPARK-10605 introduced native collect_list and collect_set implementations. A SparkSession with Hive support or a HiveContext is no longer required.
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
  .master("local")
  .appName("testing")
  .enableHiveSupport()  // <- enable Hive support.
  .getOrCreate()
In Python:
spark = (SparkSession.builder
    .enableHiveSupport()
    .getOrCreate())
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use Spark built with Hive support (this is already covered when you use pre-built binaries, which seems to be the case here) and initialize your SQLContext as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
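For reference, a quick pyspark sketch of how collect_set and collect_list behave once available; the sample DataFrame and column names are made up, and spark is assumed to be a SparkSession created as above:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 3)],
    ["column1", "column2"])

(df.groupBy("column1")
   .agg(F.collect_set("column2").alias("distinct_values"),
        F.collect_list("column2").alias("all_values"))
   .show())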
