How to create an empty RDD using SparkSession (since HiveContext is deprecated) - apache-spark

In Spark 1.x, I created an empty DataFrame from an empty RDD like this:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (where HiveContext is deprecated in favor of SparkSession), I tried:
var baseDF = sparkSession.createDataFrame(sc.emptyRDD[Row], baseSchema)
But I get the error below:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243)
Is there a way to create an empty RDD using SparkSession?

In Spark 2.0 you need to reference the Spark context through the Spark session rather than creating a separate one. You can create an empty DataFrame as follows; it worked for me.
sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
Hope it helps.

Related

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured, or did somebody else do it for me?
This is done only in notebooks, to simplify the user's work and spare them from specifying parameters, many of which would have no effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark, both of which initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a JAR or Python wheel as a job, then it's your responsibility to create the corresponding objects.
In the Databricks environment, as in Spark 2.0 generally, the same effect can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession. Using the builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.

spark.sql vs SqlContext

I have used SQL in Spark, in this example:
results = spark.sql("select * from ventas")
where ventas is a DataFrame, previously registered as a table:
df.createOrReplaceTempView('ventas')
but I have seen other ways of working with SQL in Spark, using the class SQLContext:
df = sqlContext.sql("SELECT * FROM table")
What is the difference between both of them?
Thanks in advance
From a user's perspective (not a contributor's), I can only rehash what the developers provided in the upgrade notes:
Upgrading From Spark SQL 1.6 to 2.0
SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
Before 2.0, the SQLContext needed an extra call to the factory that creates it. With SparkSession, they made things a lot more convenient.
If you take a look at the source code, you'll notice that much of the SQLContext class is marked @deprecated. Closer inspection shows that the most commonly used methods simply call sparkSession.
For more info, take a look at the developer notes, Jira issues, conference talks on Spark 2.0, and the Databricks blog.
Before Spark 2.x, SQLContext was built with the help of SparkContext. Spark 2.x introduced SparkSession, which has the functionality of both HiveContext and SQLContext, so there is no need to create a SQLContext separately.
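**Before Spark 2.x:**
from pyspark import SparkContext
from pyspark.sql import SQLContext
sCont = SparkContext()
sqlCont = SQLContext(sCont)
**After Spark 2.x:**
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()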
SparkSession is now the preferred way of working with Spark. The functionality of both HiveContext and SQLContext is available through this single SparkSession object.
You are using the latest syntax by creating a view with df.createOrReplaceTempView('ventas').
Next, create df1 through the SQLContext:
df1=sqlContext.sql("select col1,col2,col3 from table")
Then create df2 through the SparkSession:
df2=spark.sql("select col1,col2,col3 from table")
Compare the results using type(df1) and type(df2). In PySpark both calls return a DataFrame, since SQLContext.sql delegates to the session.

Why can't we create an RDD using Spark session

We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read that the Spark session includes the Spark context, streaming context, Hive context, and so on. If so, why can't we create an RDD using the Spark session instead of the Spark context?
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.
In Spark 2+, the Spark context is available through the Spark session. From Java, all you need to do is:
spark.sparkContext().textFile(yourFileOrURL)
See the SparkSession API documentation for this accessor.
Note that in Scala and PySpark, sparkContext is accessed without parentheses:
spark.sparkContext.textFile(yourFileOrURL)
In earlier versions of Spark, the Spark context was the entry point. As the RDD was the main API, it was created and manipulated using the context APIs.
For every other API, we needed a different context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive.
But as the Dataset and DataFrame APIs are becoming the new standard APIs, Spark needs an entry point built for them. So Spark 2.0 has a new entry point for the Dataset and DataFrame APIs, called SparkSession.
SparkSession is essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext.
All the APIs available on those contexts are available on the Spark session as well. The Spark session internally has a Spark context for the actual computation.
sparkContext still contains the methods it had in previous versions.
The methods of SparkSession are listed in its API documentation.
You can read the file through the Spark session in the following way:
val a = spark.read.text("wc.txt")
This creates a DataFrame. If you want to convert it to an RDD, use:
a.rdd
Please refer to the link below on the Dataset API:
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html

Spark Context is not automatically created in Scala Spark Shell

I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error 20 value sc not found. Any idea why sc is not automatically created in my Scala Spark shell?
I tried to manually create an sc, and it gave me an error saying there is already a Spark context in the JVM. Please see the picture:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as the top of my cmd window shows bin\spark-shell.
Please advise. Thanks
Hopefully you found the answer to your question, because I am encountering the same issue as well.
In the meantime, use this workaround. In the scala spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
1. As properties in conf/spark-defaults.conf
   e.g., the line: spark.driver.memory 4g
2. As args to spark-shell or spark-submit
   e.g., spark-shell --driver-memory 4g ...
3. In your source code, configuring a SparkConf instance before using it to create the SparkContext:
   e.g., sparkConf.set( "spark.driver.memory", "4g" )
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So, as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo, you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
