Spark Context is not automatically created in Scala Spark Shell - apache-spark

I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error like "<console>:20: error: not found: value sc". Any idea why sc is not automatically created in my Scala Spark shell?
I tried to manually create a SparkContext, and it gave me an error saying there is already a Spark context in the JVM. Please see the screenshot:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as you can see at the top of my cmd window indicating bin\spark-shell.
Please advise. Thanks

Hopefully you found the answer to your question, because I am encountering the same issue as well.
In the meantime, use this workaround. In the Scala Spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.
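If you are on Spark 2.x, the shell also pre-creates a SparkSession bound to the variable spark, so another way to reach the context (a minimal sketch, assuming spark is available in your shell) is:
// In the Spark 2.x shell the session is already bound as spark
val sc = spark.sparkContext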

Related

What is interpreter-aware SparkContext

In the Spark docs, the following line is mentioned:
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc.
Reference: https://spark.apache.org/docs/latest/rdd-programming-guide.html
What does interpreter-aware SparkContext mean here?
You can run the Spark shell with Python or Scala, and the SparkContext knows which one. Because the shell is interactive, there is an interpreter (a REPL) behind it, and the context is created to work with that interpreter. It's a common computing concept, that is all.
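As a rough illustration of what that buys you (a minimal sketch for the Scala shell; the case class is made up for the example): classes you define interactively in the REPL can be used inside closures that run on the executors, because the shell's SparkContext knows how to ship REPL-compiled code.
// Defined on the fly inside the shell
case class Point(x: Int, y: Int)

// The interpreter-aware context makes the REPL-defined class available to executors
val rdd = sc.parallelize(Seq(Point(1, 2), Point(3, 4)))
rdd.map(p => p.x + p.y).collect()   // Array(3, 7)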

How to create emptyRDD using SparkSession (since HiveContext got deprecated)

In Spark version 1.x, I created an empty RDD like below:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (since HiveContext got deprecated, I am using SparkSession), I tried:
var baseDF = sparkSession.createDataFrame(sc.emptyRDD[Row], baseSchema)
but I am getting the error below:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243)
Is there a way to create an empty RDD using SparkSession?
In Spark 2.0 you need to refer to the Spark context through the Spark session. You can create an empty DataFrame as below; it worked for me.
sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
Hope it helps you.
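For reference, a self-contained sketch of the same approach (the schema and names here are illustrative, not from the original post):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sparkSession = SparkSession.builder().appName("empty-df-example").getOrCreate()

// Illustrative schema; substitute your own baseSchema
val baseSchema = StructType(Seq(StructField("id", StringType, nullable = true)))

// Empty RDD[Row] taken from the context owned by the session, plus the schema
val baseDF = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)

baseDF.printSchema()    // shows: id: string (nullable = true)
println(baseDF.count()) // 0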

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
As properties in the conf/spark-defaults.conf
e.g., the line: spark.driver.memory 4g
As args to spark-shell or spark-submit
e.g., spark-shell --driver-memory 4g ...
In your source code, configuring a SparkConf instance before using it to create the SparkContext:
e.g., sparkConf.set( "spark.driver.memory", "4g" )
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
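For example, in the Spark 2.x shell (a minimal sketch, using the pre-created spark session variable):
// Runtime configuration is mostly SQL-level settings
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.get("spark.sql.shuffle.partitions")   // "200"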
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
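Putting the two pieces together for the GraphX case in the question, a sketch of stopping the shell's context and recreating it with Kryo and the GraphX registrations (assuming Spark < 2.0 and GraphX on the classpath; the app name is made up):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphXUtils

sc.stop()   // stop the context created by spark-shell

val conf = new SparkConf()
  .setAppName("graphx-kryo-shell")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Register GraphX's classes with Kryo before the new context is created
GraphXUtils.registerKryoClasses(conf)

val sc = new SparkContext(conf)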

sc is a SharkContext in the Spark shell (DSE)

I have set up DSE 4.7.0 with
SPARK_ENABLED=1
in /etc/default/DSE, but when I start the Spark shell (dse spark), even though I get no errors and see
Spark context available as sc.
then I find that the variable sc is not a SparkContext:
scala> sc
res5: shark.SharkContext = shark.SharkContext@61cb67f1
Is something misconfigured? Any suggestions would be much appreciated.
Cheers,
David

Read application configuration from SparkContext object

I am developing a Spark application using pyspark shell.
I kickstarted the IPython notebook service using the command below; see here how I created the profile:
IPYTHON_OPTS="notebook --port 8889 --profile pyspark" pyspark
Based on the documentation, there is an sc SparkContext object already created for me with some default configuration.
"In the PySpark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work."
I basically have two questions here:
(1) How can I get a summary of the configuration for the default sc object?
I want to know how much memory has been allocated, how many cores I can use, etc. However, I only found a method called getLocalProperty on the sc object in the PySpark API, without knowing what key argument I should pass.
(2) Is it possible to modify the SparkContext while working with the IPython notebook? If you cannot modify the configuration once you have started the IPython notebook, is there a file somewhere to configure sc?
I am fairly new to Spark; the more information (resources) you can provide, the better. Thanks!
It is not required to use the pyspark shell: you can import the pyspark classes and then instantiate the SparkContext yourself:
from pyspark import SparkContext, SparkConf
Set up your custom config:
conf = SparkConf().setAppName(appName).setMaster(master)
# set values into conf here ..
sc = SparkContext(conf=conf)
You may also want to look at the general spark-env.sh:
conf/spark-env.sh.template # copy to conf/spark-env.sh and then modify values as useful to you
e.g., some of the values you may customize:
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
