The Spark docs contain the following line:
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc.
reference : https://spark.apache.org/docs/latest/rdd-programming-guide.html
What does "interpreter-aware SparkContext" mean here?
You can run the Spark shell in Scala (spark-shell) or in Python (pyspark), and the SparkContext knows which one it is running under. Because the shell is interactive, there is an interpreter behind it. It's a common computing concept, that is all.
Related
Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured, or did somebody else do it for me?
That is done only in notebooks, to simplify the user's work and to spare them from specifying parameters, many of which would have no effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark: both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a JAR or a Python wheel as a job, then it's your responsibility to create the corresponding objects.
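As a rough illustration of what "creating the corresponding objects" can look like in a Python job, here is a minimal sketch. The names get_spark, main and the app name "example-job" are made up for this example, and pyspark is imported inside the function only so that the module can be loaded on a machine without Spark installed.

```python
def get_spark(app_name="example-job"):
    """Return the active SparkSession, or create one if none exists yet."""
    # Hypothetical helper; the import is local so this file loads without pyspark.
    from pyspark.sql import SparkSession

    # getOrCreate() reuses a running session instead of building a second one.
    return SparkSession.builder.appName(app_name).getOrCreate()


def main():
    """Example job body: sum a small range on the cluster, then shut down."""
    spark = get_spark()
    try:
        sc = spark.sparkContext  # the SparkContext wrapped by the session
        print(sc.parallelize(range(5)).sum())
    finally:
        spark.stop()  # release the resources the job acquired
```

In a notebook or in the shell, the same get_spark() call would simply hand back the session that was already started for you.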
In the Databricks environment, as in Spark 2.0 in general, the same effect can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.
I have Spark as an interpreter in Zeppelin.
I'm using Spark 2.0, and I built a session myself.
In general you should not initialize a SparkSession or a SparkContext in Zeppelin. Zeppelin notebooks are configured to create a session for you, and their correct behavior depends on using the provided objects.
Initializing your SparkSession will break core Zeppelin functionalities, and multiple SparkContexts will break things completely in the worst case scenario.
Is setting spark.driver.allowMultipleContexts to false the best way to run tests?
You should never use spark.driver.allowMultipleContexts: it is not supported, and it doesn't guarantee correct results.
I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error: "20 value sc not found". Any idea why sc is not automatically created in my Scala Spark shell?
I tried to create an sc manually, and it gave me an error saying there is already a Spark context in the JVM. Please see the picture:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as you can see at the top of my cmd window, which shows bin\spark-shell.
Please advise. Thanks
Hopefully you found the answer to your question, because I am encountering the same issue as well.
In the meantime, use this workaround. In the Scala Spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.
I was going through the Apache Spark documentation. The Spark docs for Python say the following:
...We can pass Python functions to Spark, which are automatically
serialized along with any variables that they reference...
I don't fully understand what it means. Does it have something to do with the RDD type?
What does it mean in the context of Spark?
The serialization is necessary when using PySpark because the function you define locally needs to be executed remotely on each of the worker nodes. This concept isn't really related to the RDD type.
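To make "serialized along with any variables that they reference" a bit more concrete, here is a toy sketch using only the standard library. It is not PySpark's actual code (as far as I know PySpark relies on a bundled cloudpickle for this); serialize and deserialize are names invented for this example, and it only handles plain global variables.

```python
import builtins
import marshal
import pickle
import types

factor = 3  # a driver-side variable that the function references


def scale(x):
    return x * factor


def serialize(func):
    """Pack the function's bytecode together with the globals it names."""
    code = func.__code__
    referenced = {
        name: func.__globals__[name]
        for name in code.co_names
        if name in func.__globals__
    }
    # marshal handles the code object; pickle handles the captured values.
    return pickle.dumps((marshal.dumps(code), referenced))


def deserialize(payload):
    """Rebuild the function from bytes, as a worker process would."""
    code_bytes, referenced = pickle.loads(payload)
    env = {"__builtins__": builtins, **referenced}
    return types.FunctionType(marshal.loads(code_bytes), env)


payload = serialize(scale)           # bytes that could be sent over the network
remote_scale = deserialize(payload)  # what the "worker" ends up calling
print(remote_scale(5))               # prints 15
```

The rebuilt function carries its own copy of factor, which is why changes made to a variable on the driver after the function has been shipped are not seen by the workers.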
I am trying to create a new Spark context using pyspark, and I get the following:
WARN SparkContext: Another SparkContext is being constructed (or threw
an exception in its constructor). This may indicate an error, since
only one SparkContext may be running in this JVM (see SPARK-2243). The
other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
I do not have any other context active (in theory), but maybe one did not shut down correctly and is still there. How can I find out whether there is another one, or kill all the current ones? I am using Spark 1.5.1.
When you run the pyspark shell and execute a Python script inside it, e.g. using execfile(), a SparkContext is already available as sc and a HiveContext as sqlContext, so a script that constructs its own context triggers this warning. To run a Python script without any pre-created contexts, just use ./bin/spark-submit your_python_file.
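If you do need a context with different settings inside a running shell, one option is to stop the existing one first. The sketch below assumes a PySpark version that provides SparkContext.getOrCreate(); restart_context and the app name "fresh-context" are hypothetical names used for illustration only.

```python
def restart_context(app_name="fresh-context"):
    """Stop whatever SparkContext is active and start a new one."""
    # Hypothetical helper; imported locally so the file loads without pyspark.
    from pyspark import SparkConf, SparkContext

    # Returns the context that is already running (for example the shell's sc),
    # or starts one if none exists, so there is always something to stop.
    existing = SparkContext.getOrCreate()
    existing.stop()

    # Only one context may be active at a time, so create the new one afterwards.
    conf = SparkConf().setAppName(app_name)
    return SparkContext(conf=conf)
```

After calling sc = restart_context() in the shell, the old sc is no longer usable and the returned object takes its place.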