sc is SharkContext in spark shell (DSE) - apache-spark

I have set up DSE 4.7.0 with
SPARK_ENABLED=1
in /etc/default/DSE. When I start the Spark shell (dse spark) I get no errors and see
Spark context available as sc.
but then I find that the variable sc is not a SparkContext:
scala> sc
res5: shark.SharkContext = shark.SharkContext@61cb67f1
Is something misconfigured? Any suggestions would be much appreciated.
Cheers,
David
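For what it's worth, a quick way to check from the REPL whether the shell's sc can still be used as a plain SparkContext (in the open-source Shark project shark.SharkContext extends SparkContext, but treat that as an assumption about DSE's build):
sc.isInstanceOf[org.apache.spark.SparkContext]   // true if SharkContext subclasses SparkContext
sc.parallelize(1 to 3).count()                   // should return 3 if the context is usable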

Related

SparkContext: Error initializing SparkContext while Running Spark Job via google DataProc
After I upgraded the Google Dataproc image version from 1.3.62-debian9 to 1.4-debian, all Spark Dataproc jobs started failing with an error:
22/01/09 00:36:50 INFO org.spark_project.jetty.server.Server: Started 3339ms
22/01/09 00:36:50 INFO org.spark_project.jetty.server.AbstractConnector: Started
22/01/09 00:36:50 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair
Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use
fair scheduling, configure pools in fairscheduler.xml or set
spark.scheduler.allocation.file to a file that contains the configuration.
ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.NumberFormatException: For input string: "30s"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1441)
I don't set '30s' in the Spark configuration file or in the SparkConf object.
This is how I initialize the SparkContext in my code:
val conf = new SparkConf().setAppName(getMainName().toString)
val sc = new SparkContext(conf)
Spark version: 2.3.0
I saw that the default setting of 'spark.scheduler.maxRegisteredResourcesWaitingTime' has the same value
(https://spark.apache.org/docs/latest/configuration.html#spark-configuration), but I did not change or update it.
I do not understand where this value comes from and why it is related to upgrading Dataproc.
It's related to Apache Hadoop: they added a time unit to a default value in hdfs-default.xml.
More info about the issue:
https://issues.apache.org/jira/browse/HDFS-12920
Apache Tez Job fails due to java.lang.NumberFormatException for input string: "30s"
The 1.4 image is approaching EOL too; could you try 1.5 instead?
That said, the problem is probably in your application, which likely brings in some old Hadoop/Spark jars and/or configs that break Spark, because when you SSH into the 1.4 cluster's main node and execute your code in spark-shell it works:
val conf = new SparkConf().setAppName("test-app-name")
val sc = new SparkContext(conf)
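If you want to see which Hadoop key is carrying the "30s" value on your cluster, a hedged check from spark-shell is below; the key name dfs.client.datanode-restart.timeout comes from the HDFS-12920 discussion rather than from your job, so treat it as an assumption and adjust it to whatever your stack trace points at:
// Inspect the Hadoop configuration that the shell's SparkContext sees.
val hadoopConf = sc.hadoopConfiguration
println(hadoopConf.get("dfs.client.datanode-restart.timeout"))   // e.g. "30s" on newer Hadoop defaults
// Newer Hadoop clients parse such values with getTimeDuration; older client jars
// bundled into an application call Configuration.getLong and choke on the "s" suffix.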

How to create an emptyRDD using SparkSession (since HiveContext got deprecated)

In Spark version 1.x
I created an empty RDD like below:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (since HiveContext got deprecated, I am using SparkSession), I tried:
var baseDF = sparkSession.createDataFrame(sc.emptyRDD[Row], baseSchema)
But I get the error below:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243)
Is there a way to create emptyRDD using sparkSession?
In Spark 2.0 you need to refer to the SparkContext through the SparkSession. You can create an empty DataFrame as below; it worked for me.
sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
Hope it helps you.
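For completeness, a minimal self-contained sketch; the app name, master setting, and baseSchema below are illustrative, not taken from the question:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// .master("local[*]") is only for a quick local test; drop it when submitting to a cluster.
val sparkSession = SparkSession.builder().appName("empty-df-sketch").master("local[*]").getOrCreate()
// Illustrative schema; substitute your own baseSchema.
val baseSchema = StructType(Seq(StructField("id", StringType, nullable = true)))
// Zero-row DataFrame built from an empty RDD[Row] plus the schema.
val baseDF = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
baseDF.printSchema()   // prints the schema; baseDF.count() returns 0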

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe to a Spark DataFrame. For that I need sqlContext, but when I use it I get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.

Spark Context is not automatically created in Scala Spark Shell

I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error: value sc not found. Any idea why sc is not automatically created in my Scala Spark shell?
I tried to manually create an sc and it gave me an error saying there is already a Spark context in the JVM. Please see the pic:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as you can see at the top of my cmd window indicating bin\spark-shell.
Please advise. Thanks
Hopefully you found the answer to your question, because I am encountering the same issue as well.
In the meantime, use this workaround. In the scala spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.
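A quick, purely illustrative way to confirm the recovered context works:
sc.parallelize(1 to 5).count()   // should return 5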

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
1. As properties in conf/spark-defaults.conf, e.g. the line: spark.driver.memory 4g
2. As args to spark-shell or spark-submit, e.g. spark-shell --driver-memory 4g ...
3. In your source code, configuring a SparkConf instance before using it to create the SparkContext, e.g. sparkConf.set("spark.driver.memory", "4g")
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So, as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
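Putting the two pieces together, here is a hedged sketch of registering GraphX's classes with Kryo from inside spark-shell on Spark < 2.0, using the stop-and-recreate approach above (master and app name are picked up from the spark.* system properties the shell already set):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphXUtils
sc.stop()   // stop the context the shell created for us
// new SparkConf() loads the spark.* system properties set by spark-shell (master, app name, ...)
val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
GraphXUtils.registerKryoClasses(conf)   // registers GraphX's internal classes with Kryo
val sc = new SparkContext(conf)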
