SparkContext: Error initializing SparkContext while Running Spark Job via google DataProc

After I upgraded the Google Dataproc image version from 1.3.62-debian9 to 1.4-debian, all Spark Dataproc jobs started failing with an error:
22/01/09 00:36:50 INFO org.spark_project.jetty.server.Server: Started 3339ms
22/01/09 00:36:50 INFO org.spark_project.jetty.server.AbstractConnector: Started
22/01/09 00:36:50 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair
Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use
fair scheduling, configure pools in fairscheduler.xml or set
spark.scheduler.allocation.file to a file that contains the configuration.
ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.NumberFormatException: For input string: "30s"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1441)
I don't set '30s' in the Spark configuration file or in the SparkConf object.
This is how I initialize the SparkContext in my code:
val conf = new SparkConf().setAppName(getMainName().toString)
val sc = new SparkContext(conf)
spark version - 2.3.0
I saw that the default setting of 'spark.scheduler.maxRegisteredResourcesWaitingTime' has the same value
(https://spark.apache.org/docs/latest/configuration.html#spark-configuration), but I did not change or update it.
I do not understand where this value comes from and why it is related to the Dataproc upgrade.

It's related to Apache Hadoop - they added time units to hdfs-default.xml.
More info about the issue:
https://issues.apache.org/jira/browse/HDFS-12920
Apache Tez Job fails due to java.lang.NumberFormatException for input string: "30s"
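If fixing the job's bundled Hadoop jars/configs is not immediately possible, one commonly reported workaround is to override the offending timeout with a plain number so the old Hadoop client can parse it. A minimal sketch (the property shown is the usual culprit from HDFS-12920, but the one breaking your job may differ):
import org.apache.spark.{SparkConf, SparkContext}

// Forward a Hadoop property through Spark's "spark.hadoop." prefix and strip the
// time-unit suffix ("30s" -> "30") that old Hadoop clients cannot parse.
val conf = new SparkConf()
  .setAppName("my-app") // illustrative app name
  .set("spark.hadoop.dfs.client.datanode-restart.timeout", "30")

val sc = new SparkContext(conf)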

The 1.4 image is approaching EOL too, could you try 1.5 instead?
That said, the problem is probably in your app, which likely brings in some old Hadoop/Spark jars and/or configs that break Spark, because when you SSH into the 1.4 cluster's main node and execute your code in spark-shell it works:
val conf = new SparkConf().setAppName("test-app-name")
val sc = new SparkContext(conf)

Related

Use case of spark.executor.allowSparkContext

While looking into spark-core, I found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find details in the official Spark documentation.
In the code, there is a short description for this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, a SparkContext is created on the driver, and executors are assigned by the resource manager, so the SparkContext is always created before the executors.
What is the use case of this config?
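For illustration, the kind of code the check targets is a SparkContext constructed inside a task closure, which runs on an executor. A contrived sketch (not something to do in practice; on Spark 3.1+ this fails unless the flag is set):
import org.apache.spark.{SparkConf, SparkContext}

// On the driver, as usual.
val sc = new SparkContext(new SparkConf().setAppName("outer").setMaster("local[2]"))

// The closure below runs inside tasks, i.e. on executors. Constructing a
// SparkContext there is exactly what spark.executor.allowSparkContext guards against.
sc.parallelize(1 to 2).foreach { _ =>
  val inner = new SparkContext(new SparkConf().setAppName("inner"))
  inner.stop()
}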
From the Spark Core migration guide from 3.0 to 3.1:
In Spark 3.0 and below, SparkContext can be created in executors.
Since Spark 3.1, an exception will be thrown when creating
SparkContext in executors. You can allow it by setting the
configuration spark.executor.allowSparkContext when creating
SparkContext in executors.
As per SPARK-32160, since version 3.1 a check was added when creating a SparkContext (for PySpark, see pyspark/context.py) which prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()
# ...
@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.
    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")
I suggest this is an error in the docs and/or the implementation.
The whole concept makes no sense if you (as you do) understand the Spark architecture. No announcement has been made otherwise about this.
From the other answer and the plentiful documented errors on this aspect, it is clear something went awry.

How to create emptyRDD using SparkSession - (since HiveContext got deprecated)

In Spark version 1.*
I created an emptyRDD like below:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (since HiveContext got deprecated, I am using SparkSession),
I tried:
var baseDF = sparkSession.createDataFrame(sc.emptyRDD[Row], baseSchema)
However, I am getting the error below:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243)
Is there a way to create emptyRDD using sparkSession?
In Spark 2.0 you need to refer to the SparkContext through the SparkSession. You can create an empty DataFrame as below. It worked for me.
sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
Hope it helps you.
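For reference, a self-contained version of the above (the session settings and baseSchema here are illustrative):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sparkSession = SparkSession.builder()
  .appName("empty-df-example")
  .master("local[*]")
  .getOrCreate()

// A stand-in for the question's baseSchema.
val baseSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("value", StringType, nullable = true)
))

// An empty RDD[Row] plus a schema gives an empty DataFrame with that schema.
val baseDF = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
baseDF.printSchema()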

Spark Context is not automatically created in Scala Spark Shell

I read in a Spark book:
Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable called sc. Try printing out sc to see its type
sc
When I enter sc, it gives me an error at line 20: not found: value sc. Any idea why sc is not automatically created in my Scala Spark shell?
I tried to manually create an sc and it gave me an error saying there is already a SparkContext in the JVM. Please see the picture:
http://s30.photobucket.com/user/kctestingeas1/media/No%20Spark%20Context.jpg.html
I believe I am already in the Scala Spark shell, as you can see at the top of my cmd window indicating bin\spark-shell.
Please advise. Thanks
Hopefully you have found the answer to your question, because I encountered the same issue as well.
In the meantime, use this workaround. In the scala spark shell, enter:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
You then have access to sc.

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
As properties in the conf/spark-defaults.conf
e.g., the line: spark.driver.memory 4g
As args to spark-shell or spark-submit
e.g., spark-shell --driver-memory 4g ...
In your source code, configuring a SparkConf instance before using it to create the SparkContext:
e.g., sparkConf.set( "spark.driver.memory", "4g" )
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
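For example (spark is the session the shell already provides; the option shown is just an illustrative SQL setting):
// Runtime configuration through the session; mostly SQL options are honoured here.
spark.conf.set("spark.sql.shuffle.partitions", "200")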
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
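Putting the pieces together for the GraphX case in spark-shell (Spark < 2.0), a sketch that combines the stop-and-recreate approach above with the GraphXUtils call from the question (the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphXUtils

sc.stop()

val sparkConf = new SparkConf()
  .setAppName("graphx-kryo-shell")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Registers GraphX's internal classes with Kryo on this conf before the new
// context picks it up.
GraphXUtils.registerKryoClasses(sparkConf)

val sc = new SparkContext(sparkConf)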

sc is sharkContext in spark shell (DSE)

I have setup DSE 4.7.0 with
SPARK_ENABLED=1
in /etc/default/DSE, but when I start the spark shell (dse spark) even though I get no errors and see
Spark context available as sc.
then I find that the variable sc is not a SparkContext:
scala> sc
res5: shark.SharkContext = shark.SharkContext@61cb67f1
Is something misconfigured? Any suggestions would be much appreciated.
Cheers,
David
