Spark job throwing NPE [duplicate] - apache-spark

I'm writing Spark Jobs that talk to Cassandra in Datastax.
Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.
You can do this by calling the SparkContext [getOrCreate][1] method.
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network.
In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.
One day my tech lead came to me and said
Don't use SparkContext getOrCreate you can and should use joins instead
But he didn't give a reason.
My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?

TL;DR There are many legitimate applications of the getOrCreate methods, but an attempt to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, although there are some caveats, most notably:
In its simplest form it doesn't allow you to set job-specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in scope.
It can lead to opaque code with a "magic" dependency. It affects testing strategies and overall code readability.
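As a rough illustration of the two variants mentioned above, here is a minimal sketch that runs on the driver. The local master, the application name, and the object name GetOrCreateOnDriver are assumptions made for the example, not something from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateOnDriver {
  // Returns true when both getOrCreate variants hand back the same context.
  def sameInstance(): Boolean = {
    // Job-specific properties have to travel inside a SparkConf.
    val conf = new SparkConf()
      .setMaster("local[2]")            // assumption: local mode, for illustration only
      .setAppName("get-or-create-demo") // hypothetical application name

    // Variant taking a SparkConf: creates the context if none is active.
    // If one is already active it is returned instead, and the conf passed
    // here may not take effect.
    val sc = SparkContext.getOrCreate(conf)

    try {
      // No-argument variant: returns the context registered above and gives
      // no way to pass job-specific properties.
      SparkContext.getOrCreate() eq sc
    } finally {
      sc.stop()
    }
  }
}
```

Both calls run in driver code; nothing here is meant to be called from inside a task.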
However your question, specifically:
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network
and
Don't use SparkContext getOrCreate you can and should use joins instead
suggests you're actually using the method in a way it was never intended to be used: by using SparkContext on an executor node.
val rdd: RDD[_] = ???
rdd.map(_ => {
val sc = SparkContext.getOrCreate()
...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one, SparkContext initialized on the driver, and Apache Spark developers have done a lot to prevent users from any attempts to use SparkContext outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is a fundamental feature of Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
Describes the processing pipeline in a way that can be translated into actual tasks.
Enables graceful recovery in case of task failures.
Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext, cyclic dependencies are not an issue: RDDs and Datasets exist only in the scope of their parent context, so you won't be able to access objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time, the cluster manager won't have any indication that the applications are somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to work around it, with careful resource allocation and usage of manager-level scheduling pools, or even a separate cluster manager with its own set of resources, but it is not something that Spark is designed for, it is not supported, and overall it would lead to a brittle and convoluted design, where correctness depends on configuration details, the specific cluster manager choice, and overall cluster utilization.
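For comparison, a map-side join can be expressed entirely from the driver with a broadcast variable, without touching SparkContext inside a task. The sketch below assumes a small lookup table that fits in memory. The data, the local master, and the names DriverSideLookup, orders, and customers are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSideLookup {
  def run(): Set[(Int, String, String)] = {
    val conf = new SparkConf()
      .setMaster("local[2]")        // assumption: local mode, for illustration only
      .setAppName("map-side-join")  // hypothetical application name
    val sc = SparkContext.getOrCreate(conf) // called on the driver only

    try {
      // Hypothetical data: orders keyed by customer id, plus a small customer table.
      val orders    = sc.parallelize(Seq((1, "book"), (2, "pen"), (1, "lamp")))
      val customers = sc.parallelize(Seq((1, "alice"), (2, "bob")))

      // Collect the small side on the driver and broadcast it to the executors.
      val lookup = sc.broadcast(customers.collectAsMap())

      // The task closure captures only the broadcast handle, never the SparkContext.
      orders
        .flatMap { case (id, item) => lookup.value.get(id).map(name => (id, item, name)) }
        .collect()
        .toSet
    } finally {
      sc.stop()
    }
  }
}
```

When the lookup side is too large to broadcast, a plain orders.join(customers) keeps everything in a single DAG planned by the driver.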

Related

When does Spark evict broadcasted dataframe from Executors?

I have a question about what happens when we broadcast a DataFrame.
Copies of the broadcast DataFrame are sent to each Executor.
So, when does Spark evict these copies from each Executor?
I find this topic functionally easy to understand, but the manuals are harder to follow technically, and improvements are always in the offing.
My take:
There is a ContextCleaner that runs on the Driver for every Spark App.
It gets created and immediately started when the SparkContext commences.
It deals with all sorts of objects in Spark, not only broadcasts.
The ContextCleaner thread cleans up RDD, shuffle, and broadcast state, as well as Accumulators, using the keepCleaning method of this class, which runs continuously. It decides which objects need eviction because they are no longer referenced, and these get placed on a list. It calls various methods, such as registerShuffleForCleanup. That is to say, a check is made to see whether there are no live root objects pointing to a given object; if so, that object is eligible for clean-up (eviction).
context-cleaner-periodic-gc asynchronously requests the standard JVM garbage collector. Periodic runs of this are started when ContextCleaner starts and stopped when ContextCleaner terminates.
Spark makes use of the standard Java GC.
This https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html is a good reference next to the Spark official docs.
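The cleaner acts on references that have become unreachable, so timing depends on garbage collection. For explicit broadcast variables (as opposed to a DataFrame broadcast by a join), the copies can also be released by hand. The sketch below assumes local mode, and the object name BroadcastCleanup is hypothetical. spark.cleaner.periodicGC.interval is the setting that controls how often the periodic GC is requested.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCleanup {
  def run(): Long = {
    val conf = new SparkConf()
      .setMaster("local[2]")           // assumption: local mode, for illustration only
      .setAppName("broadcast-cleanup") // hypothetical application name
      // How often the context cleaner requests a JVM GC (the value here is arbitrary).
      .set("spark.cleaner.periodicGC.interval", "10min")
    val sc = SparkContext.getOrCreate(conf)

    try {
      val bc = sc.broadcast((1 to 1000).toArray)

      // Each of the four records reads the broadcast value inside a task.
      val total = sc.parallelize(1 to 4).map(_ => bc.value.length.toLong).reduce(_ + _)

      // Drop the cached copies on the executors; the value is re-sent if used again.
      bc.unpersist()
      // Remove all data and metadata for this broadcast; it cannot be used afterwards.
      bc.destroy()

      total
    } finally {
      sc.stop()
    }
  }
}
```

Without the explicit calls, the copies stay around until the driver-side reference is garbage collected and the ContextCleaner processes it.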

Spark shuffle blocks replication

I'd like to know if it's possible to define replication logic for shuffle blocks without using the persist action.
The use case is a complex SQL query with multiple joins, which requires a large amount of shuffle data that is saved on worker machines (with spill); losing a machine might require stage retries (using the DAG), which is very expensive and might not always work.
Can it be done using configuration or by inheriting from some class in the Spark context?
Version Spark 2.3

How many SparkSessions can a single application have?

I have found that as Spark runs, and tables grow in size (through Joins) that the spark executors will eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.
However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to hive tables, and Stop/Start the Spark session (and spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.
So my question: is it bad practice to continually start/stop the spark session to free system resources during the run of a single spark application?
But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new sparkContext. No error was thrown.
I was also able to use this on the JavaSparkPi without any problems.
I have tested this in yarn-client and a local spark install.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application there can be one and only one driver, which in turn means that there's one and only one SparkContext available on that JVM.
The driver of a Spark application is where the SparkContext lives (or rather the opposite: the SparkContext defines the driver; the distinction is pretty blurry).
You can only have one SparkContext at one time. You can start and stop it on demand as many times as you want, but I remember an issue about it that said you should not close SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of view of a Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and Spark properties (that could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
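A minimal sketch of that idea, assuming local mode: two sessions share one SparkContext, yet each keeps its own temporary view registered under the same name. The view name t and the object name SessionScopedViews are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SessionScopedViews {
  // Returns the row counts seen through the view "t" in each session.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")        // assumption: local mode, for illustration only
      .appName("session-views")  // hypothetical application name
      .getOrCreate()

    try {
      // newSession shares the SparkContext but isolates temporary views and SQL conf.
      val other = spark.newSession()

      spark.range(3).createOrReplaceTempView("t")
      other.range(5).createOrReplaceTempView("t")

      // Each session resolves "t" against its own catalog of temporary views.
      (spark.table("t").count(), other.table("t").count())
    } finally {
      spark.stop()
    }
  }
}
```

Global temporary views (createGlobalTempView) are shared across sessions, so they would not give this isolation.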
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and have created the following that simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
   +- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
   +- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets), which is pretty much exactly the reason why you use Spark in the first place, isn't it?
Try the following:
Stop SparkContext
Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs
An exception? Right! Remember that you closed the "doors" to Spark, so how could you have expected to be inside?! :)
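The two steps can be tried with a sketch like the one below, assuming local mode; the object name StoppedContext is hypothetical. Calling an RDD-creating method on a stopped context fails with an IllegalStateException.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.{Failure, Try}

object StoppedContext {
  // Returns true when using the stopped context fails as expected.
  def run(): Boolean = {
    val conf = new SparkConf()
      .setMaster("local[2]")         // assumption: local mode, for illustration only
      .setAppName("stopped-context") // hypothetical application name
    val sc = SparkContext.getOrCreate(conf)

    // Step 1: stop the SparkContext.
    sc.stop()

    // Step 2: try to run any processing with the stopped context.
    Try(sc.parallelize(1 to 10).count()) match {
      case Failure(_: IllegalStateException) => true
      case _                                 => false
    }
  }
}
```

After the stop, a fresh context can be created on the same JVM, but the stopped instance itself stays unusable.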

How to use long-lived expensive-to-instantiate utility services where executors run?

My Spark processing logic depends upon long-lived, expensive-to-instantiate utility objects to perform data-persistence operations. Not only are these objects probably not Serializable, but it is probably impractical to distribute their state in any case, as said state likely includes stateful network connections.
What I would like to do instead is instantiate these objects locally within each executor, or locally within threads spawned by each executor. (Either alternative is acceptable, as long as the instantiation does not take place on each tuple in the RDD.)
Is there a way to write my Spark driver program such that it directs executors to invoke a function to instantiate an object locally (and cache it in the executor's local JVM memory space), rather than instantiating it within the driver program then attempting to serialize and distribute it to the executors?
It is possible to share objects at partition level:
I've tried this: How to make Apache Spark mapPartition work correctly?
Then repartition so that numPartitions is a multiple of the number of executors.
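One common way to get a per-JVM instance is to hold it in a Scala object and touch it only inside mapPartitions, so that nothing is serialized from the driver. This is a sketch under assumptions: ExpensiveClient, ClientHolder, and ExecutorLocalService are hypothetical names standing in for the real utility, and in local mode the driver and the executor share one JVM.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an expensive, non-serializable service.
class ExpensiveClient {
  def save(record: String): Int = record.length // pretend to persist, return bytes written
}

// A Scala object is initialized once per JVM, and the lazy val defers
// construction until the first task on that executor asks for it.
object ClientHolder {
  lazy val client: ExpensiveClient = new ExpensiveClient
}

object ExecutorLocalService {
  def run(): Int = {
    val conf = new SparkConf()
      .setMaster("local[2]")          // assumption: local mode, for illustration only
      .setAppName("executor-local")   // hypothetical application name
    val sc = SparkContext.getOrCreate(conf)

    try {
      sc.parallelize(Seq("a", "bb", "ccc"), numSlices = 2)
        .mapPartitions { records =>
          // Looked up where the task runs, once per partition, not once per record.
          val client = ClientHolder.client
          records.map(client.save)
        }
        .reduce(_ + _)
    } finally {
      sc.stop()
    }
  }
}
```

Because the closure refers to ClientHolder statically, the client is never shipped over the network; each executor JVM builds its own copy on first use.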
