NullPointer in broadcast value being passed as parameter - apache-spark

:)
I'd like to say that I'm new to Spark, as many of these posts start, but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later, when the variable is referenced in code that runs on the executors (say, in map or foreach), the executor does not know what we are talking about unless the broadcast reference created in the driver is passed to it. I think this is explained perfectly here.
My problem is that I am getting a NullPointerException even though I passed the broadcast reference to the executors.
class A {
  var broadcastVal: Broadcast[DataFrame] = _
  ...
  def method1 {
    broadcastVal = otherMethodWhichSendBroadcast
    doSomething(broadcastVal, others)
  }
}
class B {
  def doSomething(...) {
    forEachPartition { x => doSomethingElse(x, broadcastVal) }
  }
}
object C {
  def doSomethingElse(...) {
    broadcastVal.value.show // --> NullPointerException
  }
}
What am I missing?
Thanks in advance!

RDDs and DataFrames are already distributed structures, so there is no need to broadcast them as local variables. (The org.apache.spark.sql.functions.broadcast() function, which is used while doing joins, is not a local-variable broadcast.)
If you try it anyway, the code compiles without errors, but at runtime it throws an exception such as NullPointerException, which is the expected behavior.
Example to explain the behavior:
package examples
import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}
object BroadCastCheck extends App {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  val df = spark.range(100).toDF()
  // Compiles, but the DataFrame behind this broadcast cannot be used on the executors
  var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)
  val t1 = sc.parallelize(0 until 10)
  // Correct: the broadcast value is local data (a primitive, a Map, or any Scala collection)
  val t2 = sc.broadcast(2)
  // Correct: reading a broadcast value inside a transformation
  val t3 = t1.filter(_ % t2.value == 0).persist()
  t3.foreach { x =>
    println(x)
    // broadcastVal.value.toDF().show // wrong: NullPointerException
    // spark.range(100).toDF().show   // wrong: NullPointerException
  }
}
Result (if you uncomment broadcastVal.value.toDF().show or spark.range(100).toDF().show in the code above):
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
For further reading, see the difference between a broadcast variable and the broadcast function here...
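One way to follow this advice is sketched below, assuming the DataFrame is small enough to fit in driver memory: collect it into a local collection on the driver and broadcast that collection instead of the DataFrame. The names BroadcastLocalData and evenIds are made up for illustration and do not come from the question.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not from the original post.
object BroadcastLocalData {
  def evenIds(spark: SparkSession): Array[Int] = {
    val sc = spark.sparkContext
    // collect() runs on the driver and returns plain local data, which is safe to broadcast.
    val lookup: Set[Long] = spark.range(0, 10, 2).collect().map(_.longValue).toSet
    val lookupBc: Broadcast[Set[Long]] = sc.broadcast(lookup)
    // Executors only touch lookupBc.value (a local Set), never a DataFrame or the SparkSession.
    sc.parallelize(0 until 10)
      .filter(x => lookupBc.value.contains(x.toLong))
      .collect()
  }
}
```

If the data is too large to collect, a join (optionally with the org.apache.spark.sql.functions.broadcast() function mentioned above) is usually the better fit than a broadcast variable.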

Related

Garbage Collection on Flink Applications

I have a very simple Flink application in Scala with two simple streams. I broadcast one stream to the other. The broadcast stream contains rules, and the application just checks whether the other stream's tuples are in the rules or not. Everything works fine, and my code is below.
This application runs indefinitely. I wonder whether the JVM could ever collect my rules object as garbage.
Does anyone have any idea? Many thanks in advance.
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
object StreamBroadcasting extends App {
  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
  // Words to check, keyed by the word itself
  val stream = env
    .socketTextStream("localhost", 9998)
    .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
    .keyBy(l => l)
  // Rules, sent to every parallel instance of the downstream operator
  val ruleStream = env
    .socketTextStream("localhost", 9999)
    .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  val broadcastStream: DataStream[String] = ruleStream.broadcast
  stream.connect(broadcastStream)
    .flatMap(new SimpleConnect)
    .print
  class SimpleConnect extends RichCoFlatMapFunction[String, String, (String, Boolean)] {
    private var rules: Set[String] = Set.empty[String] // Can JVM collect this object after a long time?
    override def open(parameters: Configuration): Unit = {}
    // Emit each word together with whether it matches a known rule
    override def flatMap1(value: String, out: Collector[(String, Boolean)]): Unit = {
      out.collect((value, rules.contains(value)))
    }
    // Add each incoming rule to the local rule set
    override def flatMap2(value: String, out: Collector[(String, Boolean)]): Unit = {
      rules = rules + value
    }
  }
  env.execute("flink-broadcast-streams")
}
No, the Set of rules will not be garbage collected. It will stick around forever. (Of course, since you're not using Flink's broadcast state, the rules won't survive an application restart.)

Is this usage of a HashMap thread safe?

I happen to work on a project that uses the mapperdao library.
Occasionally, it throws an exception which suggests that the library is not thread safe. I could not figure out why though.
The issue happens in this code:
type CacheKey = (Class[_], LazyLoad)
private val classCache = new scala.collection.mutable.HashMap[CacheKey, (Class[_], Map[String, ColumnInfoRelationshipBase[_, Any, Any, Any]])]
def proxyFor[ID, T](constructed: T with Persisted, entity: EntityBase[ID, T], lazyLoad: LazyLoad, vm: ValuesMap): T with Persisted = {
  (...)
  val key = (clz, lazyLoad)
  // get cached proxy class or generate it
  val (proxyClz, methodToCI) = classCache.synchronized {
    classCache.get(key).getOrElse {
      val methods = lazyRelationships.map(ci =>
        ci.getterMethod.getOrElse(
          throw new IllegalStateException("please define getter method on entity %s . %s".format(entity.getClass.getName, ci.column))
        ).getterMethod
      ).toSet
      if (methods.isEmpty)
        throw new IllegalStateException("can't lazy load class that doesn't declare any getters for relationships. Entity: %s".format(clz))
      val proxyClz = createProxyClz(constructedClz, clz, methods)
      val methodToCI = lazyRelationships.map {
        ci =>
          (ci.getterMethod.get.getterMethod.getName, ci.asInstanceOf[ColumnInfoRelationshipBase[T, Any, Any, Any]])
      }.toMap
      val r = (proxyClz, methodToCI)
      classCache.put(key, r)
      r
    }
  }
  val instantiator = objenesis.getInstantiatorOf(proxyClz)
  val instance = instantiator.newInstance.asInstanceOf[DeclaredIds[ID] with T with MethodImplementation[T with Persisted]]
  (...)
}
instantiator.newInstance throws java.lang.NoClassDefFoundError for a class that is supposed to be dynamically compiled, with its name put into the map.
This code seems thread safe to me, as every operation on the map is performed inside the synchronized block. I cannot think of a scenario in which the map returns a class that has not been generated and compiled yet. Am I missing something?
Another explanation is that the class is compiled but not visible to the current class loader. I don't know how that could happen, though, or why it would happen only occasionally.
UPDATE:
The stack trace looks like this:
java.lang.NoClassDefFoundError: Could not initialize class com.mypackage.MyClass_$2
at sun.reflect.GeneratedSerializationConstructorAccessor57.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:56)
at com.googlecode.mapperdao.lazyload.LazyLoadManager.proxyFor(LazyLoadManager.scala:63)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.lazyLoadEntity(MapperDaoImpl.scala:338)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.$anonfun$toEntities$5(MapperDaoImpl.scala:301)
at com.googlecode.mapperdao.internal.EntityMap.$anonfun$get$1(EntityMap.scala:46)
UPDATE 2: I finally found the root cause of my issue. It is indeed a concurrency problem, but not in the code I pasted. It's in the MImpl class.
What happens is that the dynamically generated class is compiled correctly, but its initialization occasionally fails due to a concurrency issue in the MImpl class. The next time the code tries to instantiate the class, the JVM throws NoClassDefFoundError.
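The JVM behavior described in UPDATE 2 can be reproduced in isolation. The sketch below is not mapperdao code; Flaky and InitFailureDemo are hypothetical names, and the always-failing initializer stands in for the occasional concurrency failure. The first access fails inside the class initializer, and every later access reports that the class could not be initialized.

```scala
// Hypothetical stand-in for a generated class whose static initialization fails.
object Flaky {
  // The initializer throws, so the JVM marks the class as failed to initialize.
  val value: Int = throw new IllegalStateException("init failed")
}

object InitFailureDemo {
  // Touches Flaky and reports which Throwable class the JVM raised, or "ok".
  def attempt(): String =
    try {
      Flaky.value.toString
      "ok"
    } catch {
      case e: Throwable => e.getClass.getName
    }
}
```

Calling InitFailureDemo.attempt() twice shows the two phases: the first call surfaces the real cause wrapped in an ExceptionInInitializerError, and later calls only see NoClassDefFoundError ("Could not initialize class ..."), which matches the stack trace above and explains why the original failure is easy to miss in the logs.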

task not serializable error while performing an RDD map function scala

Trust me on this one: I have tried really hard, toiling night and day, but I just could not get a hold of this Task not serializable error, which is now totally eating me up!
I understand there are many similar questions floating around on SO, but either I am really too dumb to get those (I am not expecting any spoon-feeding here) or mine is a different bug story altogether.
Please have a look at this one:
class RootServer(val config: Config) extends Configurable with Server with Serializable {
  var PKsAffectedDF: DataFrame = dataSource.getAffectedPKs(prevBookMark,currBookMark)
  if(PKsAffectedDF.rdd.isEmpty()){
    Holder.log.info("[SQL] no records to upsert in the internal Status Table == for bookmarks : "+prevBookMark+" ==and== "+currBookMark+" for table == "+dataSource.db+"_"+dataSource.table)
  }
  val PKsAffectedDF_json = PKsAffectedDF.toJSON
  PKsAffectedDF_json.foreachPartition { partitionOfRecords => {
    var props = new Properties()
    props.put('','')
    props.put('','')
    props.put('','')
    props.put('','')
    val producer = new KafkaProducer[String,String](props)
    partitionOfRecords.foreach
    {
      case x:String=>{
        println(x)
        val message=new ProducerRecord[String, String]("[TOPIC] "+dbname+"_"+dbtable,dbname,x)
        producer.send(message)
      }
    }
  }
  }
}
Now Config is the typesafe.config class, which I believe is serializable, and the Server class is a basic abstract class with just two methods. I am totally stuck on what could be producing the following stack trace:
http://pastebin.com/5yPA1s7e Sorry for pastebinning it but the entire stacktrace is there.
Thanks in Advance peeps.

How do I pass functions into Spark transformations during scalatest?

I am using FlatSpec to run a test and keep hitting an error because I pass a function into map. I've encountered this problem a few times, but worked around it by using an anonymous function. That doesn't seem to be possible in this case. Is there a way of passing functions into transformations in ScalaTest?
code:
"test" should "fail" in {
  val expected = sc.parallelize(Array(Array("foo", "bar"), Array("bar", "qux")))
  // Local helper passed into the map transformation below
  def validateFoos(firstWord: String): Boolean = {
    firstWord == "foo"
  }
  val validated = expected.map(x => validateFoos(x(0)))
  val trues = expected.map(row => true)
  assert(None === RDDComparisons.compareWithOrder(validated, trues))
}
error:
org.apache.spark.SparkException: Task not serializable
*This uses Holden Karau's Spark testing base:
https://github.com/holdenk/spark-testing-base
The "normal" way of handling this is to make the outer class serializable. This is bad practice in anything except tests, since you don't want to ship a lot of data around.
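An alternative that avoids serializing the outer class is to move the helper into a standalone object, so the closure has no enclosing instance to capture. The sketch below illustrates the difference with plain Java serialization and no Spark dependency; FooValidators, NotSerializableSuite and ClosureCheck are hypothetical names, and isSerializable only approximates the check Spark performs before shipping a task.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

// Helper in a standalone object: calling it needs no enclosing instance.
object FooValidators {
  def validateFoos(firstWord: String): Boolean = firstWord == "foo"
}

// Stand-in for a test suite; deliberately not Serializable.
class NotSerializableSuite {
  def validateFoos(firstWord: String): Boolean = firstWord == "foo"

  // Calls an instance method, so the closure captures `this`.
  val capturesThis: String => Boolean = x => validateFoos(x)

  // Calls a method on an object, so nothing from the suite is captured.
  val capturesNothing: String => Boolean = x => FooValidators.validateFoos(x)
}

object ClosureCheck {
  // True if the closure (and everything it captures) can be Java-serialized.
  def isSerializable(f: AnyRef): Boolean =
    Try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f)
      out.close()
    }.isSuccess
}
```

In the test from the question, this would mean defining validateFoos in an object outside the spec class and calling it from the map.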

Actor's value sometimes returns null

I have an Actor and some other object:
object Config {
val readValueFromConfig() = { //....}
}
class MyActor extends Actor {
val confValue = Config.readValueFromConfig()
val initValue = Future {
val a = confValue // sometimes it's null
val a = Config.readValueFromConfig() //always works well
}
//..........
}
The code above is a very simplified version of what I actually have. The odd thing is that sometimes val a = confValue returns null, whereas if I replace it with val a = Config.readValueFromConfig() then it always works well.
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor) or val a = self ! GetConfigValue and read the result afterwards?
val readValueFromConfig() = { //....}
This gives me a compile error. I assume you mean without parentheses?
val readValueFromConfig = { //....}
Same logic with different timing gives different result = a race condition.
val confValue = Config.readValueFromConfig() is always executed during construction of MyActor objects (because it's a field of MyActor). Sometimes this returns null.
val a = Config.readValueFromConfig() //always works well is always executed later: after MyActor is constructed, when the Future initValue is run by its executor. It seems this never returns null.
Possible causes:
Could be explained away if the body of readValueFromConfig depended on another parallel/async operation having completed. Any chance you're reading the config asynchronously? Given the name of this method, it probably just reads synchronously from a file, meaning this is not the cause.
Singleton objects are not threadsafe?? I compiled your code. Here's the decompilation of your singleton object java class:
public final class Config
{
  public static String readValueFromConfig()
  {
    return Config$.MODULE$.readValueFromConfig();
  }
}
public final class Config$
{
  public static final Config$ MODULE$;
  private final String readValueFromConfig;
  static
  {
    new Config$();
  }
  public String readValueFromConfig()
  {
    return this.readValueFromConfig;
  }
  private Config$()
  {
    MODULE$ = this;
    this.readValueFromConfig = // ... your logic here;
  }
}
Mmmkay... Unless I'm mistaken, that ain't thread-safe.
If two threads are accessing readValueFromConfig (say Thread1 accesses it first), then inside the constructor private Config$(), MODULE$ is unsafely published before this.readValueFromConfig is set (the reference to this prematurely escapes the constructor). Thread2, which is right behind, can read MODULE$.readValueFromConfig before it is set. This is highly likely to be a problem if '... your logic here' is slow and blocks the thread, which is precisely what synchronous I/O does.
Moral of the story: avoid using stateful singleton objects from Actors (or from any threads at all, including Executors), OR make them thread-safe through a very careful coding style. Workaround: change it to a def that internally caches the value in a private val.
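One possible reading of that workaround is sketched below, with a lazy val swapped in for the private val because the Scala compiler guards lazy val initialization against concurrent access. CachedConfig, load and loads are hypothetical names; load() is a placeholder for the real config read.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical replacement for the Config object in the question.
object CachedConfig {
  // Counts how often the placeholder load actually runs (for demonstration only).
  val loads = new AtomicInteger(0)

  // Placeholder for the real, possibly slow, config read.
  private def load(): String = {
    loads.incrementAndGet()
    "some-value"
  }

  // Initialized on first use; the value is computed once and then reused.
  private lazy val cached: String = load()

  // Public accessor is a def, so callers never see an uninitialized field.
  def readValueFromConfig(): String = cached
}
```

Callers keep using CachedConfig.readValueFromConfig() exactly as before; only the first call pays for the load.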
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor)
Just because it's not an actor, doesn't mean it's necessarily safe. It probably isn't.
or val a = self ! GetConfigValue and read the result afterwards?
That's almost right. You mean self ? GetConfigValue, I think - that will return a Future, which you can then map over. ! doesn't return anything.
You cannot read from an actor's variables directly inside a Future because (in general) that Future could be running on any thread, on any processor core, and you don't have any memory barrier there to force the CPU caches to reload the value from main memory.
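A common way to follow this advice, sketched with plain scala.concurrent and no Akka dependency: copy the field into a local val on the thread that owns the object, and let the Future close over that local instead of the field. Worker and its members are hypothetical stand-ins for MyActor, and readConfig plays the role of Config.readValueFromConfig.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for MyActor.
class Worker(readConfig: () => String)(implicit ec: ExecutionContext) {
  private val confValue: String = readConfig()

  def start(): Future[String] = {
    // Read the field here, on the calling thread, before the Future is created.
    val snapshot = confValue
    // The Future body only uses the local snapshot, not the object's fields.
    Future(snapshot.toUpperCase)
  }
}
```

Inside a real actor the same pattern applies to anything the Future needs, such as configuration values or the sender reference: take a local copy first, then start the Future.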

Resources