Is this usage of a HashMap thread safe? - multithreading

I happen to work in a project which uses mapperdao library.
Occasionally, it throws an exception which suggests that the library is not thread safe. I could not figure out why though.
The issue happens in this code:
type CacheKey = (Class[_], LazyLoad)

private val classCache = new scala.collection.mutable.HashMap[CacheKey, (Class[_], Map[String, ColumnInfoRelationshipBase[_, Any, Any, Any]])]

def proxyFor[ID, T](constructed: T with Persisted, entity: EntityBase[ID, T], lazyLoad: LazyLoad, vm: ValuesMap): T with Persisted = {
  (...)
  val key = (clz, lazyLoad)
  // get cached proxy class or generate it
  val (proxyClz, methodToCI) = classCache.synchronized {
    classCache.get(key).getOrElse {
      val methods = lazyRelationships.map(ci =>
        ci.getterMethod.getOrElse(
          throw new IllegalStateException("please define getter method on entity %s . %s".format(entity.getClass.getName, ci.column))
        ).getterMethod
      ).toSet
      if (methods.isEmpty)
        throw new IllegalStateException("can't lazy load class that doesn't declare any getters for relationships. Entity: %s".format(clz))
      val proxyClz = createProxyClz(constructedClz, clz, methods)
      val methodToCI = lazyRelationships.map {
        ci =>
          (ci.getterMethod.get.getterMethod.getName, ci.asInstanceOf[ColumnInfoRelationshipBase[T, Any, Any, Any]])
      }.toMap
      val r = (proxyClz, methodToCI)
      classCache.put(key, r)
      r
    }
  }
  val instantiator = objenesis.getInstantiatorOf(proxyClz)
  val instance = instantiator.newInstance.asInstanceOf[DeclaredIds[ID] with T with MethodImplementation[T with Persisted]]
  (...)
}
instantiator.newInstance throws java.lang.NoClassDefFoundError for a class which is supposed to be dynamically compiled and put into the map.
This code seems thread safe to me, as every operation on the map is performed inside the synchronized block. I cannot figure out a scenario in which the map returns a class that has not been generated and compiled yet. Am I missing something?
One other explanation is that the class is compiled but not visible to the current class loader. I don't know how that could happen, though, or why it only happens occasionally.
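For reference, the synchronized get-then-put above is equivalent to a getOrElseUpdate under the same lock; a sketch, with the generation code elided and generatedProxyClz / generatedMethodToCI as hypothetical stand-ins for the values computed above:

val (proxyClz, methodToCI) = classCache.synchronized {
  classCache.getOrElseUpdate(key, {
    // ... generate the proxy class and the method map exactly as in the block above ...
    (generatedProxyClz, generatedMethodToCI) // hypothetical names for the generated pair
  })
}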
UPDATE:
The stack trace looks like:
java.lang.NoClassDefFoundError: Could not initialize class com.mypackage.MyClass_$2
at sun.reflect.GeneratedSerializationConstructorAccessor57.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:56)
at com.googlecode.mapperdao.lazyload.LazyLoadManager.proxyFor(LazyLoadManager.scala:63)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.lazyLoadEntity(MapperDaoImpl.scala:338)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.$anonfun$toEntities$5(MapperDaoImpl.scala:301)
at com.googlecode.mapperdao.internal.EntityMap.$anonfun$get$1(EntityMap.scala:46)
UPDATE 2: I finally found the root cause of my issue. It is indeed a concurrency problem, but not in the code I pasted; it's in the MImpl class.
What is happening is that the dynamically generated class is compiled correctly, but its initialization occasionally fails due to a concurrency issue in the MImpl class. The next time the code tries to instantiate the class, it ends up with a NoClassDefFoundError thrown by the JVM.
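For anyone hitting the same symptom: at the JVM level, once a class's static initializer has thrown, the class is marked as erroneous and every later attempt to use it fails with NoClassDefFoundError: Could not initialize class .... A minimal sketch with a hypothetical Flaky object (not from mapperdao) showing the same two-step failure:

object StaticInitFailureDemo extends App {
  def use(label: String): Unit =
    try println(s"$label -> ${Flaky.value}")
    catch { case t: Throwable => println(s"$label -> ${t.getClass.getName}: ${t.getMessage}") }

  use("first use")  // ExceptionInInitializerError - the real, original failure
  use("second use") // NoClassDefFoundError: Could not initialize class Flaky$
}

object Flaky {
  // Simulated one-off initialization failure (in the real case: the concurrency bug in MImpl)
  if (true) throw new IllegalStateException("initialization failed")
  val value = 42
}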

Related

NullPointer in broadcast value being passed as parameter

:)
I'd like to say that I'm new to Spark, as many of these posts start... but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later on, when this variable is referenced in the part of the code that is executed on the executors (say, in a map or foreach), if the variable reference that was set in the driver is not passed to it, the executor does not know what we are talking about. I think this is explained perfectly here.
My problem is that I am getting a NullPointerException even though I passed the broadcast reference to the executors.
class A {
  var broadcastVal: Broadcast[DataFrame] = _
  ...
  def method1 {
    broadcastVal = otherMethodWhichSendBroadcast
    doSomething(broadcastVal, others)
  }
}

class B {
  def doSomething(...) {
    foreachPartition {x => doSomethingElse(x, broadcastVal)}
  }
}

object C {
  def doSomethingElse(...) {
    broadcastVal.value.show --> Exception
  }
}
What am I missing?
Thanks in advance!
RDDs and DataFrames are already distributed structures, so there is no need to broadcast them as a local variable. (The org.apache.spark.sql.functions.broadcast() function, which is used when doing joins, is not a local-variable broadcast.)
Even if you try it, the code is syntactically fine and won't show any compilation error; instead it will fail at runtime with an exception such as a NullPointerException, which is entirely expected.
Example to explain the behavior:
package examples

import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastCheck extends App {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext

  val df = spark.range(100).toDF()
  var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)

  val t1 = sc.parallelize(0 until 10)
  val t2 = sc.broadcast(2) // this is right, since a local variable can be a primitive, a map or any Scala collection
  val t3 = t1.filter(_ % t2.value == 0).persist() // this is the right way to use the broadcast value inside an RDD operation

  t3.foreach {
    x =>
      println(x)
      // broadcastVal.value.toDF().show // null pointer - wrong way
      // spark.range(100).toDF().show // null pointer - wrong way
  }
}
Result (if you uncomment broadcastVal.value.toDF().show or spark.range(100).toDF().show in the above code):
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
Further reading on the difference between a broadcast variable and the broadcast function: here...
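For completeness, a sketch (illustrative names, assuming Spark 2.x) of the two usual alternatives: keep the DataFrame a DataFrame and let Spark ship it through a broadcast join, or collect a small lookup structure on the driver and broadcast that local value.

package examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastAlternatives extends App {
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  import spark.implicits._
  val sc = spark.sparkContext

  val lookupDf = spark.range(100).toDF("id") // a "small" DataFrame known on the driver
  val data     = sc.parallelize(0L until 10L)

  // Option 1: keep it a DataFrame and let Spark ship it via a broadcast *join* (driver-side API)
  val joined = data.toDF("n").join(broadcast(lookupDf), $"n" === $"id")
  joined.show()

  // Option 2: collect a small local collection and broadcast that; .value is safe on executors
  val lookup = sc.broadcast(lookupDf.collect().map(_.getLong(0)).toSet)
  data.filter(x => lookup.value.contains(x)).foreach(println)
}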

Child thread not seeing updates made by main thread

I'm implementing a SparkHealthListener by extending the SparkListener class.
@Component
class ClusterHealthListener extends SparkListener with Logging {
  val appRunning = new AtomicBoolean(false)
  val executorCount = new AtomicInteger(0)

  override def onApplicationStart(applicationStart: SparkListenerApplicationStart) = {
    logger.info("Application Start called .. ")
    this.appRunning.set(true)
    logger.info(s"[appRunning = ${appRunning.get}]")
  }

  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded) = {
    logger.info("Executor add called .. ")
    this.executorCount.incrementAndGet()
    logger.info(s"[executorCount = ${executorCount.get}]")
  }
}
appRunning and executorCount are two variables declared in the ClusterHealthListener class. ClusterHealthReporterThread will only read the values.
@Component
class ClusterHealthReporterThread @Autowired() (healthListener: ClusterHealthListener) extends Logging {
  new Thread {
    override def run(): Unit = {
      while (true) {
        Thread.sleep(10 * 1000)
        logger.info("Checking range health")
        logger.info(s"[appRunning = ${healthListener.appRunning.get}] [executorCount=${healthListener.executorCount.get}]")
      }
    }
  }.start()
}
ClusterHealthReporterThread always reports the initialized values, regardless of the changes made to the variables by the main thread. What am I doing wrong? Is this because I inject healthListener into ClusterHealthReporterThread?
Update
I played around a bit and it looks like it has something to do with the way I register the Spark listener.
If I add the Spark listener like this:
val sparkContext = SparkContext.getOrCreate(sparkConf)
sparkContext.addSparkListener(healthListener)
the parent thread always shows appRunning as 'false' but shows the executor count correctly. The child thread (health reporter) also shows proper executor counts, but appRunning is always reported as 'false', just like in the main thread.
Then I stumbled across Why is SparkListenerApplicationStart never fired? and tried setting the listener at the Spark config level:
.set("spark.extraListeners", "HealthListener class path")
If I do this, the main thread reports 'true' for appRunning and the correct executor counts, but the child thread always reports 'false' and '0' for executors.
I can't immediately see what's wrong here; you might have found an interesting edge case.
I think @m4gic's comment might be correct, that the logging library is perhaps caching that interpolated string? It looks like you are using https://github.com/lightbend/scala-logging which claims that this interpolation "has no effect on behavior", so maybe not. Please could you follow his suggestion to retry without using that feature and report back?
A second possibility: I wonder if there is only one ClusterHealthListener in the system? Perhaps the autowiring is causing a second instance to be created? Can you log the object ids of the ClusterHealthListener reference in both locations and verify that they are the same object?
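A purely illustrative way to check that second possibility: log the object identity where the listener is constructed and where it is injected, and compare the numbers in the output (a sketch mirroring the classes above, not a drop-in change).

@Component
class ClusterHealthListener extends SparkListener with Logging {
  logger.info(s"ClusterHealthListener created, identity = ${System.identityHashCode(this)}")
  // ... existing onApplicationStart / onExecutorAdded overrides ...
}

@Component
class ClusterHealthReporterThread @Autowired() (healthListener: ClusterHealthListener) extends Logging {
  logger.info(s"Reporter wired to listener, identity = ${System.identityHashCode(healthListener)}")
  // ... existing reporting thread ...
}

If the two identity values differ, Spark (via spark.extraListeners) and Spring are each constructing their own instance, and the reporter is reading a listener that never receives events.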
If neither of those suggestions fix this, are you able to post a working example that I can play with?

task not serializable error while performing an RDD map function scala

Trust me on this one, I have tried really hard, toiling night and day, but I just could not get hold of this Task not serializable error, which is now totally eating me up!
And I understand there are many similar questions floating around on SO, but either I am really too dumb to get those (I am not expecting any spoon feeding here) or mine is a different bug story altogether.
Totally requesting you guys to have a look at this one:
class RootServer(val config: Config) extends Configurable with Server with Serializable {

  var PKsAffectedDF: DataFrame = dataSource.getAffectedPKs(prevBookMark, currBookMark)

  if (PKsAffectedDF.rdd.isEmpty()) {
    Holder.log.info("[SQL] no records to upsert in the internal Status Table == for bookmarks : " + prevBookMark + " ==and== " + currBookMark + " for table == " + dataSource.db + "_" + dataSource.table)
  }

  val PKsAffectedDF_json = PKsAffectedDF.toJSON

  PKsAffectedDF_json.foreachPartition { partitionOfRecords => {
    var props = new Properties()
    props.put("", "")
    props.put("", "")
    props.put("", "")
    props.put("", "")
    val producer = new KafkaProducer[String, String](props)
    partitionOfRecords.foreach {
      case x: String => {
        println(x)
        val message = new ProducerRecord[String, String]("[TOPIC] " + dbname + "_" + dbtable, dbname, x)
        producer.send(message)
      }
    }
  }
  }
}
Now, Config is the Typesafe Config class, which I believe is serializable, and Server is a basic abstract class with just two methods. I am totally stuck on what could be giving the following stack trace:
http://pastebin.com/5yPA1s7e (sorry for pastebinning it, but the entire stack trace is there).
Thanks in Advance peeps.
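One thing worth checking in the snippet above (a hedged sketch, not necessarily the actual cause here): if dbname, dbtable or the producer configuration come from fields of RootServer, the foreachPartition closure captures this, so Spark has to serialize the whole RootServer instance. The usual pattern is to copy everything the closure needs into local vals first:

// imports and surrounding setup as in the original snippet; field names are assumed
val localDbName  = dbname          // plain values, cheap to serialize
val localDbTable = dbtable
val topic        = "[TOPIC] " + localDbName + "_" + localDbTable

PKsAffectedDF_json.foreachPartition { partitionOfRecords =>
  val props = new Properties()
  // ... producer configuration as before ...
  val producer = new KafkaProducer[String, String](props)
  partitionOfRecords.foreach { x =>
    producer.send(new ProducerRecord[String, String](topic, localDbName, x))
  }
  producer.close()
}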

Using Squeryl with Akka Actors

So I'm trying to work with both Squeryl and Akka Actors. I've done a lot of searching and all I've been able to find is the following Google Group post:
https://groups.google.com/forum/#!topic/squeryl/M0iftMlYfpQ
I think I might have shot myself in the foot as I originally created this factory pattern so I could toss around Database objects.
object DatabaseType extends Enumeration {
  type DatabaseType = Value
  val Postgres = Value(1, "Postgres")
  val H2 = Value(2, "H2")
}

object Database {
  def getInstance(dbType: DatabaseType, jdbcUrl: String, username: String, password: String): Database = {
    Class.forName(jdbcDriver(dbType))
    new Database(Session.create(
      _root_.java.sql.DriverManager.getConnection(jdbcUrl, username, password),
      squerylAdapter(dbType)))
  }

  private def jdbcDriver(db: DatabaseType) = {
    db match {
      case DatabaseType.Postgres => "org.postgresql.Driver"
      case DatabaseType.H2 => "org.h2.Driver"
    }
  }

  private def squerylAdapter(db: DatabaseType) = {
    db match {
      case DatabaseType.Postgres => new PostgreSqlAdapter
      case DatabaseType.H2 => new H2Adapter
    }
  }
}
Originally in my implementation, I tried surrounding all my statements in using(session), but I'd keep getting the dreaded "No session is bound to the current thread" error, so I added session.bindToCurrentThread to the constructor.
class Database(session: Session) {
  session.bindToCurrentThread

  def failedBatch(filename: String, message: String, start: Timestamp = now, end: Timestamp = now) =
    batch.insert(new Batch(0, filename, Some(start), Some(end), ProvisioningStatus.Fail, Some(message)))

  def startBatch(batch_id: Long, start: Timestamp = now) =
    batch update (b => where(b.id === batch_id) set (b.start := Some(start)))

  ...more functions
This worked reasonably well, until I got to the Akka actors.
class TransferActor() extends Actor {

  def databaseInstance() = {
    val dbConfig = config.getConfig("provisioning.database")
    Database.getInstance(DatabaseType.Postgres,
      dbConfig.getString("jdbcUrl"),
      dbConfig.getString("username"),
      dbConfig.getString("password"))
  }

  lazy val config = ConfigManager.current

  override def receive: Actor.Receive = { /* .. do some work */
I constantly get the following:
[ERROR] [03/11/2014 17:02:57.720] [provisioning-system-akka.actor.default-dispatcher-4] [akka://provisioning-system/user/$c] No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
Usually this error occurs when a statement is executed outside of a transaction/inTrasaction block
java.lang.RuntimeException: No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
I'm getting a fresh Database object each time, not caching it with a lazy val, so shouldn't that constructor always get called and attach to my current thread? Does Akka attach different threads to different actors and swap them around? Should I just add a function to call session.bindToCurrentThread each time I'm in an actor? Seems kinda hacky.
Does Akka attach different threads to different actors and swap them around?
That's exactly how the actor model works. The idea is that you can have a small thread pool servicing a very large number of actors, because an actor only needs a thread while it has a message waiting to be processed.
Some general tips for Squeryl... A session has a one-to-one association with a JDBC connection. The main advantage of keeping a Session open is that you can have a transaction open that gives you a consistent view of the database as you perform multiple operations. If you don't need that, make your session/transaction code granular to avoid these types of issues. If you do need it, don't rely on Sessions being available in a thread-local context. Use the transaction(session){} or transaction(sessionFactory){} methods to explicitly tell Squeryl where you want your Session to come from.
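A sketch of that last suggestion (connection details and names are assumed, not taken from the question): configure a SessionFactory once, then wrap each unit of work inside the actor in its own transaction instead of binding a session to a thread.

import akka.actor.Actor
import org.squeryl.{Session, SessionFactory}
import org.squeryl.adapters.PostgreSqlAdapter
import org.squeryl.PrimitiveTypeMode._

object Db {
  // One-time setup, e.g. at application start; jdbcUrl/username/password would come from config.
  def init(jdbcUrl: String, username: String, password: String): Unit = {
    Class.forName("org.postgresql.Driver")
    SessionFactory.concreteFactory = Some(() =>
      Session.create(
        java.sql.DriverManager.getConnection(jdbcUrl, username, password),
        new PostgreSqlAdapter))
  }
}

class TransferActor extends Actor {
  override def receive: Actor.Receive = {
    case _ =>
      // Each message handler opens and closes its own session/transaction,
      // so it does not matter which dispatcher thread happens to run this actor.
      transaction {
        // ... Squeryl queries / inserts here ...
      }
  }
}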

Actor's value sometimes returns null

I have an Actor and some other object:
object Config {
  val readValueFromConfig() = { //....}
}

class MyActor extends Actor {
  val confValue = Config.readValueFromConfig()
  val initValue = Future {
    val a = confValue // sometimes it's null
    val a = Config.readValueFromConfig() //always works well
  }
  //..........
}
The code above is a very simplified version of what I actually have. The odd thing is that sometimes val a = confValue returns null, whereas if I replace it with val a = Config.readValueFromConfig() then it always works well.
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor) or val a = self ! GetConfigValue and read the result afterwards?
val readValueFromConfig() = { //....}
This gives me a compile error. I assume you mean without parentheses?
val readValueFromConfig = { //....}
Same logic with different timing gives different results = a race condition.
val confValue = Config.readValueFromConfig() is always executed during construction of MyActor objects (because it's a field of MyActor). Sometimes this returns null.
val a = Config.readValueFromConfig() //always works well is always executed later - after MyActor is constructed, when the Future initValue is executed by its Executor. It seems this never returns null.
Possible causes:
Could be explained away if the body of readValueFromConfig was dependent upon another
parallel/async operation having completed. Any chance you're reading the config asynchronously? Given the name of this method, it probably just reads synchronously from a file - meaning this is not the cause.
Singleton objects are not thread-safe?? I compiled your code. Here's the decompilation of your singleton object's Java classes:
public final class Config
{
  public static String readValueFromConfig()
  {
    return Config$.MODULE$.readValueFromConfig();
  }
}

public final class Config$
{
  public static final Config$ MODULE$;
  private final String readValueFromConfig;

  static
  {
    new Config$();
  }

  public String readValueFromConfig()
  {
    return this.readValueFromConfig;
  }

  private Config$()
  {
    MODULE$ = this;
    this.readValueFromConfig = // ... your logic here;
  }
}
Mmmkay... Unless I'm mistaken, that ain't thread-safe.
If two threads access readValueFromConfig (say Thread1 accesses it first), then inside the private Config$() constructor, MODULE$ is unsafely published before this.readValueFromConfig is set (the reference to this prematurely escapes the constructor). Thread2, which is right behind, can read MODULE$.readValueFromConfig before it is set. This is highly likely to be a problem if '... your logic here' is slow and blocks the thread - which is precisely what synchronous I/O does.
Moral of the story: avoid accessing stateful singleton objects from actors (or any threads at all, including Executors), OR make them thread-safe through very careful coding. Work-around: change it to a def which internally caches the value in a private val.
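A sketch of that work-around, in one possible form (a lazy val gives thread-safe, initialize-on-first-access caching instead of initializing during the object's constructor):

object Config {
  // Computed on first access, under the lock that `lazy val` uses, rather than during
  // the unsafe publication of MODULE$ in the object's constructor.
  private lazy val cachedValue: String = {
    // ... read the value from the config source synchronously ...
    "value-from-config" // placeholder
  }

  def readValueFromConfig(): String = cachedValue
}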
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor)
Just because it's not an actor, doesn't mean it's necessarily safe. It probably isn't.
or val a = self ! GetConfigValue and read the result afterwards?
That's almost right. You mean self ? GetConfigValue, I think - that will return a Future, which you can then map over. ! doesn't return anything.
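For illustration, a sketch of that ask-based approach (message and actor names are hypothetical):

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

case object GetConfigValue

class ConfigHoldingActor extends Actor {
  private val confValue = "value-from-config" // stand-in for Config.readValueFromConfig()
  def receive: Receive = {
    case GetConfigValue => sender() ! confValue // the actor hands out its own state
  }
}

object AskExample extends App {
  val system = ActorSystem("demo")
  val actor = system.actorOf(Props[ConfigHoldingActor], "config-holder")

  implicit val timeout: Timeout = 5.seconds
  import system.dispatcher // ExecutionContext for the Future callbacks

  // `?` returns a Future; map/foreach over it instead of reading the actor's fields directly.
  (actor ? GetConfigValue).mapTo[String].foreach(println)
}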
You cannot read from an actor's variables directly inside a Future because (in general) that Future could be running on any thread, on any processor core, and you don't have any memory barrier there to force the CPU caches to reload the value from main memory.
