task not serializable error while performing an RDD map function scala - apache-spark

Trust me on this one, I have been toiling night and day but just could not get hold of this Task not serializable error, which is now totally eating me up.
I understand there are many similar questions floating around on SO, but either I am really too dumb to get those (I am not expecting any spoon feeding here) or mine is a different bug story altogether.
Requesting you guys to have a look at this one:
class RootServer(val config: Config) extends Configurable with Server with Serializable {
  var PKsAffectedDF: DataFrame = dataSource.getAffectedPKs(prevBookMark, currBookMark)
  if (PKsAffectedDF.rdd.isEmpty()) {
    Holder.log.info("[SQL] no records to upsert in the internal Status Table == for bookmarks : " + prevBookMark + " ==and== " + currBookMark + " for table == " + dataSource.db + "_" + dataSource.table)
  }
  val PKsAffectedDF_json = PKsAffectedDF.toJSON
  PKsAffectedDF_json.foreachPartition { partitionOfRecords => {
      // Kafka producer config (keys and values omitted here)
      var props = new Properties()
      props.put("", "")
      props.put("", "")
      props.put("", "")
      props.put("", "")
      val producer = new KafkaProducer[String, String](props)
      partitionOfRecords.foreach {
        case x: String => {
          println(x)
          val message = new ProducerRecord[String, String]("[TOPIC] " + dbname + "_" + dbtable, dbname, x)
          producer.send(message)
        }
      }
    }
  }
}
Now Config is the Typesafe Config class, which I believe is serializable, and the Server class is a basic abstract class with just two methods. I am totally stuck at what could be giving the following stacktrace:
http://pastebin.com/5yPA1s7e Sorry for pastebinning it but the entire stacktrace is there.
Thanks in Advance peeps.
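For anyone comparing notes on the same error: the usual suspect with code shaped like this is that the foreachPartition closure references fields such as dbname and dbtable, which forces Spark to serialize the whole enclosing instance and everything it holds. Below is a minimal sketch of the common workaround, assuming those really are fields on RootServer; the Kafka config values are placeholders, not the original ones.
// Hedged sketch, not necessarily the fix for this exact trace: copy the fields
// the closure needs into local vals, so foreachPartition serializes only those
// values instead of the whole RootServer instance.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val localDbName  = dbname   // assumed to be fields on the enclosing class
val localDbTable = dbtable
val producerConfig = Map(   // hypothetical config, fill in your own values
  "bootstrap.servers" -> "broker:9092",
  "key.serializer"    -> "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer"  -> "org.apache.kafka.common.serialization.StringSerializer")

PKsAffectedDF_json.foreachPartition { partitionOfRecords =>
  val props = new Properties()
  producerConfig.foreach { case (k, v) => props.put(k, v) }
  val producer = new KafkaProducer[String, String](props)
  partitionOfRecords.foreach { x =>
    producer.send(new ProducerRecord[String, String](localDbName + "_" + localDbTable, localDbName, x))
  }
  producer.close()
}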

Related

NullPointer in broadcast value being passed as parameter

:)
I'd like to say that I'm new to Spark, as many of these posts start... but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later on, when this variable is referenced in the part of the code that is executed on the executors (say, in a map or foreach), if the variable reference that was set in the driver is not passed along, the executor does not know what we are talking about. Which I think is perfectly explained here.
My problem is that I am getting a NullPointerException even though I passed the broadcast reference to the executors.
class A {
  var broadcastVal: Broadcast[DataFrame] = _
  ...
  def method1 {
    broadcastVal = otherMethodWhichSendBroadcast
    doSomething(broadcastVal, others)
  }
}
class B {
  def doSomething(...) {
    foreachPartition { x => doSomethingElse(x, broadcastVal) }
  }
}
object C {
  def doSomethingElse(...) {
    broadcastVal.value.show // --> Exception
  }
}
What am I missing?
Thanks in advance!
RDDs and DataFrames are already distributed structures; there is no need to broadcast them as a local variable. (The org.apache.spark.sql.functions.broadcast() function, which is used when doing joins, is not a local-variable broadcast.)
Even if you try it, the code won't show any compilation error; rather, it will throw a RuntimeException such as a NullPointerException at runtime, which is entirely expected.
Example to explain the behavior:
package examples
import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}
object BroadCastCheck extends App {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext

  val df = spark.range(100).toDF()
  var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)

  val t1 = sc.parallelize(0 until 10)
  val t2 = sc.broadcast(2) // this is right, since a local variable can be a primitive, a map or any Scala collection
  val t3 = t1.filter(_ % t2.value == 0).persist() // this is the way to use the broadcast value inside a transformation

  t3.foreach {
    x =>
      println(x)
      // broadcastVal.value.toDF().show // null pointer - wrong way
      // spark.range(100).toDF().show   // null pointer - wrong way
  }
}
Result (if you uncomment broadcastVal.value.toDF().show or spark.range(100).toDF().show in the above code):
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
For further reading on the difference between a broadcast variable and the broadcast function, see here...
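To make the contrast concrete, here is a minimal sketch of the usual alternative, under the assumption that the data to share is small enough to collect: pull the rows into a plain Scala collection on the driver and broadcast that, instead of the DataFrame itself.
// Hedged sketch: broadcast a plain Scala collection derived from the DataFrame,
// not the DataFrame itself. Assumes the data fits comfortably in driver memory.
import org.apache.spark.sql.SparkSession

object BroadcastLocalCollection extends App {
  val spark = SparkSession.builder().appName("BroadcastLocalCollection").master("local").getOrCreate()
  val sc = spark.sparkContext

  val df = spark.range(100).toDF("id")

  // Collect the needed values to the driver as an ordinary Scala Set...
  val localIds: Set[Long] = df.collect().map(_.getLong(0)).toSet
  // ...and broadcast that local collection.
  val idsBc = sc.broadcast(localIds)

  // Safe to use inside executor-side code, because the broadcast holds plain data.
  val filtered = sc.parallelize(0L until 200L).filter(idsBc.value.contains)
  filtered.take(10).foreach(println)

  spark.stop()
}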

Camel-Olingo2: The metadata constraints '[Nullable=true, MaxLength=16]' do not match the literal

I'm using the camel-olingo2 component to query SAP SuccessFactors over OData V2 endpoints. The route is:
from("direct:start")
.to(olingoEndpoint)
.process(paging)
.loopDoWhile(simple("\${header.CamelOlingo2.\$skiptoken} != null"))
.to(olingoEndpoint)
.process(paging)
.end()
Paging processor is:
Processor paging = new Processor() {
#Override
void process(Exchange g) throws Exception {
ODataDeltaFeed feed = g.in.getMandatoryBody(ODataDeltaFeed)
if (consumer) feed.getEntries().forEach(consumer)
String next = feed.getFeedMetadata().getNextLink()
if (next) {
List<NameValuePair> lst = URLEncodedUtils.parse(new URI(next), StandardCharsets.UTF_8)
NameValuePair skiptoken = lst.find { it.name == "\$skiptoken" }
g.out.headers."CamelOlingo2.\$skiptoken" = skiptoken.value
} else {
g.out.headers.remove("CamelOlingo2.\$skiptoken")
}
}
}
Everything is OK with most of the entities, but several entities have fields declared with the wrong nullability or data length, so I get:
Caused by: org.apache.olingo.odata2.api.edm.EdmSimpleTypeException: The metadata constraints '[Nullable=true, MaxLength=16]' do not match the literal 'Bor.Kralja Petra I.16'.
at org.apache.olingo.odata2.core.edm.EdmString.internalValueOfString(EdmString.java:62)
at org.apache.olingo.odata2.core.edm.AbstractSimpleType.valueOfString(AbstractSimpleType.java:91)
at org.apache.olingo.odata2.core.ep.consumer.JsonPropertyConsumer.readSimpleProperty(JsonPropertyConsumer.java:236)
at org.apache.olingo.odata2.core.ep.consumer.JsonPropertyConsumer.readPropertyValue(JsonPropertyConsumer.java:169)
In the documentation for the Olingo2 Camel component I cannot find a way to disable this check, or any other workaround. Can you suggest a good way?
Please do not recommend server-side data changes, e.g. modifying the metadata; that is out of scope for this task.
I have a plan B: plain HTTPS requests with JSON parsing. It's quite simple, but a little boring.
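In case plan B ends up being the route, here is a rough sketch of what it could look like (the host, entity set, credentials and field handling below are all hypothetical): call the OData V2 endpoint with $format=json, parse the payload with Jackson, and follow the __next link until it disappears, which bypasses Olingo's EDM validation entirely.
// Hedged sketch of "plan B": no Olingo, no Camel, just HTTP + JSON paging.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import scala.io.Source

object PlanB extends App {
  val mapper = new ObjectMapper()
  val auth = Base64.getEncoder.encodeToString("user:password".getBytes(StandardCharsets.UTF_8))

  // Fetch one page of the OData V2 JSON payload.
  def fetch(url: String): JsonNode = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Authorization", s"Basic $auth")
    conn.setRequestProperty("Accept", "application/json")
    val body = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
    conn.disconnect()
    mapper.readTree(body)
  }

  var url: String = "https://api.successfactors.example/odata/v2/PerPersonal?$format=json"
  while (url != null) {
    val d = fetch(url).get("d")
    val entries = d.get("results").elements()
    while (entries.hasNext) println(entries.next())   // process each entity here
    val next = d.get("__next")                        // paging link with $skiptoken, if any
    url = if (next != null && !next.isNull) next.asText() else null
  }
}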

Is this usage of a HashMap thread safe?

I happen to work on a project which uses the mapperdao library.
Occasionally it throws an exception which suggests that the library is not thread safe, but I could not figure out why.
The issue happens in this code:
type CacheKey = (Class[_], LazyLoad)

private val classCache = new scala.collection.mutable.HashMap[CacheKey, (Class[_], Map[String, ColumnInfoRelationshipBase[_, Any, Any, Any]])]

def proxyFor[ID, T](constructed: T with Persisted, entity: EntityBase[ID, T], lazyLoad: LazyLoad, vm: ValuesMap): T with Persisted = {
  (...)
  val key = (clz, lazyLoad)
  // get cached proxy class or generate it
  val (proxyClz, methodToCI) = classCache.synchronized {
    classCache.get(key).getOrElse {
      val methods = lazyRelationships.map(ci =>
        ci.getterMethod.getOrElse(
          throw new IllegalStateException("please define getter method on entity %s . %s".format(entity.getClass.getName, ci.column))
        ).getterMethod
      ).toSet
      if (methods.isEmpty)
        throw new IllegalStateException("can't lazy load class that doesn't declare any getters for relationships. Entity: %s".format(clz))
      val proxyClz = createProxyClz(constructedClz, clz, methods)
      val methodToCI = lazyRelationships.map {
        ci =>
          (ci.getterMethod.get.getterMethod.getName, ci.asInstanceOf[ColumnInfoRelationshipBase[T, Any, Any, Any]])
      }.toMap
      val r = (proxyClz, methodToCI)
      classCache.put(key, r)
      r
    }
  }
  val instantiator = objenesis.getInstantiatorOf(proxyClz)
  val instance = instantiator.newInstance.asInstanceOf[DeclaredIds[ID] with T with MethodImplementation[T with Persisted]]
  (...)
}
instantiator.newInstance throws a java.lang.NoClassDefFoundError for a class which is supposed to be dynamically compiled, with its name put into the map.
This code seems thread safe to me, as every operation on the map is performed inside the synchronized block. I cannot figure out a scenario in which the map returns a class that has not been generated and compiled yet. Am I missing something?
Another explanation is that the class is compiled but not visible to the current class loader. I don't know how that could happen, though, or why it happens only occasionally.
UPDATE:
The stack trace looks like:
java.lang.NoClassDefFoundError: Could not initialize class com.mypackage.MyClass_$2
at sun.reflect.GeneratedSerializationConstructorAccessor57.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:56)
at com.googlecode.mapperdao.lazyload.LazyLoadManager.proxyFor(LazyLoadManager.scala:63)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.lazyLoadEntity(MapperDaoImpl.scala:338)
at com.googlecode.mapperdao.jdbc.impl.MapperDaoImpl.$anonfun$toEntities$5(MapperDaoImpl.scala:301)
at com.googlecode.mapperdao.internal.EntityMap.$anonfun$get$1(EntityMap.scala:46)
UPDATE 2: I finally found the root cause of my issue. It is a concurrency problem indeed, but not in the code I pasted; it is in the MImpl class.
What is happening is that the dynamically generated class is compiled correctly, but its initialization occasionally fails due to a concurrency issue in the MImpl class. The next time the code tries to instantiate the class, it ends up with a NoClassDefFoundError thrown by the JVM.
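For anyone hitting the same confusing error, here is a minimal self-contained sketch of that JVM behaviour, with hypothetical names: once a class's initializer has failed, the first use surfaces as ExceptionInInitializerError, and every later use of the same class surfaces as NoClassDefFoundError ("Could not initialize class ..."), even though the class was compiled and loaded fine.
// Hedged sketch (hypothetical names): a class whose initializer fails once
// stays in the "erroneous" state, so later uses throw NoClassDefFoundError
// even though the class itself was compiled and loaded without problems.
object Flaky {
  if (FlakyDemo.shouldFail) throw new IllegalStateException("init failed")
  val value = 42
}

object FlakyDemo {
  @volatile var shouldFail = true

  def main(args: Array[String]): Unit = {
    // First access: the initializer runs and blows up.
    try println(Flaky.value)
    catch { case e: LinkageError => println(s"first access:  $e") }  // ExceptionInInitializerError

    shouldFail = false
    // Second access: the initializer is NOT retried; the JVM reports the class as unusable.
    try println(Flaky.value)
    catch { case e: LinkageError => println(s"second access: $e") }  // NoClassDefFoundError: Could not initialize class Flaky$
  }
}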

Read & write data into cassandra using apache flink Java API

I intend to use Apache Flink to read and write data in Cassandra. I was hoping to use flink-connector-cassandra, but I can't find good documentation/examples for the connector.
Can you please point me to the right way to read and write data from Cassandra using Apache Flink? I only see sink examples, which are purely for writing. Is Apache Flink also meant for reading data from Cassandra, similar to Apache Spark?
I had the same question, and this is what I was looking for. I don't know if it is oversimplified for what you need, but I figured I should show it nonetheless.
ClusterBuilder cb = new ClusterBuilder() {
    @Override
    public Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoint("urlToUse.com").withPort(9042).build();
    }
};

CassandraInputFormat<Tuple2<String, String>> cassandraInputFormat = new CassandraInputFormat<>("SELECT * FROM example.cassandraconnectorexample", cb);

cassandraInputFormat.configure(null);
cassandraInputFormat.open(null);

Tuple2<String, String> testOutputTuple = new Tuple2<>();
cassandraInputFormat.nextRecord(testOutputTuple);

System.out.println("column1: " + testOutputTuple.f0);
System.out.println("column2: " + testOutputTuple.f1);
The way I figured this out was by finding the code for the CassandraInputFormat class and seeing how it worked (http://www.javatips.net/api/flink-master/flink-connectors/flink-connector-cassandra/src/main/java/org/apache/flink/batch/connectors/cassandra/CassandraInputFormat.java). Based on the name I honestly expected it to just be a format and not a full class for reading from Cassandra, and I have a feeling others might think the same.
ClusterBuilder cb = new ClusterBuilder() {
    @Override
    public Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoint("localhost").withPort(9042).build();
    }
};

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

InputFormat inputFormat = new CassandraInputFormat<Tuple3<Integer, Integer, Integer>>("SELECT * FROM test.example;", cb); //, TypeInformation.of(Tuple3.class));
DataStreamSource t = env.createInput(inputFormat, TupleTypeInfo.of(new TypeHint<Tuple3<Integer, Integer, Integer>>() {}));

tableEnv.registerDataStream("t1", t);
Table t2 = tableEnv.sql("select * from t1");
t2.printSchema();
You can extend RichFlatMapFunction:
class MongoMapper extends RichFlatMapFunction[JsonNode, JsonNode] {

  var userCollection: MongoCollection[Document] = _

  override def open(parameters: Configuration): Unit = {
    // do something here, like opening the connection
    val client: MongoClient = MongoClient("mongodb://localhost:10000")
    userCollection = client.getDatabase("gp_stage").getCollection("users").withReadPreference(ReadPreference.secondaryPreferred())
    super.open(parameters)
  }

  override def flatMap(event: JsonNode, out: Collector[JsonNode]): Unit = {
    // Do something here per record; this function can make use of objects initialized via open
    userCollection.find(Filters.eq("_id", somevalue)).limit(1).first().subscribe(
      (result: Document) => {
        // println(result)
      },
      (t: Throwable) => {
        println(t)
      },
      () => {
        out.collect(event)
      }
    )
  }
}
Basically the open function executes once per worker and flatMap executes once per record. The example is for Mongo, but it can be used similarly for Cassandra.
In your case, as I understand it, the first step of your pipeline is reading data from Cassandra, so rather than writing a RichFlatMapFunction you should write your own RichSourceFunction, as in the sketch below.
As a reference you can have a look at the simple implementation of WikipediaEditsSource.
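For completeness, here is a minimal sketch of such a source (the host, keyspace, table and column names are hypothetical), reading with the DataStax Java driver and emitting tuples; it performs a single one-shot read, with no checkpointing or retries.
// Hedged sketch: a RichSourceFunction that reads rows from Cassandra and
// emits them into a Flink stream. Connection handling is deliberately minimal.
import com.datastax.driver.core.{Cluster, Session}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.RichSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext

class CassandraReadSource extends RichSourceFunction[(String, String)] {
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _
  @volatile private var running = true

  override def open(parameters: Configuration): Unit = {
    cluster = Cluster.builder().addContactPoint("localhost").withPort(9042).build()
    session = cluster.connect()
  }

  override def run(ctx: SourceContext[(String, String)]): Unit = {
    // One-shot read; emit each row as a (col1, col2) tuple.
    val rows = session.execute("SELECT col1, col2 FROM example.cassandraconnectorexample").iterator()
    while (running && rows.hasNext) {
      val row = rows.next()
      ctx.collect((row.getString("col1"), row.getString("col2")))
    }
  }

  override def cancel(): Unit = running = false

  override def close(): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}
// Usage: env.addSource(new CassandraReadSource).print()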

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub using their REST APIs. The purpose of this is to get insight into the commit-build relationship, in order to further perform numerous analyses.
For this, I've implemented the following Travis custom receiver:
object TravisUtils {
  def createStream(ctx: StreamingContext, storageLevel: StorageLevel): ReceiverInputDStream[Build] = new TravisInputDStream(ctx, storageLevel)
}

private[streaming]
class TravisInputDStream(ctx: StreamingContext, storageLevel: StorageLevel) extends ReceiverInputDStream[Build](ctx) {
  def getReceiver(): Receiver[Build] = new TravisReceiver(storageLevel)
}

private[streaming]
class TravisReceiver(storageLevel: StorageLevel) extends Receiver[Build](storageLevel) with Logging {

  def onStart(): Unit = {
    new BuildStream().addListener(new BuildListener {
      override def onBuildsReceived(numberOfBuilds: Int): Unit = {
      }
      override def onBuildRepositoryReceived(build: Build): Unit = {
        store(build)
      }
      override def onException(e: Exception): Unit = {
        reportError("Exception while streaming travis", e)
      }
    })
  }

  def onStop(): Unit = {
  }
}
The receiver uses my custom-made Travis API library (developed in Java using the Apache Async Client). However, the problem is the following: the data that I should be receiving is continuous and changes constantly, i.e. it is being pushed to Travis and GitHub all the time. As an example, consider that GitHub records approximately 350 events per second, including push events, commit comments and similar.
But when streaming either GitHub or Travis, I do get the data from the first two batches, but afterwards the RDDs that make up the DStream are empty, although there is data to be streamed!
I've checked a couple of things so far, including the HttpClient used for issuing requests to the API, but none of them actually solved this problem.
Therefore, my question is: what could be going on? Why isn't Spark streaming the data after period x passes? Below you may find the context and configuration that are set up:
val configuration = new SparkConf().setAppName("StreamingSoftwareAnalytics").setMaster("local[2]")
val ctx = new StreamingContext(configuration, Seconds(3))
val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)

// RDD IS EMPTY - that is what is happening!
stream.window(Seconds(9)).foreachRDD(rdd => {
  if (rdd.isEmpty()) { println("RDD IS EMPTY") } else { rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId)) }
})

ctx.start()
ctx.awaitTermination()
Thanks in advance!
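One thing that may be worth ruling out, stated purely as an assumption rather than a diagnosis of this particular setup: each receiver-based stream permanently occupies one core, so the local master must provide more cores than there are receivers, otherwise batches can silently stay empty. A sketch of what that adjustment would look like here (the "local[*]" value and the second receiver are illustrative, not taken from the original post):
// Hedged sketch: give the local streaming app more cores than receivers.
// local[2] with two receivers (GitHub + Travis) would leave no cores for
// processing the batches themselves.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val configuration = new SparkConf()
  .setAppName("StreamingSoftwareAnalytics")
  .setMaster("local[*]") // must be > number of receivers

val ctx = new StreamingContext(configuration, Seconds(3))
val gitHubStream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)
val travisStream = TravisUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)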
