I wonder how I can pass an OJAI connection from the Spark driver to its executors. Here's my code:
val connection = DriverManager.getConnection("ojai:mapr:")
val store = connection.getStore("/tables/table1")
val someStream = messagesDStream.mapPartitions {
  iterator => {
    val list = iterator
      .map(record => record.value())
      .toList
      .asJava
    // TODO: serialization, deserialization, Serializable interface in Java
    val query = connection
      .newQuery()
      .where(connection.newCondition()
        .in("_id", list)
        .build())
      .build()
  }
}
And this is the error I got:
Caused by: java.io.NotSerializableException: com.mapr.ojai.store.impl.OjaiConnection
Serialization stack:
- object not serializable (class: com.mapr.ojai.store.impl.OjaiConnection, value: com.mapr.ojai.store.impl.OjaiConnection@2a367e93)
- field (class: com.example.App$$anonfun$1, name: connection$1, type: interface org.ojai.store.Connection)
- object (class com.example.App$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
...
As long as the connection to OJAI is created inside the mapPartitions function, everything works fine. I know that I need to pass the configuration from the driver to the executors for the code to work, but I don't know how to do it. Bye!
You're running into Spark's most infamous error: Task not serializable.
Essentially, it means that one of the classes or objects you're attempting to serialize (that is, send over the network from the driver to the executors) cannot be processed this way. Here, it's the OJAI connection.
You cannot pass the connection itself from the driver to the executors. What you can do, while avoiding re-creating the connection for every batch of RDDs coming from your stream, is declare the connection in a companion object as
@transient lazy val connection = ...
and refer to that inside mapPartitions. This ensures that each executor has its own connection to the database, which persists across batches: fields marked this way are not created on the driver and then serialized, but are created on each executor instead.
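A minimal sketch of why the holder-object approach works, using only JDK serialization rather than Spark or OJAI. All names here (FakeConnection, ConnectionHolder, ClosureDemo) are hypothetical; FakeConnection merely stands in for a non-serializable client such as the OJAI connection, and the sketch assumes Scala 2.12 or later, where function literals are serializable lambdas.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable client (e.g. an OJAI connection).
class FakeConnection {
  def lookup(id: String): String = s"doc-$id"
}

// One lazily created instance per JVM; nothing here is ever shipped with a task.
object ConnectionHolder {
  @transient lazy val connection: FakeConnection = new FakeConnection
}

object ClosureDemo {
  // True if Java serialization (which Spark uses for closures by default) accepts f.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Captures a local, non-serializable value: this is the failing pattern.
  def capturing(): String => String = {
    val local = new FakeConnection
    id => local.lookup(id)
  }

  // Refers to the holder object instead: the closure captures nothing.
  def viaHolder(): String => String =
    id => ConnectionHolder.connection.lookup(id)

  def main(args: Array[String]): Unit = {
    println(s"capturing closure serializes: ${serializes(capturing())}")
    println(s"holder closure serializes:    ${serializes(viaHolder())}")
  }
}
```

In the streaming job, the body of mapPartitions would then call ConnectionHolder.connection instead of a connection value defined on the driver.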
I'd like to say that I'm new to Spark, as many of these posts start, but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later on, when this variable is referenced in code that runs on the executors (say, map or foreach), the executor has no idea what we are talking about unless the broadcast reference that was set in the driver is passed to it. Which I think is perfectly explained here
My problem is that I am getting a NullPointerException even though I passed the broadcast reference to the executors.
class A {
  var broadcastVal: Broadcast[DataFrame] = _
  ...
  def method1 {
    broadcastVal = otherMethodWhichSendBroadcast
    doSomething(broadcastVal, others)
  }
}
class B {
  def doSomething(...) {
    foreachPartition { x => doSomethingElse(x, broadcastVal) }
  }
}
object C {
  def doSomethingElse(...) {
    broadcastVal.value.show // --> Exception
  }
}
What am I missing?
Thanks in advance!
RDDs and DataFrames are already distributed structures, so there is no need to broadcast them as local variables. (The org.apache.spark.sql.functions.broadcast() function, which is used when doing joins, is not the same thing as a local broadcast variable.)
Even if you try it, the code compiles without any error; instead it throws a runtime exception such as NullPointerException, which is entirely expected.
Example to explain the behavior:
package examples

import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastCheck extends App {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  val df = spark.range(100).toDF()
  var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)
  val t1 = sc.parallelize(0 until 10)
  val t2 = sc.broadcast(2) // this is right: a local variable, which can be a primitive, a map or any Scala collection
  val t3 = t1.filter(_ % t2.value == 0).persist() //this is the way of ha
  t3.foreach { x =>
    println(x)
    // broadcastVal.value.toDF().show // wrong way: NullPointerException
    // spark.range(100).toDF().show // wrong way: NullPointerException
  }
}
Result (if you uncomment broadcastVal.value.toDF().show or spark.range(100).toDF().show in the code above):
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
For further reading, see the difference between a broadcast variable and the broadcast function here...
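As a rough sketch of the alternative, assuming the DataFrame is small enough to fit on the driver: collect it into a plain local collection first and broadcast that collection instead of the DataFrame. The object name BroadcastLocalData and the even-number lookup are made up for illustration; the Spark calls used (collect, broadcast, parallelize, filter) are the standard ones.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLocalData {
  // Returns the numbers in 0 until 10 that appear in the broadcast lookup set.
  def evenIds(spark: SparkSession): Array[Long] = {
    val sc = spark.sparkContext
    val df = spark.range(100).toDF()

    // Bring the rows to the driver as plain local values first.
    val evenLookup: Set[Long] =
      df.collect().map(_.getLong(0)).filter(_ % 2 == 0).toSet

    // Broadcast the local Set, not the DataFrame.
    val lookupBc = sc.broadcast(evenLookup)

    sc.parallelize(0L until 10L)
      .filter(x => lookupBc.value.contains(x)) // runs on the executors
      .collect()
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastLocalData")
      .master("local[*]")
      .getOrCreate()
    try println(evenIds(spark).mkString(", "))
    finally spark.stop()
  }
}
```

If the data is too large to collect, a join (optionally with the functions.broadcast() hint) is usually the better fit than a broadcast variable.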
I want to continuously process the rows of a streaming Dataset (originally read from Kafka): based on a condition, I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command, a stream of type org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long], which expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(record: Row) = {
    val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
    sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
  }

  override def close(errorOrNull: Throwable): Unit = {}
}

val query = lastContacts
  .writeStream
  .foreach(new MyStreamProcessor())
  .start()

query.awaitTermination()
I receive a huge stack trace, the relevant part of which (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid it? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks
SparkContext is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use the Spark context within the process method:
override def process(record: Row) = {
  val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
  sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
}
To send data to Redis, you need to create your own connection: open it in the open method and then use it in the process method.
Take a look at how to create a Redis connection pool: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
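A sketch of what such a writer could look like, assuming the plain Jedis client (not mentioned in the question) is on the classpath rather than the spark-redis connection pool linked above. The class name RedisHashWriter and the host/port parameters are hypothetical; the row layout [serialNumber, lastModified] is taken from the question.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

// Only host and port are serialized with the writer; the connection itself
// is created on the executor, after open(...) has been called.
class RedisHashWriter(host: String, port: Int) extends ForeachWriter[Row] {

  // Never shipped from the driver: it is still null when the task receives its copy.
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true // process this partition
  }

  override def process(record: Row): Unit = {
    // Hash key = serialNumber, field "lastContact" = lastModified.
    jedis.hset(record(0).toString, "lastContact", record(1).toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}
```

It would be plugged in as lastContacts.writeStream.foreach(new RedisHashWriter("localhost", 6379)).start(), with the host and port replaced by those of your Redis instance. No SparkContext is referenced inside the writer, so nothing non-serializable is captured.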
I want to get all the IPs of executors at runtime, which API in Spark should I use? Or any other method to get IPs at runtime?
You should use the SparkListener abstract class and intercept two executor-specific events: SparkListenerExecutorAdded and SparkListenerExecutorRemoved.
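import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
import scala.collection.mutable

// The class name is illustrative; register an instance with sc.addSparkListener(...).
class ExecutorHostListener extends SparkListener {
  // Executor id -> host, maintained on the driver.
  val executors = mutable.Map.empty[String, String]

  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
    val execId = executorAdded.executorId
    val host = executorAdded.executorInfo.executorHost
    executors += (execId -> host)
    println(s">>> executor id=$execId added on host=$host")
  }

  override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = {
    val execId = executorRemoved.executorId
    val host = executors.remove(execId).getOrElse("Host unknown")
    println(s">>> executor id=$execId removed from host=$host")
  }
}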
The entire working project is in my Spark Executor Monitor Project.
There is a class in Apache Spark, ExecutorInfo, which has a method executorHost() that returns the executor's host.
I have the following task ahead of me.
The user provides a set of IP addresses in a config file when executing the spark-submit command.
Let's say the array looks like this:
val ips = Array(1,2,3,4,5)
There can be up to 100,000 values in the array.
For every element in the array, I should read data from Cassandra, perform some computation, and insert the data back into Cassandra.
If I do:
ips.foreach { ip =>
  // - read data from Cassandra for the specific "ip"; for each IP there is a different amount of data to read
  //   (within the functions I determine the start and end date for each IP)
  // - process it
  // - save it back to Cassandra
}
this works fine.
However, I believe that the process runs sequentially; I don't exploit parallelism.
On the other hand if I do:
val IPRdd = sc.parallelize(Array(1,2,3,4,5))
IPRdd.foreach { ip =>
  // - read data from Cassandra (I need to use the Spark context to make the query)
  // - process it
  // - save it back to Cassandra
}
I get a serialization exception, because Spark is trying to serialize the Spark context, which is not serializable.
How can I make this work while still exploiting parallelism?
Thanks
Edited
This is the exception I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:59)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:54)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$.main(WibeeeBatchJob.scala:54)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob.main(WibeeeBatchJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@311ff287)
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, <function1>)
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
The easiest thing to do is to use the Spark Cassandra Connector, which can handle connection pooling and serialization.
With that you could do something like:
sc.parallelize(inputData, numTasks)
  .mapPartitions { it =>
    val con = CassandraConnector(yourConf)
    con.withSessionDo { session =>
      // Use the session
    }
    // Do any other processing
  }.saveToCassandra("ks", "table")
This would be a completely manual operation of a Cassandra connection. The sessions are all automatically pooled and cached, and if you prepare a statement, it will be cached on the executor as well.
If you would like to use more built-in methods, there is also joinWithCassandraTable, which may work in your situation.
sc.parallelize(inputData, numTasks)
  .joinWithCassandraTable("ks", "table") // Retrieves all records for which input data is the primary key
  .map(/* manipulate returned results if needed */)
  .saveToCassandra("ks", "table")
Given the following case class:
case class User(name:String, age:Int)
An RDD is created from a List of User instances.
The following code filters the RDD to keep only users above the age of 50:
trait Process {
  def test {
    val rdd = ... // create RDD
    rdd.filter(_.age > 50)
  }
}
In order to add logging, a separate validate function is created and passed to the filter, as follows:
trait Process {
  // Must return Boolean to be usable as a filter predicate.
  def validate(user: User): Boolean = {
    if (user.age > 50) {
      true
    }
    else {
      println("FAILED VALIDATION")
      false
    }
  }
  def test {
    val rdd = ... // create RDD
    rdd.filter(validate)
  }
}
The following exception is thrown:
org.apache.spark.SparkException: Task not serializable
The code works by making the class in which the validate function is defined serializable:
trait Process extends Serializable
Is this the correct way to handle the Task not serializable exception, or is there a performance degradation to using serialization within Spark? Are there any better ways to do this?
Thanks
is there a performance degradation to using serialization within Spark
Task serialization (as opposed to data serialization, that occurs when shuffling / collecting data) is rarely noticeable performance-wise, as long as the serialized objects are small. Task serialization occurs once per task (regardless of the amount of data processed).
In this case (serializing the Process instance), the performance impact would probably be negligible since it's a small object.
The risk with this assumption ("Process is small, so it's OK") is that over time, Process might change: it would be easy for developers not to notice that this class gets serialized, so they might add members that would make this slower.
Are there any better ways to do this
You can avoid serialization completely by using static methods - methods of objects instead of classes. In this case, you can create a companion object for Process:
import Process._

trait Process {
  def test {
    val rdd = ... // create RDD
    rdd.filter(validate)
  }
}

object Process {
  def validate(user: User): Boolean = {
    if (user.age > 50) {
      true
    } else {
      println("FAILED VALIDATION")
      false
    }
  }
}
Objects are "static", so Spark can use them without serialization.
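The difference can be checked without Spark by trying to Java-serialize the two kinds of predicate. This is only a sketch under stated assumptions: InstanceProcess, ObjectProcess, Validation and CaptureCheck are hypothetical names, neither class is Serializable on purpose, and the behavior shown assumes Scala 2.12 or 2.13, where a lambda that does not touch `this` captures nothing.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

case class User(name: String, age: Int)

// Not Serializable on purpose, like the original Process trait.
class InstanceProcess {
  def validate(user: User): Boolean = user.age > 50
  // Calls an instance method, so the function must hold a reference to `this`.
  def predicate: User => Boolean = user => validate(user)
}

object Validation {
  def validate(user: User): Boolean = user.age > 50
}

// Also not Serializable, but its predicate never touches `this`.
class ObjectProcess {
  def predicate: User => Boolean = user => Validation.validate(user)
}

object CaptureCheck {
  // True if Java serialization accepts the function object.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(s"instance-method predicate serializes: ${serializes(new InstanceProcess().predicate)}")
    println(s"object-method predicate serializes:   ${serializes(new ObjectProcess().predicate)}")
  }
}
```

The first predicate drags the enclosing instance into the task, which is what makes Spark complain; the second only refers to the object, which each executor JVM loads on its own.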