org.apache.spark.SparkException: Task not serializable - multithreading

When I implemented my own partitioner and tried to shuffle the original RDD, I ran into a problem. I know this is caused by referring to functions that are not serializable, but even after adding
extends Serializable
to every relevant class, the problem still exists. What should I do?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
object STRPartitioner extends Serializable {
  def apply(expectedParNum: Int,
            sampleRate: Double,
            originRdd: RDD[Vertex]): Unit = {
    val bound = computeBound(originRdd)
    val rdd = originRdd.mapPartitions(
      iter => iter.map(row => {
        val cp = row
        (cp.coordinate, cp.copy())
      })
    )
    val partitioner = new STRPartitioner(expectedParNum, sampleRate, bound, rdd)
    val shuffled = new ShuffledRDD[Coordinate, Vertex, Vertex](rdd, partitioner)
    shuffled.setSerializer(new KryoSerializer(new SparkConf(false)))
    val result = shuffled.collect()
  }
}

class STRPartitioner(expectedParNum: Int,
                     sampleRate: Double,
                     bound: MBR,
                     rdd: RDD[_ <: Product2[Coordinate, Vertex]])
  extends Partitioner with Serializable {
  ...
}

I just solved the problem! Add -Dsun.io.serialization.extendedDebugInfo=true to your JVM options and the exception will point you to the unserializable class!
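For context, here is a minimal, Spark-free sketch of what that flag does (the class names are hypothetical). Run it with -Dsun.io.serialization.extendedDebugInfo=true and the NotSerializableException message gains a field-by-field reference chain naming the offending class, which is exactly the detail missing from Spark's "Task not serializable" error.

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class NotOk                                          // deliberately not Serializable
class Wrapper(val inner: NotOk) extends Serializable

object DebugFlagDemo {
  def main(args: Array[String]): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(new Wrapper(new NotOk))
    catch {
      // With the flag set, the printed exception includes lines such as:
      //   - field (class "Wrapper", name: "inner", type: "class NotOk")
      case e: NotSerializableException => e.printStackTrace()
    }
  }
}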

Related

Why does Spark Dataset.map require all parts of the query to be serializable?

I would like to use the Dataset.map function to transform the rows of my dataset. The sample looks like this:
val result = testRepository.readTable(db, tableName)
  .map(testInstance.doSomeOperation)
  .count()
where testInstance is an instance of a class that extends java.io.Serializable, but testRepository does not. The code throws the following error:
Job aborted due to stage failure.
Caused by: NotSerializableException: TestRepository
Question
I understand why testInstance.doSomeOperation needs to be serializable, since it's inside the map and will be distributed to the Spark workers. But why does testRepository need to be serialized? I don't see why that is necessary for the map. Changing the definition to class TestRepository extends java.io.Serializable solves the issue, but that is not desirable in the larger context of the project.
Is there a way to make this work without making TestRepository serializable, or why is it required to be serializable?
Minimal working example
Here's a full example with the code from both classes that reproduces the NotSerializableException:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class MyTableSchema(id: String, key: String, value: Double)

val db = "temp_autodelete"
val tableName = "serialization_test"

class TestRepository {
  // `spark` is the ambient SparkSession (e.g. spark-shell or a notebook)
  def readTable(database: String, tableName: String): Dataset[MyTableSchema] = {
    spark.table(f"$database.$tableName")
      .as[MyTableSchema]
  }
}
val testRepository = new TestRepository()

class TestClass() extends java.io.Serializable {
  def doSomeOperation(row: MyTableSchema): MyTableSchema = {
    row
  }
}
val testInstance = new TestClass()

val result = testRepository.readTable(db, tableName)
  .map(testInstance.doSomeOperation)
  .count()
The reason is that your map operation operates on something that already lives on the executors.
If you look at your pipeline:
val result = testRepository.readTable(db, tableName)
  .map(testInstance.doSomeOperation)
  .count()
The first thing you do is testRepository.readTable(db, tableName). If we look inside the readTable method, we see that it performs a spark.table operation. The API docs give the following signature for that method:
def table(tableName: String): DataFrame
This is not an operation that takes place solely on the driver (imagine reading a file of more than 1 TB on the driver alone), and it creates a DataFrame, which is itself a distributed dataset. That means the testRepository.readTable(db, tableName) call needs to be distributed, and so your testRepository object needs to be distributed as well.
Hope this helps you!
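As a sketch of one way to approach the follow-up question (avoiding serialization of TestRepository): let the lambda close over local values only, so the repository object never enters the closure. The countRows helper below is hypothetical, reuses the names from the example above, and assumes spark.implicits._ is in scope for the MyTableSchema encoder.

// Hypothetical helper: bind what the lambda needs to local values so the
// closure references neither testRepository nor any enclosing object.
def countRows(repo: TestRepository, op: TestClass,
              database: String, table: String): Long = {
  val ds = repo.readTable(database, table)                    // runs on the driver only
  val f: MyTableSchema => MyTableSchema = op.doSomeOperation  // captures only `op`
  ds.map(f).count()                                           // `repo` stays out of the closure
}

val result = countRows(testRepository, testInstance, db, tableName)

Whether this is sufficient depends on how the surrounding notebook cell or class wraps these values, but the shipped closure then only has to carry the already-serializable TestClass instance.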

Empty set after collectAsList, even though it is not empty inside the transformation operator

I am trying to figure out if I can work with Kotlin and Spark,
and use the former's data classes instead of Scala's case classes.
I have the following data class:
data class Transaction(var context: String = "", var epoch: Long = -1L, var items: HashSet<String> = HashSet()) :
    Serializable {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }
}
And the relevant part of the main routine looks like this:
val transactionEncoder = Encoders.bean(Transaction::class.java)

val transactions = inputDataset
    .groupByKey(KeyExtractor(), KeyExtractor.getKeyEncoder())
    .mapGroups(TransactionCreator(), transactionEncoder)
    .collectAsList()

transactions.forEach { println("collected Transaction=$it") }
With TransactionCreator defined as:
class TransactionCreator : MapGroupsFunction<Tuple2<String, Timestamp>, Row, Transaction> {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }

    override fun call(key: Tuple2<String, Timestamp>, values: MutableIterator<Row>): Transaction {
        val seq = generateSequence { if (values.hasNext()) values.next().getString(2) else null }
        val items = seq.toCollection(HashSet())
        return Transaction(key._1, key._2.time, items).also { println("inside call Transaction=$it") }
    }
}
However, I think I'm running into some sort of serialization problem,
because the set ends up empty after collection.
I see the following output:
inside call Transaction=Transaction(context=context1, epoch=1000, items=[c])
inside call Transaction=Transaction(context=context1, epoch=0, items=[a, b])
collected Transaction=Transaction(context=context1, epoch=0, items=[])
collected Transaction=Transaction(context=context1, epoch=1000, items=[])
I've tried a custom KryoRegistrator to see if it was a problem with Kotlin's HashSet:
class MyRegistrator : KryoRegistrator {
override fun registerClasses(kryo: Kryo) {
kryo.register(HashSet::class.java, JavaSerializer()) // kotlin's HashSet
}
}
But it doesn't seem to help.
Any other ideas?
Full code here.
It does seem to be a serialization issue.
The documentation of Encoders.bean states (Spark v2.4.0):
collection types: only array and java.util.List currently, map support is in progress
Porting the Transaction data class to Java and changing items to a java.util.List seems to help.

Spark Scala custom UnaryTransformer in a pipeline fails on read of persisted pipeline model

I have developed this simple LogTransformer by extending UnaryTransformer to apply a log transformation to the "age" column of a DataFrame. I am able to apply the transformer, include it as a pipeline stage, and persist the pipeline model after training.
class LogTransformer(override val uid: String)
  extends UnaryTransformer[Int, Double, LogTransformer] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("logTransformer"))

  override protected def createTransformFunc: Int => Double = (feature: Int) => Math.log10(feature)

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == DataTypes.IntegerType, s"Input type must be integer type but got $inputType.")
  }

  override protected def outputDataType: DataType = DataTypes.DoubleType

  override def copy(extra: ParamMap): LogTransformer = defaultCopy(extra)
}

object LogTransformer extends DefaultParamsReadable[LogTransformer]
But when I read the persisted model I get the following exception.
val MODEL_PATH = "model/census_pipeline_model"
cvModel.bestModel.asInstanceOf[PipelineModel].write.overwrite.save(MODEL_PATH)
val same_pipeline_model = PipelineModel.load(MODEL_PATH)
exception in thread "main" java.lang.NoSuchMethodException: dsml.Census$LogTransformer$2.read()
at java.lang.Class.getMethod(Class.java:1786)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
at dsml.Census$.main(Census.scala:572)
at dsml.Census.main(Census.scala)
Any pointers on how to fix that would be helpful. Thank You.

How to convert a Kotlin data class object to map?

Is there any easy way or any standard library method to convert a Kotlin data class object to a map/dictionary of its properties by property names? Can reflection be avoided?
I was using the Jackson method, but it turns out the performance of this is terrible on Android for the first serialization (github issue here), and it's dramatically worse on older Android versions (see benchmarks here).
But you can do this much faster with Gson. Conversion in both directions is shown here:
import com.google.gson.Gson
import com.google.gson.reflect.TypeToken

val gson = Gson()

// convert a data class to a map
fun <T> T.serializeToMap(): Map<String, Any> {
    return convert()
}

// convert a map to a data class
inline fun <reified T> Map<String, Any>.toDataClass(): T {
    return convert()
}

// convert an object of type I to type O
inline fun <I, reified O> I.convert(): O {
    val json = gson.toJson(this)
    return gson.fromJson(json, object : TypeToken<O>() {}.type)
}

// example usage
data class Person(val name: String, val age: Int)

fun main() {
    val person = Person("Tom Hanley", 99)

    val map = mapOf(
        "name" to "Tom Hanley",
        "age" to 99
    )

    val personAsMap: Map<String, Any> = person.serializeToMap()
    val mapAsPerson: Person = map.toDataClass()
}
This extension function uses reflection, but maybe it'll help someone like me coming across this in the future:
import kotlin.reflect.full.memberProperties

inline fun <reified T : Any> T.asMap(): Map<String, Any?> {
    val props = T::class.memberProperties.associateBy { it.name }
    return props.keys.associateWith { props[it]?.get(this) }
}
I had the same use case today, for testing, and ended up using the Jackson object mapper to convert a Kotlin data class into a Map. Runtime performance is not a big concern in my case. I haven't checked in detail, but I believe it uses reflection under the hood; that is not a concern for me since it happens behind the scenes.
For Example,
val dataclass = DataClass(p1 = 1, p2 = 2)
val dataclassAsMap = objectMapper.convertValue(
    dataclass,
    object : TypeReference<Map<String, Any>>() {}
)
// expect dataclassAsMap == mapOf("p1" to 1, "p2" to 2)
kotlinx.serialization has an experimental Properties format that makes it very simple to convert Kotlin classes into maps and vice versa:
@ExperimentalSerializationApi
@kotlinx.serialization.Serializable
data class Category constructor(
    val id: Int,
    val name: String,
    val icon: String,
    val numItems: Long
) {
    // the map representation of this class
    val asMap: Map<String, Any> by lazy { Properties.encodeToMap(this) }

    companion object {
        // factory to create Category from a map
        fun from(map: Map<String, Any>): Category =
            Properties.decodeFromMap(map)
    }
}
The closest you can get is with delegated properties stored in a map.
Example (from link):
class User(val map: Map<String, Any?>) {
    val name: String by map
    val age: Int by map
}
Using this with data classes may not work very well, however.
Kpropmap is a reflection-based library that attempts to make working with Kotlin data classes and Maps easier. It has the following capabilities that are relevant:
Can transform maps to and from data classes, though note that if all you need is converting from a data class to a Map, just use reflection directly as per @KenFehling's answer.
data class Foo(val a: Int, val b: Int)
// Data class to Map
val propMap = propMapOf(foo)
// Map to data class
val foo1 = propMap.deserialize<Foo>()
Can read and write Map data in a type-safe way by using the data class's KProperty objects for type information.
Given a data class and a Map, can do other neat things like detect changed values and extraneous Map keys that don't have corresponding data class properties.
Represent "partial" data classes (kind of like lenses). For example, say your backend model contains a Foo with 3 required immutable properties represented as vals. However, you want to provide an API to patch Foo instances. As it is a patch, the API consumer will only send the updated properties. The REST API layer for this obviously cannot deserialize directly to the Foo data class, but it can accept the patch as a Map. Use kpropmap to validate that the Map has the correct types, and apply the changes from the Map to a copy of the model instance:
data class Foo(val a: Int, val b: Int, val c: Int)
val f = Foo(1, 2, 3)
val p = propMapOf("b" to 5)
val f1 = p.applyProps(f) // f1 = Foo(1, 5, 3)
Disclaimer: I am the author.

org.apache.spark.SparkException: Task not serializable - When using an argument

I get a Task not serializable error when attempting to use an input parameter in a map:
val errors = inputRDD.map {
  case (itemid, itemVector, userid, userVector, rating) =>
    (itemid, itemVector, userid, userVector, rating,
      ((rating - userVector.dot(itemVector)) * itemVector) - h4 * userVector)
}
I pass h4 in with the arguments for the class.
The map is inside a method, and it works fine if, before the map transformation, I put:
val h4 = h4
If I don't do this, or if I put it outside the method, it doesn't work and I get Task not serializable. Why does this occur? Other vals I create for the class outside the method work fine within the method, so why does a val fail when it is initialized from an input parameter/argument?
The error indicates that the class to which h4 belongs is not Serializable.
Here is a similar example:
class ABC(h: Int) {
  def test(s: SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}

new ABC(3).test(sc)
// org.apache.spark.SparkException: Job aborted due to stage failure:
//   Task not serializable: java.io.NotSerializableException:
//   $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ABC
When this.h is used in an RDD transformation, this becomes part of the closure, which then gets serialized.
Making the class Serializable works as expected:
class ABC(h: Int) extends Serializable {
  def test(s: SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}

new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
So does removing the reference to this in the RDD transformation by assigning it to a local variable in the method:
class ABC(h: Int) {
  def test(s: SparkContext) = {
    val x = h
    s.parallelize(0 to 5).filter(_ > x).collect
  }
}

new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
You can use a broadcast variable. It broadcasts the data from your variable to all of your workers. For more detail, visit this link.
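A minimal sketch of that approach, reusing the names from the question and assuming a SparkContext named sc is in scope; the closure then carries only the broadcast handle rather than the enclosing class:

// Broadcast h4 once from the driver; each task reads it via bcastH4.value.
val bcastH4 = sc.broadcast(h4)

val errors = inputRDD.map {
  case (itemid, itemVector, userid, userVector, rating) =>
    (itemid, itemVector, userid, userVector, rating,
      ((rating - userVector.dot(itemVector)) * itemVector) - bcastH4.value * userVector)
}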
