org.apache.spark.SparkException: Task not serializable - When using an argument - apache-spark

I get a Task not serializable error when attempting to use an input parameter in a map:
val errors = inputRDD.map {
  case (itemid, itemVector, userid, userVector, rating) =>
    (itemid, itemVector, userid, userVector, rating,
      ((rating - userVector.dot(itemVector)) * itemVector) - h4 * userVector)
}
I pass h4 in with the class's constructor arguments.
The map is in a method, and it works fine if, before the map transformation, I put:
val h4 = h4
If I don't do this, or if I put it outside the method, it doesn't work and I get Task not serializable. Why does this occur? Other vals I create for the class outside the method work inside the method, so why does a val that comes from an input parameter/argument not?

The error indicates that the class to which h4 belongs is not Serializable.
Here is a similar example:
class ABC(h: Int) {
  def test(s: SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}

new ABC(3).test(sc)
// org.apache.spark.SparkException: Job aborted due to stage failure:
// Task not serializable: java.io.NotSerializableException:
// $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ABC
When this.h is used in an RDD transformation, this becomes part of the closure that gets serialized.
Making the class Serializable works as expected:
class ABC(h: Int) extends Serializable {
  def test(s: SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}

new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
So does removing the reference to this in the RDD transformation, by defining a local variable in the method:
class ABC(h: Int) {
  def test(s: SparkContext) = {
    val x = h
    s.parallelize(0 to 5).filter(_ > x).collect
  }
}

new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
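Applied to the code in the question, the same local-copy fix would look roughly like the sketch below (the method name computeErrors is made up for illustration; inputRDD, the vector type and h4 are assumed to be the ones described in the question):
def computeErrors() = {
  // Copy the constructor parameter into a local val so the closure captures
  // only this value instead of the whole (non-serializable) enclosing class.
  val localH4 = h4
  inputRDD.map {
    case (itemid, itemVector, userid, userVector, rating) =>
      (itemid, itemVector, userid, userVector, rating,
        ((rating - userVector.dot(itemVector)) * itemVector) - localH4 * userVector)
  }
}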

You can use a broadcast variable. It broadcasts the data from your variable to all your workers. For more detail, visit this link.
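A minimal sketch of that approach, assuming h4, inputRDD and the SparkContext sc are the ones from the question (broadcasting pays off mainly for large values; for a single number the local-copy fix above is usually enough):
// Broadcast h4 once from the driver; the closure then captures only the small
// broadcast handle, and each executor reads the value with .value.
val h4Broadcast = sc.broadcast(h4)

val errors = inputRDD.map {
  case (itemid, itemVector, userid, userVector, rating) =>
    (itemid, itemVector, userid, userVector, rating,
      ((rating - userVector.dot(itemVector)) * itemVector) - h4Broadcast.value * userVector)
}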

Related

Create immutable objects using HashMap in Groovy

Let's say I have a Groovy class Payment with a static method payment that creates a Payment from a HashMap. This gives me nice flexibility in terms of which parameters I want to override in a given context. It's super useful for testing purposes.
import java.time.LocalDateTime

class Payment {
    private final BigDecimal amount
    private final String currency
    private final LocalDateTime occurred

    Payment() {
        this.amount = null
        this.currency = null
        this.occurred = null
    }

    Payment(BigDecimal amount, String currency, LocalDateTime occurred) {
        this.amount = amount
        this.currency = currency
        this.occurred = occurred
    }

    static Payment payment(Map params = [:]) {
        def defaults = [
            amount  : 500.00,
            currency: 'EUR',
            occurred: LocalDateTime.now(),
        ]
        new Payment(defaults << params)
    }
}
The current problem with this class is that whenever I call the payment method, it returns groovy.lang.ReadOnlyPropertyException: Cannot set readonly property: amount for class: Payment.
In order to make this work, I have to break the class's immutability by removing the final keywords.
Is there some way to keep immutability and still create objects from a Map in Groovy?
If there is no map constructor, Groovy unrolls that operation into
def obj = new Payment()
obj.amount = map.amount
...
which then gives you the error you are seeing. Adding a MapConstructor annotation to your class should already fix that. Yet there is an even better annotation that does that and much more: Immutable.
import groovy.transform.Immutable
import java.time.Instant

@Immutable
class Payment {
    BigDecimal amount
    String currency
    Instant ts

    static Payment payment(Map params = [:]) {
        new Payment([amount: 500, currency: 'EUR', ts: Instant.now()] << params)
    }
}

println Payment.payment(amount: 42)
// → Payment(42, EUR, 2021-02-26T18:22:09.226109Z)
To use the tuple constructor from the factory method you have created, you can use getAt:
def m = defaults << params
new Payment(m['amount'], m['currency'], m['occurred'])
or skip the left shift and use params.getOrDefault('amount', 500g)
But you're better off using @Immutable

Empty set after collectAsList, even though it is not empty inside the transformation operator

I am trying to figure out if I can work with Kotlin and Spark,
and use the former's data classes instead of Scala's case classes.
I have the following data class:
data class Transaction(var context: String = "", var epoch: Long = -1L, var items: HashSet<String> = HashSet()) :
    Serializable {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }
}
And the relevant part of the main routine looks like this:
val transactionEncoder = Encoders.bean(Transaction::class.java)
val transactions = inputDataset
    .groupByKey(KeyExtractor(), KeyExtractor.getKeyEncoder())
    .mapGroups(TransactionCreator(), transactionEncoder)
    .collectAsList()
transactions.forEach { println("collected Transaction=$it") }
With TransactionCreator defined as:
class TransactionCreator : MapGroupsFunction<Tuple2<String, Timestamp>, Row, Transaction> {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }

    override fun call(key: Tuple2<String, Timestamp>, values: MutableIterator<Row>): Transaction {
        val seq = generateSequence { if (values.hasNext()) values.next().getString(2) else null }
        val items = seq.toCollection(HashSet())
        return Transaction(key._1, key._2.time, items).also { println("inside call Transaction=$it") }
    }
}
However, I think I'm running into some sort of serialization problem,
because the set ends up empty after collection.
I see the following output:
inside call Transaction=Transaction(context=context1, epoch=1000, items=[c])
inside call Transaction=Transaction(context=context1, epoch=0, items=[a, b])
collected Transaction=Transaction(context=context1, epoch=0, items=[])
collected Transaction=Transaction(context=context1, epoch=1000, items=[])
I've tried a custom KryoRegistrator to see if it was a problem with Kotlin's HashSet:
class MyRegistrator : KryoRegistrator {
    override fun registerClasses(kryo: Kryo) {
        kryo.register(HashSet::class.java, JavaSerializer()) // kotlin's HashSet
    }
}
But it doesn't seem to help.
Any other ideas?
Full code here.
It does seem to be a serialization issue.
The documentation of Encoders.bean states (Spark v2.4.0):
collection types: only array and java.util.List currently, map support is in progress
Porting the Transaction data class to Java and changing items to a java.util.List seems to help.

Spark Java API Task not serializable when not using Lambda

I am seeing a behavior in Spark (2.2.0) that I do not understand, but I am guessing it is related to lambdas and anonymous classes, when trying to extract out a lambda function:
This works:
public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        return events.filter( ( FilterFunction< String > ) x -> x.length() > 3 );
    }
}
Yet this does not:
public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        FilterFunction< String > filter = new FilterFunction< String >() {
            @Override public boolean call( String value ) throws Exception
            {
                return value.length() > 3;
            }
        };
        return events.filter( filter );
    }
}
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) ...
...
Caused by: java.io.NotSerializableException: ...EventsFilter
Serialization stack:
    - object not serializable (class: ...EventsFilter, value: ...EventsFilter@e521067)
    - field (class: ...EventsFilter$1, name: this$0, type: class ...EventsFilter)
    - object (class ...EventsFilter$1, ...EventsFilter$1@5c70d7f0)
    - element of array (index: 1)
    - array (class [Ljava.lang.Object;, size 4)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
I am testing against:
@Test
public void test()
{
    EventsFilter filter = new EventsFilter();
    Dataset<String> input = SparkSession.builder().appName( "test" ).master( "local" ).getOrCreate()
            .createDataset( Arrays.asList( "123", "123", "3211" ),
                            Encoders.kryo( String.class ) );
    Dataset<String> res = filter.filter( input );
    assertThat( res.count(), is( 1l ) );
}
Even weirder, when put in a static main, both seem to work...
How is defining the function explicitly inside a method causing that sneaky 'this' reference serialization?
Java's inner classes hold a reference to the outer class. Your outer class is not serializable, so the exception is thrown.
Lambdas do not hold that reference if it is not used, so there is no problem with a non-serializable outer class. More here
I was under the false impression that lambdas are implemented under the hood as inner classes. This is no longer the case (very helpful talk).
Also, as T. Gawęda answered, inner classes do in fact hold a reference to the outer class, even if it is not needed (here). This difference explains the behavior.

org.apache.spark.SparkException: Task not serializable, wh

When I implemented my own partitioner and tried to shuffle the original RDD, I encountered a problem. I know this is caused by referring to functions that are not Serializable, but, after adding
extends Serializable
to every relevant class, this problem still exists. What should I do?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
object STRPartitioner extends Serializable {
  def apply(expectedParNum: Int,
            sampleRate: Double,
            originRdd: RDD[Vertex]): Unit = {
    val bound = computeBound(originRdd)
    val rdd = originRdd.mapPartitions(
      iter => iter.map(row => {
        val cp = row
        (cp.coordinate, cp.copy())
      })
    )
    val partitioner = new STRPartitioner(expectedParNum, sampleRate, bound, rdd)
    val shuffled = new ShuffledRDD[Coordinate, Vertex, Vertex](rdd, partitioner)
    shuffled.setSerializer(new KryoSerializer(new SparkConf(false)))
    val result = shuffled.collect()
  }
}

class STRPartitioner(expectedParNum: Int,
                     sampleRate: Double,
                     bound: MBR,
                     rdd: RDD[_ <: Product2[Coordinate, Vertex]])
  extends Partitioner with Serializable {
  ...
}
I just solved the problem! Add -Dsun.io.serialization.extendedDebugInfo=true to your VM config, and you will pinpoint the unserializable class!
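For anyone wondering where that flag goes: the closure is serialized on the driver, so the property has to reach the driver JVM. A sketch with spark-submit (the class and jar names are placeholders; setting spark.driver.extraJavaOptions in spark-defaults.conf is an equivalent route):
spark-submit \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --class your.main.Class your-app.jar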

Avoid "Task not serialisable" with nested method in a class

I understand the usual "Task not serializable" issue that arises when accessing a field or a method that is out of scope of a closure.
To fix it, I usually define a local copy of these fields/methods, which avoids the need to serialize the whole class:
class MyClass(val myField: Any) {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")
    val myField = this.myField
    println(f.map( _ + myField ).count)
  }
}
Now, if I define a nested function in the run method, it cannot be serialized:
class MyClass(val myField: Any) {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")
    def mapFn(line: String) = line.split(";")
    val myField = this.myField
    println(f.map( mapFn( _ ) ).count)
  }
}
I don't understand since I thought "mapFn" would be in scope...
Even stranger, if I define mapFn to be a val instead of a def, then it works:
class MyClass() {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")
    val mapFn = (line: String) => line.split(";")
    println(f.map( mapFn( _ ) ).count)
  }
}
Is this related to the way Scala represents nested functions?
What's the recommended way to deal with this issue ?
Avoid nested functions?
Isn't it working in such a way that, in the first case, f.map(mapFn(_)) is equivalent to f.map(new Function() { override def apply(...) = mapFn(...) }), while in the second it is just f.map(mapFn)? When you declare a method with def, it is probably just a method of some anonymous class with an implicit $outer reference to the enclosing class. But map requires a Function, so the compiler needs to wrap it. In that wrapper you refer to a method of that anonymous class, but not to the instance itself. If you use val, you have a direct reference to the function, which you pass to map. I'm not sure about this, just thinking out loud...
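For what it's worth, one way to keep a named helper while sidestepping the question entirely is to move it into a standalone object, so the closure never references the enclosing class. A sketch under that assumption (LineOps is a made-up name; sc and the file path are the ones from the question):
object LineOps extends Serializable {
  def mapFn(line: String): Array[String] = line.split(";")
}

class MyClass() {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")
    // The closure refers only to the top-level LineOps object, not to MyClass,
    // so MyClass no longer needs to be serialized.
    println(f.map(LineOps.mapFn).count)
  }
}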

Resources