Spark Java API Task not serializable when not using Lambda - apache-spark

I am seeing a behavior in Spark (2.2.0) that I do not understand, but I am guessing it is related to lambdas and anonymous classes, when trying to extract out a lambda function:
This works:
public class EventsFilter
{
public Dataset< String > filter( Dataset< String > events )
{
return events.filter( ( FilterFunction< String > ) x -> x.length() > 3 );
}
}
Yet this does not:
public class EventsFilter
{
public Dataset< String > filter( Dataset< String > events )
{
FilterFunction< String > filter = new FilterFunction< String >(){
@Override public boolean call( String value ) throws Exception
{
return value.length() > 3;
}
};
return events.filter( filter );
}
}
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) ...
...
Caused by: java.io.NotSerializableException: ...EventsFilter
..Serialization stack:
- object not serializable (class: ...EventsFilter,
value:...EventsFilter@e521067)
- field (class: .EventsFilter$1, name: this$0, type: class ..EventsFilter)
. - object (class ...EventsFilter$1, ..EventsFilter$1@5c70d7f0)
. - element of array (index: 1)
- array (class [Ljava.lang.Object;, size 4)
- field (class:
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
I am testing against:
@Test
public void test()
{
EventsFilter filter = new EventsFilter();
Dataset<String> input = SparkSession.builder().appName( "test" ).master( "local" ).getOrCreate()
.createDataset( Arrays.asList( "123" , "123" , "3211" ) ,
Encoders.kryo( String.class ) );
Dataset<String> res = filter.filter( input );
assertThat( res.count() , is( 1l ) );
}
Even weirder, when put in a static main, both seem to work...
How is defining the function explicitly inside a method causing that sneaky 'this' reference serialization?

Java's inner classes hold a reference to the outer class. Your outer class is not serializable, so the exception is thrown.
Lambdas do not hold that reference if it is not used, so there is no problem with a non-serializable outer class. More here
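For illustration, here is a minimal sketch of two ways to work around this (the class names SerializableEventsFilter and StaticNestedEventsFilter are made up for the example): either make the enclosing class Serializable so the captured this$0 reference can be serialized with it, or move the function into a static nested class, which, like a lambda that never touches this, holds no reference to the enclosing instance.
import java.io.Serializable;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;

// Option 1 (illustrative): make the enclosing class Serializable, so the
// anonymous class's captured this$0 reference can be serialized along with it.
class SerializableEventsFilter implements Serializable
{
    public Dataset< String > filter( Dataset< String > events )
    {
        FilterFunction< String > filter = new FilterFunction< String >(){
            @Override public boolean call( String value ) throws Exception
            {
                return value.length() > 3;
            }
        };
        return events.filter( filter );
    }
}

// Option 2 (illustrative): use a static nested class. FilterFunction already
// extends Serializable, and a static nested class captures no outer instance.
class StaticNestedEventsFilter
{
    private static class LengthFilter implements FilterFunction< String >
    {
        @Override public boolean call( String value ) throws Exception
        {
            return value.length() > 3;
        }
    }

    public Dataset< String > filter( Dataset< String > events )
    {
        return events.filter( new LengthFilter() );
    }
}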

I was under the false impression that lambdas are implemented under the hood as inner classes. This is no longer the case (very helpful talk).
Also, as T. Gawęda answered, inner classes do in fact hold a reference to the outer class, even if it is not needed (here). This difference explains the behavior.

Related

Does Groovy "as" operator create a subclass at run time for User defined classes?

In Groovy, when I write the below code in a Groovy script:
class Emp {
public String getId() {
return "12345";
}
}
def coercedInstance = [
getId: {
"99999"
}
] as Emp
println new Emp().getId()
println coercedInstance .getId()
Using the as Operator here, am I creating a sub class of the actual Emp class at runtime and providing the method body at run time?
I have seen other Stack Overflow articles and have learnt that Groovy uses DefaultGroovyMethods.java and DefaultTypeTransformation.java to do the coercion, but I could not figure out whether it was subclassing or not.
Yes, the as operator creates an object whose type is a subclass of the target class. Using DefaultGroovyMethods.asType(Map map, Class clazz) generates (in memory) a proxy class that extends the given base class.
class Emp {
public String getId() {
return "12345";
}
}
def coercedInstance = [
getId: {
"99999"
}
] as Emp
assert (coercedInstance instanceof Emp)
assert (coercedInstance.class != Emp)
assert (Emp.isAssignableFrom(coercedInstance.class))
println coercedInstance.dump() // <Emp1_groovyProxy@229c6181 $closures$delegate$map=[getId:coercion$_run_closure1@7bd4937b]>
What happens in your case specifically is the following:
The asType method goes to line 11816 to execute ProxyGenerator.INSTANCE.instantiateAggregateFromBaseClass(map, clazz);
In the next step, a ProxyGeneratorAdapter object gets created.
In the last step, adapter.proxy(map, constructorArgs) gets called, returning an instance of the newly generated class that proxies the base class.

How do I define a type-aware groovy DSL which can propagate a type to inner blocks?

I'm trying to implement a Groovy DSL, which takes a class name in a top-level block parameter, then allows static type checking against methods of that class in inner blocks, without needing to re-declare that class name redundantly.
(I am using Groovy v2.5.6)
For example, given this class:
// A message data type to parse
@Canonical
class MessageX {
int size
boolean hasName
// an optional field to read, depending on whether #hasName is true
String name
}
I'd want to be able to define things in the DSL something like this:
parserFor(MessageX) {
readInt 'size'
readBool 'hasName'
// #hasName is a property of MessageX
when { hasName } {
readString 'name'
}
}
An attempt at implementing this might be:
import groovy.transform.Canonical
import groovy.transform.CompileStatic
@CompileStatic
@Canonical
class MessageX {
boolean hasName
String name
int size
}
/** Generic message builder class for parsing messages */
@CompileStatic
class MessageBlock<T> {
MessageBlock(Map<String, Object> values) {
this.value = values
}
private final Map<String, Object> value
void readString(String field) {
// ...
}
void readInt(String field) {
// ..
}
void readBool(String field) {
// ...
}
/** Defines a condition */
void when(@DelegatesTo(type = "T") Closure conditionBlock, @DelegatesTo(value = MessageBlock, type = "MessageBlock<T>") Closure bodyBlock) {
// ...
}
}
@CompileStatic
class Parser {
static final <T> Closure<T> parserFor(Class<T> type, @DelegatesTo(value = MessageBlock, type = "MessageBlock<T>") Closure block) {
println "parser get type $type.name"
return {
Map<String, Object> values = [:]
MessageBlock<T> blockImpl = new MessageBlock<T>(values);
block.delegate = blockImpl
block()
return type.newInstance(values)
}
}
static void build() {
// Define a parser using our DSL
Closure<MessageX> messageXParser = parserFor(MessageX) {
readBool 'hasName'
when { hasName } {
readString 'name'
}
}
messageXParser()
}
}
Parser.build()
The documentation here suggests this should be possible with just the type = "MessageBlock<T>" tag on DelegatesTo.
However, that gives me null pointer exceptions when compiling.
Caught: BUG! exception in phase 'instruction selection' in source unit 'test.groovy' unexpected NullpointerException
BUG! exception in phase 'instruction selection' in source unit 'test.groovy' unexpected NullpointerException
Caused by: java.lang.NullPointerException
In the above example I also have the value = MessageBlock tag, which at least avoids an NPE - but I still get an error:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
test.groovy: 57: [Static type checking] - The variable [hasName] is undeclared.
@ line 57, column 20.
when { hasName } {
^
I have not yet found a way to get the #when method's closure block to correctly delegate to the MessageX class. I've tried these annotations for the second parameter to #parserFor, and various other permutations, to no avail:
DelegatesTo(MessageX)
DelegatesTo(value = MessageX, type = "T")
DelegatesTo(value = MessageX, type = "MessageX<T>")
DelegatesTo(type = "MessageX<T>")
Can anyone help?

Empty set after collectAsList, even though it is not empty inside the transformation operator

I am trying to figure out if I can work with Kotlin and Spark,
and use the former's data classes instead of Scala's case classes.
I have the following data class:
data class Transaction(var context: String = "", var epoch: Long = -1L, var items: HashSet<String> = HashSet()) :
Serializable {
companion object {
@JvmStatic
private val serialVersionUID = 1L
}
}
And the relevant part of the main routine looks like this:
val transactionEncoder = Encoders.bean(Transaction::class.java)
val transactions = inputDataset
.groupByKey(KeyExtractor(), KeyExtractor.getKeyEncoder())
.mapGroups(TransactionCreator(), transactionEncoder)
.collectAsList()
transactions.forEach { println("collected Transaction=$it") }
With TransactionCreator defined as:
class TransactionCreator : MapGroupsFunction<Tuple2<String, Timestamp>, Row, Transaction> {
companion object {
@JvmStatic
private val serialVersionUID = 1L
}
override fun call(key: Tuple2<String, Timestamp>, values: MutableIterator<Row>): Transaction {
val seq = generateSequence { if (values.hasNext()) values.next().getString(2) else null }
val items = seq.toCollection(HashSet())
return Transaction(key._1, key._2.time, items).also { println("inside call Transaction=$it") }
}
}
However, I think I'm running into some sort of serialization problem,
because the set ends up empty after collection.
I see the following output:
inside call Transaction=Transaction(context=context1, epoch=1000, items=[c])
inside call Transaction=Transaction(context=context1, epoch=0, items=[a, b])
collected Transaction=Transaction(context=context1, epoch=0, items=[])
collected Transaction=Transaction(context=context1, epoch=1000, items=[])
I've tried a custom KryoRegistrator to see if it was a problem with Kotlin's HashSet:
class MyRegistrator : KryoRegistrator {
override fun registerClasses(kryo: Kryo) {
kryo.register(HashSet::class.java, JavaSerializer()) // kotlin's HashSet
}
}
But it doesn't seem to help.
Any other ideas?
Full code here.
It does seem to be a serialization issue.
The documentation of Encoders.bean states (Spark v2.4.0):
collection types: only array and java.util.List currently, map support is in progress
Porting the Transaction data class to Java and changing items to a java.util.List seems to help.
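For reference, a minimal sketch of what that Java port might look like (the field names follow the Kotlin data class above; the rest is an assumption, not the asker's final code):
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Plain Java bean: no-arg constructor plus getters/setters, with items held as
// a java.util.List so that Encoders.bean can handle it per the quoted docs.
public class Transaction implements Serializable {
    private String context = "";
    private long epoch = -1L;
    private List<String> items = new ArrayList<>();

    public Transaction() { } // bean encoders require a no-arg constructor

    public String getContext() { return context; }
    public void setContext(String context) { this.context = context; }

    public long getEpoch() { return epoch; }
    public void setEpoch(long epoch) { this.epoch = epoch; }

    public List<String> getItems() { return items; }
    public void setItems(List<String> items) { this.items = items; }
}
The encoder would then be obtained with Encoders.bean(Transaction.class) in place of the Kotlin version.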

What is the static version of propertyMissing method in Groovy?

OK, I have tried looking and reading and am not sure I have an answer to this.
I have a utility class which wraps a static ConcurrentLinkedQueue internally.
The utility class itself adds some static methods; I don't expect to call new to create an instance of the utility.
I want to intercept the getProperty calls on the utility class and implement these internally in the class definition.
I can achieve this by adding the following to the utility class's metaClass before I use it:
UnitOfMeasure.metaClass.static.propertyMissing = {name -> println "accessed prop called $name"}
println UnitOfMeasure.'Each'
However, what I want to do is declare the interception in the class definition itself. I tried this in the class definition, but it never seems to get called:
static def propertyMissing (receiver, String propName) {
println "prop $propName, saught"
}
I also tried
static def getProperty (String prop) { println "accessed $prop"}
but this isn't called either.
So, other than adding to the metaClass in my code/script before use, how can I declare in the utility class itself that I want to capture property accesses?
The actual class I have looks like this at present:
class UnitOfMeasure {
static ConcurrentLinkedQueue UoMList = new ConcurrentLinkedQueue(["Each", "Per Month", "Days", "Months", "Years", "Hours", "Minutes", "Seconds" ])
String uom
UnitOfMeasure () {
if (!UoMList.contains(this) )
UoMList << this
}
static list () {
UoMList.toArray()
}
static getAt (index) {
def value = null
if (index in 0..(UoMList.size() -1))
value = UoMList[index]
else if (index instanceof String) {
Closure matchClosure = {it.toUpperCase().contains(index.toUpperCase())}
def position = UoMList.findIndexOf (matchClosure)
if (position != -1)
value = UoMList[position]
}
value
}
static def propertyMissing (receiver, String propName) {
println "prop $propName, saught"
}
//expects either a String or your own closure, with String will do case insensitive find
static find (match) {
Closure matchClosure
if (match instanceof Closure)
matchClosure = match
if (match instanceof String) {
matchClosure = {it.toUpperCase().contains(match.toUpperCase())}
}
def inlist = UoMList.find (matchClosure)
}
static findWithIndex (match) {
Closure matchClosure
if (match instanceof Closure)
matchClosure = match
else if (match instanceof String) {
matchClosure = {it.toUpperCase().contains(match.toUpperCase())}
}
def position = UoMList.findIndexOf (matchClosure)
position != -1 ? [UoMList[position], position] : ["Not In List", -1]
}
}
I'd appreciate the secret of doing this for a static utility class rather than instance-level property interception, and doing it in the class declaration, not by adding to the metaClass before I make the calls.
Just so you can see the actual class and the script that calls it, I've attached these below.
My script that calls the class looks like this:
println UnitOfMeasure.list()
def (uom, position) = UnitOfMeasure.findWithIndex ("Day")
println "$uom at postition $position"
// works UnitOfMeasure.metaClass.static.propertyMissing = {name -> println "accessed prop called $name"}
println UnitOfMeasure[4]
println UnitOfMeasure.'Per'
which errors like this:
[Each, Per Month, Days, Months, Years, Hours, Minutes, Seconds]
Days at postition 2
Years
Caught: groovy.lang.MissingPropertyException: No such property: Per for class: com.softwood.portfolio.UnitOfMeasure
Possible solutions: uom
groovy.lang.MissingPropertyException: No such property: Per for class: com.softwood.portfolio.UnitOfMeasure
Possible solutions: uom
at com.softwood.scripts.UoMTest.run(UoMTest.groovy:12)
The static version of the propertyMissing method is called $static_propertyMissing:
static def $static_propertyMissing(String name) {
// do something
}
This method gets invoked by MetaClassImpl at line 1002:
protected static final String STATIC_METHOD_MISSING = "$static_methodMissing";
protected static final String STATIC_PROPERTY_MISSING = "$static_propertyMissing";
// ...
protected Object invokeStaticMissingProperty(Object instance, String propertyName, Object optionalValue, boolean isGetter) {
MetaClass mc = instance instanceof Class ? registry.getMetaClass((Class) instance) : this;
if (isGetter) {
MetaMethod propertyMissing = mc.getMetaMethod(STATIC_PROPERTY_MISSING, GETTER_MISSING_ARGS);
if (propertyMissing != null) {
return propertyMissing.invoke(instance, new Object[]{propertyName});
}
} else {
// .....
}
// ....
}
Example:
class Hello {
static def $static_propertyMissing(String name) {
println "Hello, $name!"
}
}
Hello.World
Output:
Hello, World!

org.apache.spark.SparkException: Task not serializable - When using an argument

I get a Task not serializable error when attempting to use an input parameter in a map:
val errors = inputRDD.map {
case (itemid, itemVector, userid, userVector, rating) =>
(itemid, itemVector, userid, userVector, rating,
(
(rating - userVector.dot(itemVector)) * itemVector)
- h4 * userVector
)
}
I pass h4 in with the arguments for the class.
The map is in a method and it works fine if before the map transformation I put:
val h4 = h4
If I don't do this, or if I put it outside the method, it doesn't work and I get Task not serializable. Why is this occurring? Other vals I generate for the class outside the method work within the method, so why does it fail when the val is instantiated from an input parameter/argument?
The error indicates that the class to which h4 belongs is not Serializable.
Here is a similar example:
class ABC(h: Int) {
def test(s:SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}
new ABC(3).test(sc)
//org.apache.spark.SparkException: Job aborted due to stage failure:
// Task not serializable: java.io.NotSerializableException:
// $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ABC
When this.h is used in an RDD transformation, this becomes part of the closure which gets serialized.
Making the class Serializable works as expected:
class ABC(h: Int) extends Serializable {
def test(s:SparkContext) = s.parallelize(0 to 5).filter(_ > h).collect
}
new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
So does removing the reference to this in the RDD transformation, by defining a local variable in the method:
class ABC(h: Int) {
def test(s:SparkContext) = {
val x = h;
s.parallelize(0 to 5).filter(_ > x).collect
}
}
new ABC(3).test(sc)
// Array[Int] = Array(4, 5)
You can use a broadcast variable. It broadcasts the data from your variable to all of your workers. For more detail, visit this link.
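For illustration, a minimal Java sketch of a broadcast variable (the names BroadcastExample, sc and the sample data are made up; the asker's code is Scala, but the idea is the same): the task references the Broadcast handle rather than a field of the enclosing class, and Spark distributes the value to the workers.
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "broadcast-example");

        // Broadcast the value once; tasks read it through the handle.
        Broadcast<Integer> h4 = sc.broadcast(3);

        JavaRDD<Integer> filtered = sc.parallelize(Arrays.asList(0, 1, 2, 3, 4, 5))
                .filter(x -> x > h4.value()); // only the Broadcast handle is captured

        System.out.println(filtered.collect()); // [4, 5]
        sc.stop();
    }
}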
