Spark: JavaRDD.map does not accept anonymous function - apache-spark

I am trying to convert JavaRDD<String> to JavaRDD<Row> using an anonymous function. Here is my code:
JavaRDD<String> listData = jsc.textFile("/src/main/resources/CorrectLabels.csv");
JavaRDD<Row> jrdd = listData.map(new Function<String, Row>() {
    public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        return RowFactory.create(fields[1], fields[0].trim());
    }
});
But on doing this, I get back the following error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Details of Stack:
Serialization stack:
- object not serializable (class: com.cpny.ml.supervised.FeatureExtractor, value: com.cpny.ml.supervised.FeatureExtractor@421056e5)
- field (class: com.cpny.ml.supervised.FeatureExtractor$1, name: this$0, type: class com.cpny.ml.supervised.FeatureExtractor)
- object (class com.cpny.ml.supervised.FeatureExtractor$1, com.cpny.ml.supervised.FeatureExtractor$1@227a47)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
Any idea where I am going wrong?
Thanks! K

The exception you are getting is not related to the anonymous function.
The FeatureExtractor class is either not Serializable or contains non-Serializable fields.
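For illustration, a minimal sketch of two possible fixes, assuming the class layout implied by the stack trace (the nested class name LineToRow is made up here): either move the mapping function into a static nested class so it carries no hidden reference to FeatureExtractor, or make FeatureExtractor itself Serializable and mark any non-serializable fields as transient.

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class FeatureExtractor {

    // Option 1: a static nested class has no implicit this$0 field pointing at
    // FeatureExtractor, so FeatureExtractor itself never needs to be serialized.
    // Spark's Function interface already extends Serializable.
    static class LineToRow implements Function<String, Row> {
        @Override
        public Row call(String record) {
            String[] fields = record.split(",");
            return RowFactory.create(fields[1], fields[0].trim());
        }
    }

    // Option 2: alternatively declare
    //   public class FeatureExtractor implements java.io.Serializable { ... }
    // and mark any non-serializable fields (e.g. a JavaSparkContext) as transient.
}

With option 1 the original call site becomes listData.map(new FeatureExtractor.LineToRow()).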

Thanks @slovit..
My earlier setup was: MainClass calling FeatureExtractor to get the JavaRDD. FeatureExtractor was not Serializable before; after making it Serializable, I no longer get the issue.
But on another note, MainClass was my entry point for submitting the Spark job:
./bin/spark-submit --class com.cpny.ml.supervised.MainClass --master spark://localhost:7077 /mltraining/target/mltraining-0.0.1-SNAPSHOT.jar
But MainClass is not marked as Serializable either, yet when I put the anonymous function inside MainClass I don't get the issue. How did MainClass get serialized when the other class did not?
PS: Maybe this is not a Spark question but a basic Java question... sorry!

Related

Groovy - ClassCastException accessing property

Groovy: 2.5.15
Java: 11
I have a Java class:
@Data
public class SpecialDTO {
    Long originId;
}
But, when I access it from my Groovy code:
Long myId = specialdto.originId
I get this exception:
java.lang.ClassCastException: class SpecialDTO cannot be cast to class groovy.lang.GroovyObject (SpecialDTO and groovy.lang.GroovyObject are in unnamed module of loader 'app')
Which appears bogus as I can insert this in the code and it works:
GroovyObject obj = specialdto
We do this all over the place, but this specific test is failing. I can substitute 'getOriginId()' and get the desired results, but it's just such an odd error given that it only happens with this object.

ModelMapper 2.4.4 and Groovy 3.0 compatibility issue

When switching from Groovy 2 to Groovy 3, ModelMapper 2.4.4 seems to now be failing to convert objects. ModelMapper itself does not throw an error, but rather just returns an object whose metaClass is still the initial class rather than the new post-conversion class.
This is demonstrated in the code below which, when run with Groovy 3 (tested with 3.0.2 and 3.0.9), throws java.lang.IllegalArgumentException: object is not an instance of declaring class when accessing any of the properties of the returned object post-ModelMapping. This error does not happen when run with Groovy 2 (2.5.15).
Dependencies:
org.modelmapper:modelmapper:2.4.4
org.codehaus.groovy:groovy:3.0.9
import org.modelmapper.ModelMapper

class TestClass {
    String fieldA
    String fieldB
}

class TestClassDto {
    String fieldA
    String fieldB
}

TestClassDto test = new TestClassDto(fieldA: 'anyA', fieldB: 'anyB')
System.out.println(new ModelMapper().map(test, TestClass).fieldA)
The issue is that the metaClass property gets automatically mapped as well: the resulting TestClass instance ends up with TestClassDto's metaClass, so Groovy treats the final object as if it were still a TestClassDto. There are multiple ways to fix this.
You can explicitly set the metaclass after the mapping has been done:
def 'testMapping'() {
    given:
    TestClassDto test = new TestClassDto(fieldA: 'anyA', fieldB: 'anyB')
    def mapped = new ModelMapper().map(test, TestClass)
    mapped.metaClass = TestClass.metaClass

    expect:
    mapped.fieldA == 'anyA'
}
Alternatively, use @CompileStatic so that the metaClass isn't generated at all.
Or you can configure ModelMapper to skip the metaClass property:
mapper.typeMap(TestClassDto, TestClass)
      .addMappings({ it.skip(GroovyObject::setMetaClass) } as ExpressionMap)

Java Spark Dataset MapFunction - Task not serializable without any reference to class

I have the following class that reads CSV data into Spark's Dataset. Everything works fine if I simply read and return the data.
However, if I apply a MapFunction to the data before returning from the function, I get
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow.
I understand that Spark needs to serialize objects for distributed processing; however, I'm NOT using any reference to the Workflow class in my mapping logic. I'm not calling any Workflow class function in my mapping logic. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.
public class Workflow {

    private final SparkSession spark;
    private final String dataPath;  // path field referenced below

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
            .add("text", "string", false)
            .add("category", "string", false);

        Dataset<Row> data = spark.read()
            .schema(schema)
            .csv(dataPath);

        /*
         * works fine till here if I call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();

        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}
Anonymous inner classes have a hidden/implicit reference to the enclosing class. Use a lambda expression, or go with Roma Anankin's solution.
You could make Workflow implement Serializable and mark the SparkSession field as transient.
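For illustration, a minimal sketch that combines both suggestions, assuming the fields from the question (dataPath is a hypothetical field inferred from the csv(dataPath) call) and with the mapping logic reduced to a placeholder; the lambda references nothing from the enclosing instance, so no hidden this$0 ends up in the serialized closure:

import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow implements Serializable {

    // transient: the driver-side session is never shipped to executors
    private transient SparkSession spark;
    private String dataPath;  // hypothetical field, inferred from the csv(dataPath) call above

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
                .add("text", "string", false)
                .add("category", "string", false);

        Dataset<Row> data = spark.read().schema(schema).csv(dataPath);

        // A lambda captures only what it uses; this one uses nothing from Workflow,
        // so no hidden this$0 reference is dragged into the serialized closure.
        return data.map(
                (MapFunction<Row, Row>) row -> row,  // placeholder for the mapping logic
                RowEncoder.apply(schema));
    }
}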

Spark Java API Task not serializable when not using Lambda

I am seeing a behavior in Spark (2.2.0) that I do not understand, but I am guessing it's related to lambdas and anonymous classes, when trying to extract out a lambda function:
This works:
public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        return events.filter( ( FilterFunction< String > ) x -> x.length() > 3 );
    }
}
Yet this does not:
public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        FilterFunction< String > filter = new FilterFunction< String >(){
            @Override public boolean call( String value ) throws Exception
            {
                return value.length() > 3;
            }
        };
        return events.filter( filter );
    }
}
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) ...
...
Caused by: java.io.NotSerializableException: ...EventsFilter
..Serialization stack:
- object not serializable (class: ...EventsFilter, value: ...EventsFilter@e521067)
- field (class: ...EventsFilter$1, name: this$0, type: class ...EventsFilter)
- object (class ...EventsFilter$1, ...EventsFilter$1@5c70d7f0)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 4)
- field (class:
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
I am testing against:
@Test
public void test()
{
    EventsFilter filter = new EventsFilter();
    Dataset<String> input = SparkSession.builder().appName( "test" ).master( "local" ).getOrCreate()
        .createDataset( Arrays.asList( "123" , "123" , "3211" ) ,
                        Encoders.kryo( String.class ) );
    Dataset<String> res = filter.filter( input );
    assertThat( res.count() , is( 1L ) );
}
Even weirder, when put in a static main, both versions seem to work...
How does defining the function explicitly inside a method cause that sneaky 'this' reference to be serialized?
Java's inner classes hold a reference to the outer class. Your outer class is not serializable, so the exception is thrown.
Lambdas do not hold that reference if it is not used, so there is no problem with a non-serializable outer class. More here
I was under the false impression that lambdas are implemented under the hood as inner classes. This is no longer the case (very helpful talk).
Also, as T. Gawęda answered, inner classes do in fact hold a reference to the outer class, even if it is not needed (here). This difference explains the behavior.
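For illustration, a minimal sketch of one way to keep the explicit (non-lambda) form without pulling EventsFilter into the closure, assuming the class from the question (the nested class name LongerThanThree is made up here): a static nested class has no hidden this$0 field, and FilterFunction already extends Serializable.

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;

public class EventsFilter
{
    // Static nested class: no implicit reference to the enclosing EventsFilter instance.
    private static class LongerThanThree implements FilterFunction< String >
    {
        @Override public boolean call( String value ) throws Exception
        {
            return value.length() > 3;
        }
    }

    public Dataset< String > filter( Dataset< String > events )
    {
        return events.filter( new LongerThanThree() );
    }
}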

java.lang.VerifyError: Bad access to protected data

I have the following Groovy file "test.groovy":
import groovy.transform.CompileStatic

@CompileStatic
class Test {
    final Set<String> HISTORY = [] as HashSet

    Set<String> getHistory() {
        return HISTORY.clone() as HashSet<String>
    }
}

Test test = new Test()
println test.history
Compiling it with Groovy 2.4.1 works fine; however, when I run "groovy test.class" I get the following error:
Caught: java.lang.VerifyError:
(class: Test, method: getHistory signature:()Ljava/util/Set;)
Bad access to protected data
java.lang.VerifyError:
(class: Test, method: getHistory
signature: ()Ljava/util/Set;)
Bad access to protected data
at test.run(test.groovy:12)
Any ideas what I am doing wrong here?
This actually is a bug in Groovy. A ticket was filed: https://issues.apache.org/jira/browse/GROOVY-7325
Workaround in this case:
It works if you declare the field as a final HashSet<String> and then cast the clone up. Since the getter effectively overrides the property anyway (make the field private if you want to be sure), this should not change the intent of the original code.
