IgniteQueue getting - SparkException: Task not serializable - apache-spark

When we use IgniteQueue inside a Spark map as below:
sparkDataFrame.map(row => {
  igniteQueue.put(row)
})
we get a SparkException: Task not serializable. This is because IgniteQueue is not Serializable.
Is there a way to make IgniteQueue serializable?
Thanks in Advance!!!

You don't need to serialize IgniteQueue; instead, obtain it inside the Spark task directly from the Ignite instance, for example:
// Create an Ignite context from the Spark context and an Ignite configuration file.
JavaIgniteContext<Integer, Integer> igniteContext = new JavaIgniteContext<Integer, Integer>(
        sparkContext, "examples/config/spark/example-shared-rdd.xml", false);
// Obtain the Ignite instance and get the queue from it.
Ignite ignite = igniteContext.ignite();
IgniteQueue queue = ignite.queue(name, cap, null);
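For illustration, here is a minimal sketch (not from the original answer) of that pattern: the queue is obtained inside the Spark task from the Ignite instance available on the executor, so nothing non-serializable is captured by the closure. The queue name "rowQueue", the use of foreachPartition, and the assumption that a default (unnamed) Ignite node or client is already running on each executor are illustrative choices, not part of the original question.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CollectionConfiguration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class QueueWriter {
    public static void write(Dataset<Row> sparkDataFrame) {
        sparkDataFrame.toJavaRDD().foreachPartition(rows -> {
            // Look up the Ignite instance on the executor instead of serializing the queue.
            Ignite ignite = Ignition.ignite();
            IgniteQueue<Row> queue = ignite.queue("rowQueue", 0, new CollectionConfiguration());
            while (rows.hasNext()) {
                queue.put(rows.next());
            }
        });
    }
}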
Also, you can find information about integrating Ignite with Spark here.

Does your class implement Serializable? If not, make it serializable and check again. Which dependencies are you using for ignite-spark, and which version?

Related

Java Spark Dataset MapFunction - Task not serializable without any reference to class

I have the following class that reads CSV data into Spark's Dataset. Everything works fine if I simply read and return the data.
However, if I apply a MapFunction to the data before returning from function, I get
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow.
I understand how Spark works and that it needs to serialize objects for distributed processing; however, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow method there. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.
public class Workflow {
private final SparkSession spark;
public Dataset<Row> readData(){
final StructType schema = new StructType()
.add("text", "string", false)
.add("category", "string", false);
Dataset<Row> data = spark.read()
.schema(schema)
.csv(dataPath);
/*
* works fine till here if I call
* return data;
*/
Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
public Row call(Row row){
/* some mapping logic */
return row;
}
}, RowEncoder.apply(schema));
cleanedData.printSchema();
/* .... ERROR .... */
cleanedData.show();
return cleanedData;
}
}
Anonymous inner classes hold a hidden/implicit reference to their enclosing class. Use a lambda expression or go with Roma Anankin's solution.
You could also make Workflow implement Serializable and mark the SparkSession field as transient.
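For illustration, a minimal sketch of the lambda-based fix, dropped into readData() from the question (the mapping body is a placeholder; RowEncoder is org.apache.spark.sql.catalyst.encoders.RowEncoder). A Java lambda does not capture the enclosing Workflow instance unless it references its members, so Spark no longer tries to serialize Workflow:

// Replaces the anonymous MapFunction inside readData(); "data" and "schema" are from the question.
Dataset<Row> cleanedData = data.map(
        (MapFunction<Row, Row>) row -> {
            /* some mapping logic */
            return row;
        },
        RowEncoder.apply(schema));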

Reading a file inside reduceByKey() method in Spark - Java

I am working on a Spark application that expands edges by adding the adjacent vertices to those edges. I am using the Map/Reduce paradigm, where I want to partition the total set of edges and expand them on different worker nodes.
To accomplish that I need to read the partitioned adjacency list in the worker nodes based on the key value, but I get an error while trying to load files inside the reduceByKey() method. It says the task is not serializable. My code:
public class MyClass implements Serializable {
    public static void main(String args[]) throws IOException {
        SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> file = sc.textFile("hdfs://localhost:9000/mainFile.txt");
        ... ... ... //Mapping done successfully
        JavaPairRDD<String, String> rdd1 = pairs.reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String v1, String v2) throws Exception {
                ... ... ...
                JavaRDD<String> adj = sc.textFile("hdfs://localhost:9000/adjacencyList_" + key + "txt");
                //Here I want to expand the edges after reading the adjacency list.
            }
        });
    }
}
But I am getting an error Task not serializable. Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable. I think this is because I am using the same SparkContext in the worker nodes as in the driver program. If I try to create a new SparkContext inside the reduceByKey() method, I also get an error saying that only one SparkContext may be running in this JVM.
Can anyone tell me how I can read a file inside the reduceByKey() method? Is there any other way to accomplish my task? I want the expansion of the edges to happen in the worker nodes so that it can run in a distributed way.
Thanks in advance.
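For reference, no SparkContext can be used inside a task: the context exists only on the driver, which is why it fails to serialize. One common alternative, sketched below under assumptions not stated in the question (the adjacency lists combined into a single tab-separated file keyed by vertex), is to load the adjacency data as a pair RDD on the driver and join it with the edge pairs by key:

// Sketch only; "sc" and "pairs" are the variables from the question above.
JavaPairRDD<String, String> adjacency = sc
        .textFile("hdfs://localhost:9000/adjacencyList.txt")      // hypothetical combined file
        .mapToPair(line -> {
            String[] parts = line.split("\t", 2);                  // hypothetical format: key<TAB>neighbours
            return new scala.Tuple2<String, String>(parts[0], parts[1]);
        });

// Join edges with their adjacency entries instead of reading files inside reduceByKey().
JavaPairRDD<String, scala.Tuple2<String, String>> expanded = pairs.join(adjacency);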

Spark: JavaRDD.map does not accept anonymous function

I am trying to convert JavaRDD<String> to JavaRDD<Row> using an anonymous function. Here is my code:
JavaRDD<String> listData = jsc.textFile("/src/main/resources/CorrectLabels.csv");
JavaRDD<Row> jrdd = listData.map(new Function<String, Row>() {
    public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        return RowFactory.create(fields[1], fields[0].trim());
    }
});
But on doing this, I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Details of Stack:
Serialization stack:
- object not serializable (class: com.cpny.ml.supervised.FeatureExtractor, value: com.cpny.ml.supervised.FeatureExtractor@421056e5)
- field (class: com.cpny.ml.supervised.FeatureExtractor$1, name: this$0, type: class com.cpny.ml.supervised.FeatureExtractor)
- object (class com.cpny.ml.supervised.FeatureExtractor$1, com.cpny.ml.supervised.FeatureExtractor$1@227a47)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
Any idea where I am going wrong?
Thanks! K
The exception you are getting is not related to the anonymous function.
The FeatureExtractor class is either not Serializable or contains non-Serializable fields.
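For illustration, one way to avoid serializing FeatureExtractor at all is to move the function into a static nested class (a lambda behaves the same way), so it carries no hidden this$0 reference to the enclosing instance. This is a sketch; the class layout and method name are assumed:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class FeatureExtractor {

    // Static nested class: serializable on its own, no reference to FeatureExtractor.
    private static class ToRow implements Function<String, Row> {
        @Override
        public Row call(String record) {
            String[] fields = record.split(",");
            return RowFactory.create(fields[1], fields[0].trim());
        }
    }

    public JavaRDD<Row> toRows(JavaRDD<String> listData) {
        return listData.map(new ToRow());
    }
}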
Thanks @slovit..
My earlier setup was: MainClass calling FeatureExtractor to get the JavaRDD. This class was not Serializable before. Now, after making it Serializable, I no longer get the issue.
But on another note, MainClass was my starting point to submit a Spark job as:
./bin/spark-submit --class com.cpny.ml.supervised.MainClass --master spark://localhost:7077 /mltraining/target/mltraining-0.0.1-SNAPSHOT.jar
But MainClass is not marked as Serializable, and when I include the anonymous function in MainClass I don't get the issue. How did MainClass get serialized when the other class did not?
PS: Maybe this is not a Spark question but a basic Java question.. sorry!

Cassandra @Table enum type not found

I am using Spring Data Cassandra, and I have an @Table entity as defined below.
@Table(CassandraConstants.NotificationThread.NAME)
public class Event implements Serializable {
    private static final long serialVersionUID = 1L;

    @PrimaryKey
    private EventKey primaryKey;

    @Column(value = CassandraConstants.Event.COL_COMPONENT_TYPE)
    private ComponentType componentType;
    ...
}
In my DAO code I set the enum value and do a save, but I get an error.
event.setComponentType(ComponentType.CONNECTOR);
....
this.eventDao.save(event);
But I see this error reported while doing the save:
Invalid value CONNECTOR of type unknown to the query builder...
Does Spring Data not handle the conversion of enums to a string data type for Cassandra?
Any pointers to what is failing here?
Are you using the official spring-data-cassandra project (repo at https://github.com/spring-projects/spring-data-cassandra), which is under the Spring Data umbrella of projects, or a different project?
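Whether enums are mapped automatically depends on the spring-data-cassandra version in use. As a sketch of a version-independent workaround (not from the original thread), the enum can be persisted as a String column and converted in the accessors:

@Table(CassandraConstants.NotificationThread.NAME)
public class Event implements Serializable {
    private static final long serialVersionUID = 1L;

    @PrimaryKey
    private EventKey primaryKey;

    // Stored as text in Cassandra; exposed as the enum to callers.
    @Column(value = CassandraConstants.Event.COL_COMPONENT_TYPE)
    private String componentType;

    public ComponentType getComponentType() {
        return componentType == null ? null : ComponentType.valueOf(componentType);
    }

    public void setComponentType(ComponentType type) {
        this.componentType = type == null ? null : type.name();
    }
}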

Spring Integration jdbc:inbound-channel-adapter - Set max-rows-per-poll dynamic to throttle

I have a jdbc:inbound-channel-adapter and want to set its 'max-rows-per-poll' dynamically to throttle the messages passed on the channel.
I have a QueueChannel with a capacity of 200. The inbound-channel-adapter sends messages to this QueueChannel, and I would like to set the 'max-rows-per-poll' value depending on the remaining capacity of the QueueChannel.
For this I tried to inject the QueueChannel into a bean, but I get an error when deploying the war file:
Error: Cannot Inject the QueueChannel due to StateConversionError.
Is there any other way I could achieve this?
Update: I am using Spring-Integration-2.2.0.RC2.
This is the config for the jdbc inbound adapter:
<si-jdbc:inbound-channel-adapter id="jdbcInboundAdapter" channel="queueChannel"
        data-source="myDataSource" auto-startup="true" query="${select.query}"
        update="${update.query}" max-rows-per-poll="100" row-mapper="rowMapper"
        update-per-row="true">
    <si:poller fixed-rate="5000">
        <si:transactional/>
        <si:advice-chain>
            <bean class="foo.bar.InboundAdapterPollingConfigurationImpl"/>
        </si:advice-chain>
    </si:poller>
</si-jdbc:inbound-channel-adapter>
Bean:
@Service
public class InboundAdapterPollingConfigurationImpl implements InboundAdapterPollingConfiguration {

    private static final Logger logger = LoggerFactory.getLogger(InboundAdapterPollingConfigurationImpl.class);

    @Autowired
    QueueChannel queueChannel;

    @Autowired
    SourcePollingChannelAdapter jdbcInboundAdapter;

    public void setJdbcInboundAdapterMaxRowsPerPoll() {
        String size = String.valueOf(queueChannel.getRemainingCapacity());
        DirectFieldAccessor directFieldAccessor = new DirectFieldAccessor(jdbcInboundAdapter);
        directFieldAccessor.setPropertyValue("setMaxRowsPerPoll", size);
        String maxRowsPerPollSize = (String) directFieldAccessor.getPropertyValue("setMaxRowsPerPoll");
        System.out.println(maxRowsPerPollSize);
    }
}
The question is how to call the InboundAdapterPollingConfigurationImpl.setJdbcInboundAdapterMaxRowsPerPoll() method from the advice chain. Sorry for the naive question, but it is my first time using the advice-chain. I have also been searching for an example but have not been lucky yet.
Update 2:
I got the error below when this line is executed:
JdbcPollingChannelAdapter source = (JdbcPollingChannelAdapter) dfa.getPropertyValue("source");
Error:
java.lang.ClassCastException: $Proxy547 cannot be cast to org.springframework.integration.jdbc.JdbcPollingChannelAdapter
I have JDK 1.6_26. I read in one of the posts that this happens with early versions of JDK 1.6.
Well, let's try to investigate!
max-rows-per-poll is a volatile property of JdbcPollingChannelAdapter with an appropriate setter.
Since JdbcPollingChannelAdapter does its work within its receive() method, invoked from TaskScheduler.schedule(), changing that property at runtime looks safe. That is the first point for our task.
QueueChannel has a getQueueSize() method. Since the capacity is your configuration option, you can easily calculate a value for max-rows-per-poll.
Now, how to get it working? You are really only interested in the value of max-rows-per-poll on each poll, so we should somehow wedge into the poller or the polling task. <poller> has an advice-chain sub-element, and we can write an Advice that changes JdbcPollingChannelAdapter#setMaxRowsPerPoll before receive() is invoked, with a value based on QueueChannel#getQueueSize().
Inject your QueueChannel into the bean of your Advice.
And now the bad point: how to inject the JdbcPollingChannelAdapter bean? A hook to register MessageSources as beans is provided only since Spring Integration 3.0. From there, it's enough to write this code:
@Autowired
@Qualifier("jdbcAdapter.source")
private JdbcPollingChannelAdapter messageSource;
We are going to release 3.0.GA this week, so let me not go into the reflection 'forest' needed prior to Spring Integration 3.0. However, you can do it using a DirectFieldAccessor on the injected SourcePollingChannelAdapter bean.
UPDATE
Your Advice may look like this:
public class MyAdvice implements MethodInterceptor {

    @Autowired
    QueueChannel queueChannel;

    @Autowired
    SourcePollingChannelAdapter jdbcInboundAdapter;

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        // Adjust max-rows-per-poll to the queue's remaining capacity before each poll.
        DirectFieldAccessor dfa = new DirectFieldAccessor(jdbcInboundAdapter);
        JdbcPollingChannelAdapter source = (JdbcPollingChannelAdapter) dfa.getPropertyValue("source");
        source.setMaxRowsPerPoll(queueChannel.getRemainingCapacity());
        return invocation.proceed();
    }
}
The theory is here: http://docs.spring.io/spring/docs/3.2.5.RELEASE/spring-framework-reference/htmlsingle/#aop
