Spark flatmap is giving Iterator error

I get an error when I apply a flatMap to turn a JSONArray into JSONObjects.
When I run the job locally (on my laptop) from Eclipse, it works fine, but when I run it on the cluster (YARN), it fails with a strange error.
Spark Version 2.0.0
Code:-
JavaRDD<JSONObject> rdd7 = rdd6.flatMap(new FlatMapFunction<JSONArray, JSONObject>() {
    @Override
    public Iterable<JSONObject> call(JSONArray array) throws Exception {
        List<JSONObject> list = new ArrayList<JSONObject>();
        for (int i = 0; i < array.length(); list.add(array.getJSONObject(i++)));
        return list;
    }
});
error-log:-
java.lang.AbstractMethodError: com.pwc.spark.tifcretrolookup.TIFCRetroJob$2.call(Ljava/lang/Object;)Ljava/util/Iterator;
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:124)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:124)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at com.pwc.spark.ElasticsearchClientLib.CommonESClient.index(CommonESClient.java:33)
at com.pwc.spark.ElasticsearchClientLib.ESClient.call(ESClient.java:34)
at com.pwc.spark.ElasticsearchClientLib.ESClient.call(ESClient.java:15)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

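Since Spark 2.0.0, the function passed to flatMap must return an Iterator instead of an Iterable. The AbstractMethodError in your log shows exactly this: the cluster looks for call(Object) returning java.util.Iterator, while your class was compiled with call returning Iterable, which suggests the local build uses a pre-2.0 Spark dependency. As the release notes state: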
Java RDD’s flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java iterator so the functions do not need to materialize all the data.
And here is the relevant Jira issue
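As a minimal sketch of the fix, the body of call can stay as it is, provided the method returns list.iterator() rather than the list itself. The example below avoids a Spark dependency so that it runs standalone: the FlatMapFunction interface declared here is a hypothetical stand-in that only mirrors the shape of Spark 2.x's FlatMapFunction (call returning an Iterator), and a List<String> stands in for the JSONArray from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FlatMapIteratorSketch {

    // Hypothetical stand-in mirroring the shape of Spark 2.x's FlatMapFunction:
    // call() returns an Iterator, not an Iterable.
    interface FlatMapFunction<T, R> {
        Iterator<R> call(T input) throws Exception;
    }

    // Same logic as in the question, except that the list's iterator is returned.
    static final FlatMapFunction<List<String>, String> SPLIT =
        new FlatMapFunction<List<String>, String>() {
            @Override
            public Iterator<String> call(List<String> array) {
                List<String> list = new ArrayList<String>();
                for (int i = 0; i < array.size(); i++) {
                    list.add(array.get(i));
                }
                return list.iterator();
            }
        };

    // Drains each returned iterator, the way a flatMap caller would.
    public static List<String> flatten(List<List<String>> input) {
        List<String> out = new ArrayList<String>();
        try {
            for (List<String> element : input) {
                Iterator<String> it = SPLIT.call(element);
                while (it.hasNext()) {
                    out.add(it.next());
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"));
        System.out.println(flatten(input)); // prints [a, b, c]
    }
}
```

In the original job, the equivalent change would be to declare call as returning Iterator<JSONObject>, end it with return list.iterator();, and rebuild against the same Spark 2.0.0 artifacts that the cluster runs.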

Related

Why does returning null from a Transformer throw an exception?

My transformer throws an exception when it returns null. I get the message payload and run my business logic in the transformer, then send the response to the file output channel. I also tried the .handle method instead of a transformer, but then I get a one-way message exception.
EDIT
@Bean
IntegrationFlow integrationFlow() {
    return IntegrationFlows.from(this.sftpMessageSource())
        .channel(fileInputChannel())
        .handle(service, "callMethod")
        .channel(fileOutputChannel())
        .handle(orderOutMessageHandler())
        .get();
}
EDIT 2
[ERROR] 2020-06-14 14:49:48.053 [task-scheduler-9] LoggingHandler - java.lang.AbstractMethodError: Method org/springframework/integration/sftp/session/SftpSession.getHostPort()Ljava/lang/String; is abstract
at org.springframework.integration.sftp.session.SftpSession.getHostPort(SftpSession.java)
at org.springframework.integration.file.remote.session.CachingSessionFactory$CachedSession.getHostPort(CachingSessionFactory.java:295)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.copyFileToLocalDirectory(AbstractInboundFileSynchronizer.java:496)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.copyIfNotNull(AbstractInboundFileSynchronizer.java:400)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.transferFilesFromRemoteToLocal(AbstractInboundFileSynchronizer.java:386)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.lambda$synchronizeToLocalDirectory$0(AbstractInboundFileSynchronizer.java:349)
at org.springframework.integration.file.remote.RemoteFileTemplate.execute(RemoteFileTemplate.java:437)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.synchronizeToLocalDirectory(AbstractInboundFileSynchronizer.java:348)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizingMessageSource.doReceive(AbstractInboundFileSynchronizingMessageSource.java:265)
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizingMessageSource.doReceive(AbstractInboundFileSynchronizingMessageSource.java:66)
at org.springframework.integration.endpoint.AbstractFetchLimitingMessageSource.doReceive(AbstractFetchLimitingMessageSource.java:45)
at org.springframework.integration.endpoint.AbstractMessageSource.receive(AbstractMessageSource.java:167)
at org.springframework.integration.endpoint.SourcePollingChannelAdapter.receiveMessage(SourcePollingChannelAdapter.java:250)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.doPoll(AbstractPollingEndpoint.java:359)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.pollForMessage(AbstractPollingEndpoint.java:328)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.lambda$null$1(AbstractPollingEndpoint.java:275)
at org.springframework.integration.util.ErrorHandlingTaskExecutor.lambda$execute$0(ErrorHandlingTaskExecutor.java:57)
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
at org.springframework.integration.util.ErrorHandlingTaskExecutor.execute(ErrorHandlingTaskExecutor.java:55)
at org.springframework.integration.endpoint.AbstractPollingEndpoint.lambda$createPoller$2(AbstractPollingEndpoint.java:272)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
A transformer is designed to always return a reply, because it is a transformation operation. Therefore you can't return null from your method. You probably get the one-way error because your handle method is void.

Caused by: kafka.common.OffsetOutOfRangeException

I'm using Kafka and Spark to stream my data updates into my HBase tables.
But I keep getting an OffsetOutOfRangeException. Here's my code:
new KafkaStreamBuilder()
    .setStreamingContext(streamingContext)
    .setTopics(topics)
    .setDataSourceId(dataSourceId)
    .setOffsetManager(offsetManager)
    .setConsumerParameters(
        ImmutableMap
            .<String, String>builder()
            .putAll(kafkaConsumerParams)
            .put("group.id", groupId)
            .put("metadata.broker.list", kafkaBroker)
            .build()
    )
    .build()
    .foreachRDD(
        rdd -> {
            rdd.foreachPartition(
                iter -> {
                    final Table hTable = createHbaseTable(settings);
                    try {
                        while (iter.hasNext()) {
                            String json = new String(iter.next());
                            try {
                                putRow(hTable, json, settings, barrier);
                            } catch (Exception e) {
                                throw new RuntimeException("hbase write failure", e);
                            }
                        }
                    } catch (OffsetOutOfRangeException e) {
                        throw new RuntimeException(
                            "encountered OffsetOutOfRangeException: ", e);
                    }
                });
        });
My streaming job runs every 5 minutes. Each time the consumers finish a batch, the job writes the latest markers and checkpoints to S3. Before the next run, it reads the previous checkpoints and markers from S3 and starts from there.
Here's the exception stacktrace:
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:188)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:197)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:212)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
What I have done:
I've checked that the markers and checkpoints are both working as expected.
So I'm a bit lost here. How could this exception happen, and what would be a reasonable fix?
Thanks!

Concurrent exception for KafkaConsumer is not safe for multi-threaded access

We're calling a Spark SQL job from Spark Streaming. We get a concurrency exception followed by a "Kafka consumer is closed" error. Here are the code and the exception details:
Kafka consumer code
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe(SparkServiceConfParams.AIR.CONSUME_TOPICS,
sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {}
..
// publish generated json gzip to kafka
decodedStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;
    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            AIRDataSetBean processAIRData = airMainJsonProcessor.processAIRData(json, sparkSession);
Error Details
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Finally Kafka consumer closed:
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: This consumer has already been closed.
This issue was resolved by using the cache or persist option of Spark Streaming. With caching, the RDD is not read from Kafka a second time, which avoids the error and allows the stream to be used concurrently. Use the cache option judiciously, though. Here is the code:
JavaDStream<ConsumerRecord<String, byte[]>> cache = consumerStream.cache();

How to make Spark JDBC driver show full exception?

I want to get an appropriate exception message from Spark JDBC driver.
My test case:
Use Spark 2.0.2 to access the DB through the JDBC driver
First lock the DB, then try to write to it (write modes: append and overwrite)
Check the exception message
Exception message I got:
case1 append: Cause already initialized <-Not good
case2 overwrite: DB2 SQL Error: SQLCODE=-913, SQLSTATE=57033, SQLERRMC=00C9008E;00000210;DSN00009.DRITEMRI.00000001, DRIVER=4.19.56 <-Good
My question:
How can I get an exception message like DB2 SQL Error: SQLCODE=-913, SQLSTATE=57033~ in 'case1 append'?
I guess the reason is that the savePartition function (called when Spark executes saveTable) doesn't surface the underlying exception. But I don't know how to fix it.
Here are the detailed exception messages from the Spark shell:
case1 append:
scala> prodtbl.write.mode("Append").jdbc(url3,"DB2.D_ITEM_INFO",prop1)
17/04/03 17:50:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalStateException: Cause already initialized
at java.lang.Throwable.setCause(Throwable.java:365)
at java.lang.Throwable.initCause(Throwable.java:341)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
case2 overwrite:
scala> prodtbl.write.mode("Overwrite").jdbc(url3,"DB2.D_ITEM_INFO",prop1)
com.ibm.db2.jcc.am.SqlException: DB2 SQL Error: SQLCODE=-913, SQLSTATE=57033, SQLERRMC=00C9008E;00000210;DSN00009.DRITEMRI.00000001, DRIVER=4.19.56
at com.ibm.db2.jcc.am.kd.a(Unknown Source)
at com.ibm.db2.jcc.am.kd.a(Unknown Source)
at com.ibm.db2.jcc.am.kd.a(Unknown Source)
at com.ibm.db2.jcc.am.fp.c(Unknown Source)
at com.ibm.db2.jcc.am.fp.d(Unknown Source)
at com.ibm.db2.jcc.am.fp.b(Unknown Source)
at com.ibm.db2.jcc.t4.bb.i(Unknown Source)
at com.ibm.db2.jcc.t4.bb.c(Unknown Source)
at com.ibm.db2.jcc.t4.p.b(Unknown Source)
at com.ibm.db2.jcc.t4.vb.h(Unknown Source)
at com.ibm.db2.jcc.am.fp.jb(Unknown Source)
at com.ibm.db2.jcc.am.fp.a(Unknown Source)
at com.ibm.db2.jcc.am.fp.c(Unknown Source)
at com.ibm.db2.jcc.am.fp.executeUpdate(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.dropTable(JdbcUtils.scala:94)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:422)
... 48 elided
initCause is documented as follows:
public Throwable initCause(Throwable cause)
Initializes the cause of this throwable to the specified value. (The cause is the throwable that caused this throwable to get thrown.)
This method can be called at most once. It is generally called from within the constructor, or immediately after creating the throwable. If this throwable was created with Throwable(Throwable) or Throwable(String,Throwable), this method cannot be called even once.
An example of using this method on a legacy throwable type without other support for setting the cause is:
try {
    lowLevelOp();
} catch (LowLevelException le) {
    throw (HighLevelException)
        new HighLevelException().initCause(le); // Legacy constructor
}
Parameters:
cause - the cause (which is saved for later retrieval by the getCause() method). (A null value is permitted, and indicates that the cause is nonexistent or unknown.)
Returns:
a reference to this Throwable instance.
Throws:
IllegalArgumentException - if cause is this throwable. (A throwable cannot be its own cause.)
IllegalStateException - if this throwable was created with Throwable(Throwable) or Throwable(String,Throwable), or this method has already been called on this throwable.
This means that the exception handled in savePartition already had its cause set, so the call to initCause failed with IllegalStateException, and that failure replaced the original error message you were expecting.
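To illustrate the Javadoc above, here is a small standalone sketch. The class and method names are made up for the example, and it is not Spark's code. It shows that calling initCause on a throwable that was constructed with a cause is rejected with an IllegalStateException (the same exception type as in case1), whereas a throwable created without a cause accepts one initCause call. It also shows one possible way to keep the original message visible: wrapping the error in a new exception instead of modifying it.

```java
import java.sql.SQLException;

public class InitCauseSketch {

    // Returns true if initCause is rejected for the given throwable.
    public static boolean initCauseRejected(Throwable target, Throwable cause) {
        try {
            target.initCause(cause);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Wraps the original error instead of mutating it, so its message stays visible.
    public static RuntimeException wrap(Throwable original) {
        return new RuntimeException("write failed: " + original.getMessage(), original);
    }

    public static void main(String[] args) {
        Throwable root = new IllegalArgumentException("root");

        // Created with a cause: initCause can no longer be called.
        SQLException withCause = new SQLException("lock timeout", root);
        System.out.println(initCauseRejected(withCause, new RuntimeException("other"))); // true

        // Created without a cause: the first initCause call succeeds.
        SQLException withoutCause = new SQLException("lock timeout");
        System.out.println(initCauseRejected(withoutCause, root)); // false

        // Wrapping keeps the original message reachable.
        System.out.println(wrap(withCause).getMessage()); // write failed: lock timeout
    }
}
```

The exact text of the IllegalStateException message depends on the JVM, so the sketch checks only the exception type.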

NoHostAvailableException occurs in datastax driver 3.0

I use multiple threads to query data. The threads all share the same Session, and the following exception occurs:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:63)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:47)
at com.tss.storage.LuceneTask.run(LuceneTask.java:51)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:211)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:43)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:277)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:91)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
... 8 more
Can anyone help me?
Thanks!
The ExecutorService runs this:
RangeCondition range = range("time").lower(start).upper(end);
String param = search().filter(range).build();
Stopwatch start = Stopwatch.createStarted();
ResultSet result = null;
try {
    Session session = storageClient.getSession();
    result = session.execute("SELECT * FROM demo.tweets WHERE expr(tweets_idx, ?)", param);
} catch (Exception e) {
    e.printStackTrace();
}
