Structured Streaming goes idle after running for a while - apache-spark

There is a Structured Streaming job running on AWS EMR. Everything appears to be fine, but after some time, approximately 24 hours, the stream stops processing messages coming from Kafka; when I restart the streaming job it processes them again.
Is there some configuration that addresses this issue?
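There is no single configuration guaranteed to prevent this, but one way to at least detect the idle state is to watch the query's progress from the driver and alert (or restart) when no micro-batch has completed for a while. A minimal sketch in Java, assuming the StreamingQuery handle returned by writeStream(...).start(); the threshold and the reaction are illustrative:

import java.time.Duration;
import java.time.Instant;

import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryProgress;

// Watchdog sketch: periodically check when the query last reported progress and
// flag it as idle if no micro-batch has completed within MAX_IDLE.
public final class IdleQueryWatchdog {

  private static final Duration MAX_IDLE = Duration.ofMinutes(30); // illustrative threshold

  public static boolean isIdle(StreamingQuery query) {
    StreamingQueryProgress progress = query.lastProgress();
    if (progress == null) {
      return false; // no batch has completed yet, nothing to compare against
    }
    Instant lastBatch = Instant.parse(progress.timestamp()); // ISO-8601 timestamp of the last progress event
    return Duration.between(lastBatch, Instant.now()).compareTo(MAX_IDLE) > 0;
  }

  public static void watch(StreamingQuery query) throws InterruptedException {
    while (query.isActive()) {
      if (isIdle(query)) {
        // Alert here, or stop the query and start it again from the same
        // writeStream definition; the reaction is up to the job.
        System.err.println("Streaming query has made no progress for over " + MAX_IDLE);
      }
      Thread.sleep(60_000L); // poll once a minute
    }
  }
}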

Related

Structured Streaming is failing but DataProc continues with Running status

We are migrating 2 Spark Streaming jobs using Structured Streaming from on-prem to GCP.
One of them streams messages from Kafka and saves them to GCS; the other streams from GCS and saves to BigQuery.
Sometimes these jobs fail because of some problem, for example OutOfMemoryError, Connection reset by peer, or Java heap space.
When we get an exception in the on-prem environment, YARN marks the job as FAILED and we have a scheduler flow that restarts the job.
In GCP we built the same flow to restart the job when it fails. But when we get an exception in DataProc, YARN marks the job as SUCCEEDED and DataProc remains in the RUNNING status.
You can see in the attached image the log with the StreamingQueryException while the status of the job is still Running ("Em execução" means "running" in Portuguese).
(screenshot: Dataproc job status page)
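A pattern that usually makes the failure visible to YARN and Dataproc is to let the StreamingQueryException escape the driver's main() (or exit with a non-zero code) instead of swallowing it, so the application finishes abnormally and the external restart flow can kick in. A minimal sketch, assuming the query handle comes from the existing writeStream(...).start() call; everything except the Spark API names is illustrative:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

public final class KafkaToGcsJob {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kafka-to-gcs").getOrCreate();

    // startQuery(...) stands in for the real Kafka -> GCS pipeline definition.
    StreamingQuery query = startQuery(spark);

    try {
      query.awaitTermination(); // throws StreamingQueryException if the query dies with an error
    } catch (StreamingQueryException e) {
      // Re-throw so the driver exits abnormally and YARN reports the application
      // as FAILED rather than SUCCEEDED; the scheduler flow can then resubmit it.
      throw new RuntimeException("Streaming query failed", e);
    }
  }

  private static StreamingQuery startQuery(SparkSession spark) {
    throw new UnsupportedOperationException("placeholder for the real pipeline");
  }
}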

Spark structured streaming job stuck for hours without getting killed

I have a Structured Streaming job which reads from Kafka, performs aggregations, and writes to HDFS. The job runs in cluster mode on YARN, with Spark 2.4.
Every 2-3 days this job gets stuck. It doesn't fail, but it gets stuck at some micro-batch; the micro-batch doesn't even start. The driver keeps printing the following log line for hours:
Got an error when resolving hostNames. Falling back to /default-rack for all.
When I kill the streaming job and start it again, it runs fine.
How can I fix this?
See this issue: https://issues.apache.org/jira/browse/SPARK-28005
This is fixed in Spark 3.0. It seems to happen when there are no active executors.
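The linked ticket is about the repeated rack-resolution logging; the underlying symptom is the job sitting with zero active executors. Until an upgrade to Spark 3.0 is possible, a mitigation sometimes tried (not taken from the ticket, and not guaranteed to help in every setup) is to keep at least one executor allocated under dynamic allocation. A sketch of how that could be configured when building the session; the values and application name are illustrative:

import org.apache.spark.sql.SparkSession;

public final class SessionConfig {

  // Illustrative only: with dynamic allocation, keep at least one executor
  // allocated so the driver is never left with zero active executors.
  public static SparkSession build() {
    return SparkSession.builder()
        .appName("kafka-aggregation-to-hdfs")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")      // external shuffle service is required for dynamic allocation on YARN in Spark 2.x
        .config("spark.dynamicAllocation.minExecutors", "1")
        .getOrCreate();
  }
}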

Spark streaming job exited abruptly - RECEIVED SIGNAL TERM

The Spark streaming job, which is supposed to run continuously, exited abruptly with the following error (found in the executor logs):
2017-07-28 00:19:38,807 [SIGTERM handler] ERROR org.apache.spark.util.SignalUtils$$anonfun$registerLogger$1$$anonfun$apply$1 (SignalUtils.scala:43) - RECEIVED SIGNAL TERM
The spark streaming job ran for ~62 hours before receiving this signal.
I couldn't find any other ERROR/WARN entries in the executor logs. Unfortunately I haven't set up driver logging yet, so I am not able to dig deeper into this specific issue.
I am using a Spark cluster in standalone mode.
Is there any reason why the driver might send this signal after the streaming job had run fine for more than 60 hours?
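Since the driver logs were not captured, one way to gather more evidence next time is to register a JVM shutdown hook in the driver and log the moment shutdown begins; that timestamp can then be correlated with the standalone master and worker logs to see whether the driver stopped first (which would explain the executors receiving SIGTERM). A small diagnostic sketch; the class name and the log destination are assumptions:

public final class ShutdownDiagnostics {

  // Call once from the driver's main() before starting the streaming context.
  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        System.err.println("Driver JVM shutting down at " + java.time.Instant.now()
            + " - correlate this timestamp with the standalone master/worker logs.")));
  }
}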

Spark Job Processing Time increases to 4s without explanation

We are running a cluster with 1 namenode and 3 datanodes on Azure, and I am running my Spark job on it in yarn-cluster mode.
We are using HDP 2.5, which ships with Spark 1.6.2. Now I have a very strange issue where the processing time of my job suddenly increases to 4 s.
This has happened quite a few times but does not follow a pattern; sometimes the 4 s processing time appears right from the start of the job, sometimes in the middle.
One thing to note is that no events are coming in to be processed, so the processing time should stay roughly constant. My Spark streaming job has a batch duration of 1 s, so the 4 s cannot be the batch interval either.
I don't have any errors in the logs or anywhere else, and I am at a loss as to how to debug this issue.
Minor details about the job:
I am reading messages from a Kafka topic and then storing them in HBase tables using the Phoenix JDBC connector.
EDIT: More Information
In the InsertTransactionsPerRDDPartitions, I am performing connection open and write operation to HBase using Phoenix JDBC connectivity.
updatedEventLinks.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        rdd.foreachPartition(new InsertTransactionsPerRDDPartitions(this.prop));
        rdd.foreachPartition(new DoSomethingElse(this.kafkaPublishingProps, this.prop));
    }
});
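For context, a hypothetical sketch of what a per-partition Phoenix writer like InsertTransactionsPerRDDPartitions typically looks like: one JDBC connection is opened per partition, the partition's records are upserted, and the connection is committed and closed. The record type, table, column names, and property key below are assumptions, not taken from the actual job:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.Properties;

import org.apache.spark.api.java.function.VoidFunction;

/** Minimal placeholder for whatever record type the DStream actually carries. */
class Transaction implements java.io.Serializable {
  String id;
  String payload;
}

public class InsertTransactionsPerRDDPartitions implements VoidFunction<Iterator<Transaction>> {

  private final Properties prop; // expected to carry the Phoenix/ZooKeeper connection settings

  public InsertTransactionsPerRDDPartitions(Properties prop) {
    this.prop = prop;
  }

  @Override
  public void call(Iterator<Transaction> partition) throws Exception {
    String url = "jdbc:phoenix:" + prop.getProperty("zookeeper.quorum"); // property key is an assumption
    try (Connection conn = DriverManager.getConnection(url);
         PreparedStatement stmt = conn.prepareStatement(
             "UPSERT INTO TRANSACTIONS (ID, PAYLOAD) VALUES (?, ?)")) { // table/columns are illustrative
      while (partition.hasNext()) {
        Transaction t = partition.next();
        stmt.setString(1, t.id);
        stmt.setString(2, t.payload);
        stmt.executeUpdate();
      }
      conn.commit(); // Phoenix has autocommit off by default, so commit explicitly
    }
  }
}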

Spark streaming performance degradation when upgrading from 1.3.0 to 1.6.0

I'm working with a Spark streaming application that reads Avro messages from Kafka and processes them. The streaming batch interval is 20 seconds.
The application ran on Spark 1.3.0 with a scheduling delay of 0 ms for each batch, but after upgrading to Spark 1.6.0 the scheduling delay goes up and processing a single batch takes longer.
The processing time increase coincides with the Spark version upgrade: the application runs with the same configuration and the same rate of incoming messages.
From the Spark web UI I can see that the operation that seems to take most of the time is a map over a DStream, which looks strange to me because it is not a particularly heavy operation.
Has anyone noticed the same issue after upgrading Spark and spark-streaming to 1.6.0?
Thanks in advance
