Spark Job Processing Time increases to 4s without explanation - apache-spark

We are running a cluster with 1 namenode and 3 datanodes on Azure, and I am running my Spark job on it in yarn-cluster mode.
We are using HDP 2.5, which ships with Spark 1.6.2. Now I have a very weird issue where the processing time of my job suddenly increases to 4s.
This has happened quite a few times but does not follow a pattern; sometimes the 4s processing time is there from the start of the job, and sometimes it appears in the middle of the job, as shown below.
One thing to note is that I have no events coming in to be processed, so technically the processing time should stay almost the same. Also, my Spark Streaming job has a batch duration of 1s, so it can't be that.
I don't have any errors in the logs or anywhere else, and I am at a loss as to how to track down this issue.
Minor details about the job:
I am reading messages from a Kafka topic and then storing them in HBase tables using the Phoenix JDBC connector.
EDIT: More Information
In InsertTransactionsPerRDDPartitions, I open a connection and perform the write to HBase using Phoenix JDBC connectivity.
updatedEventLinks.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // write each partition to HBase through the Phoenix JDBC connection
        rdd.foreachPartition(new InsertTransactionsPerRDDPartitions(this.prop));
        // second per-partition action, configured with the Kafka publishing properties
        rdd.foreachPartition(new DoSomethingElse(this.kafkaPublishingProps, this.prop));
    }
});
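
The question does not show InsertTransactionsPerRDDPartitions itself. A minimal sketch of what such a per-partition Phoenix writer might look like, assuming String events, a JDBC URL stored under a hypothetical phoenix.jdbc.url property, and a hypothetical TRANSACTIONS(ID, PAYLOAD) table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.Properties;
import java.util.UUID;

import org.apache.spark.api.java.function.VoidFunction;

// Hypothetical sketch only; the real class used in the job is not shown in the question.
public class InsertTransactionsPerRDDPartitions implements VoidFunction<Iterator<String>> {
    private final Properties prop;

    public InsertTransactionsPerRDDPartitions(Properties prop) {
        this.prop = prop;
    }

    @Override
    public void call(Iterator<String> partition) throws Exception {
        // One Phoenix JDBC connection per partition, opened on the executor.
        try (Connection conn = DriverManager.getConnection(prop.getProperty("phoenix.jdbc.url"), prop);
             PreparedStatement stmt = conn.prepareStatement(
                     "UPSERT INTO TRANSACTIONS (ID, PAYLOAD) VALUES (?, ?)")) {
            while (partition.hasNext()) {
                stmt.setString(1, UUID.randomUUID().toString());
                stmt.setString(2, partition.next());
                stmt.executeUpdate();
            }
            // Phoenix buffers upserts client-side until commit
            conn.commit();
        }
    }
}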

Related

Spark structured streaming job stuck for hours without getting killed

I have a structured streaming job which reads from Kafka, performs aggregations and writes to HDFS. The job is running in cluster mode on YARN. I am using Spark 2.4.
Every 2-3 days this job gets stuck. It doesn't fail, but it gets stuck at some microbatch; the microbatch doesn't even start. The driver keeps printing the following log line multiple times for hours:
Got an error when resolving hostNames. Falling back to /default-rack for all.
When I kill the streaming job and start it again, it runs fine again.
How can I fix this?
See this issue https://issues.apache.org/jira/browse/SPARK-28005
This is fixed in Spark 3.0. It seems that this happens because there are no active executors.
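If upgrading to Spark 3.0 is not an option right away, one possible mitigation (my assumption based on the "no active executors" explanation above, not something stated in the JIRA) is to make sure the job always keeps at least one executor, for example:

--conf spark.dynamicAllocation.minExecutors=1

or simply disable dynamic allocation and run with a fixed --num-executors.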

Apache Beam on Spark runner: why is the scheduling delay increasing in streaming jobs?

I defined a pipeline that reads from a Kafka topic, performs some steps and publishes the results to an output Kafka topic.
All was fine when I tested it in direct runner mode.
But when I submit the Beam application to Spark, I get a strange behaviour:
- Scheduling delay keeps increasing.
After a long time of investigation I figured out that the batch duration was too small (500 milliseconds).
Following this link
Pipeline options for the Spark Runner
I added this option to the spark-submit:
--batchIntervalMillis=2000
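
For context, a rough sketch of how the full command could look (the class, jar, and master below are placeholders; the Beam pipeline options go after the application jar):

spark-submit --class com.example.MyBeamPipeline \
  --master yarn --deploy-mode cluster \
  my-beam-pipeline.jar \
  --runner=SparkRunner \
  --batchIntervalMillis=2000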
Now everything is back to normal.
Don't hesitate to share your opinion.
Regards,
Ali

Spark streaming performance degradation when upgrading from 1.3.0 to 1.6.0

I'm working with a Spark Streaming application that reads Avro messages from Kafka and processes them. The batch interval of the streaming job is 20 seconds.
I had an application running on Spark 1.3.0 with a scheduling delay of 0 ms per batch, but now, after upgrading to Spark 1.6.0, I see that the scheduling delay goes up and the processing time of a single batch is longer.
The processing time increased with the Spark version upgrade, even though the application runs with the same configuration and the same rate of received messages.
From the Spark web UI I can see that the operation that seems to take a lot of time is a map over a DStream object. That looks strange to me, because it is not a particularly heavy operation.
Has anyone noticed the same issue after upgrading Spark and spark-streaming to 1.6.0?
Thanks in advance

Recovery after driver failure by exception with spark-streaming

We are currently working on a system using Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing as described in the programming guide [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function used to create the StreamingContext, we use createDirectStream to create our DStream, and from that point on we execute several transformations and actions that end in calls to saveToCassandra on different RDDs.
We are running different tests to establish how the application should recover when a failure occurs. Some key points about our scenario:
We are testing with a fixed number of records in Kafka (between 10 and 20 million); that means we consume from Kafka once and the application processes all the records it brought from Kafka.
We are executing the application with --deploy-mode 'client' inside one of the workers, which means that we stop and start the driver manually.
We are not sure how to handle exceptions after the DStreams have been created. For example, if all Cassandra nodes are down while writing, we get an exception that aborts the job, but after re-submitting the application that job is not re-scheduled, and the application keeps consuming from Kafka, producing multiple 'isEmpty' calls.
We made a couple of tests using 'cache' on the repartitioned RDD (which didn't work after any failure other than simply stopping and starting the driver), and changing the parameters "query.retry.count", "query.retry.delay" and "spark.task.maxFailures", without success; e.g. the job is still aborted after x failed attempts.
At this point we are confused about how we should use the checkpoint to re-schedule jobs after a failure.
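
For reference, the recovery pattern from the checkpointing section of the programming guide linked above relies on building the whole DAG inside a factory function passed to getOrCreate, so that after a driver restart the context and its pending jobs are rebuilt from the checkpoint instead of being recreated from scratch. A minimal Java sketch, with the checkpoint directory, Kafka parameters, topic and batch interval as placeholders (the Cassandra writes are only indicated by a comment):

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class RecoverableStreamingJob {
    // Placeholder checkpoint location; must be on a fault-tolerant filesystem such as HDFS.
    private static final String CHECKPOINT_DIR = "hdfs:///checkpoints/my-streaming-app";

    public static void main(String[] args) throws Exception {
        final SparkConf conf = new SparkConf().setAppName("recoverable-streaming-job");

        Function0<JavaStreamingContext> createContext = new Function0<JavaStreamingContext>() {
            @Override
            public JavaStreamingContext call() {
                JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10)); // placeholder batch interval
                ssc.checkpoint(CHECKPOINT_DIR);

                Map<String, String> kafkaParams =
                        Collections.singletonMap("metadata.broker.list", "broker:9092"); // placeholder brokers
                Set<String> topics = Collections.singleton("events");                    // placeholder topic

                JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                        ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                        kafkaParams, topics);

                // All transformations and the saveToCassandra actions must be defined here,
                // inside the factory, so they become part of the checkpointed DAG.

                return ssc;
            }
        };

        // On a clean start the factory is invoked; after a driver restart the context,
        // the DStream lineage and any unfinished batches are recovered from CHECKPOINT_DIR
        // and the factory is not called again.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, createContext);
        jssc.start();
        jssc.awaitTermination();
    }
}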

Spark task deserialization time

I'm running a Spark SQL job, and when looking at the master UI, the task deserialization time can take 12 seconds while the compute time is 2 seconds.
Let me give some background:
1- The task is simple: run a query in a PostgreSQL DB and count the results in Spark.
2- The deserialization problem comes when running on a cluster with 2+ workers (one of them the driver) and shipping tasks to the other worker.
3- I have to use the JDBC driver for Postgres, and I run each job with spark-submit.
My questions:
Am I submitting the packaged jars as part of the job every time, and is that the reason for the huge task deserialization time? If so, how can I ship everything to the workers once so that subsequent jobs already have everything they need there?
Is there a way to keep the SparkContext alive between jobs (spark-submit invocations) so that the deserialization time is reduced?
In any case, anything that helps avoid paying the deserialization time every time I run a job on the cluster would be appreciated.
Thanks for your time,
Cheers
As far as I know, YARN supports caching application jars so that they are accessible each time an application runs: please refer to the property spark.yarn.jar.
To share a SparkContext between jobs and avoid the overhead of initializing it, there is the spark-jobserver project.
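
For example, with Spark 1.x on YARN the assembly jar can be uploaded to HDFS once and referenced through spark.yarn.jar, so it is not shipped again on every spark-submit (the paths and jar name below are placeholders):

# upload the Spark assembly once
hdfs dfs -mkdir -p /apps/spark
hdfs dfs -put /usr/hdp/current/spark-client/lib/spark-assembly.jar /apps/spark/

# then point spark.yarn.jar at it, e.g. in spark-defaults.conf
spark.yarn.jar hdfs:///apps/spark/spark-assembly.jar

Note that this mainly avoids re-shipping the Spark assembly; the application jar itself is still distributed per submission, but it is usually much smaller.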
