Spark stateSnapshots() not working with saveAsHadoopFiles - apache-spark

In my Spark Streaming 1.6 application, I want to store certain values in mapWithState and then periodically save them to disk as a backup.
JavaMapWithStateDStream<String, SwMessage, CcState, Tuple2<String, CcState>> SwMessageWithState =
    pairSwMsg.mapWithState(StateSpec.function(mappingFunc).initialState(cStateMap));
For backing up I am using the stateSnapshots() method as follows.
SwMessageWithState.stateSnapshots()
    .saveAsHadoopFiles("/ccd/snap", "txt", String.class, CcState.class, TextOutputFormat.class);
The problem I am facing is that the program stops consuming messages after the first batch and then does nothing.
If I comment the stateSnapshots() line the program works fine.
Can somebody suggest what exactly is wrong with the above statement?
Also, I was thinking of using the snapshot directory as the initialState the next time the Spark Streaming job runs.

Related

Is there a way to let the Spark Streaming application quit when a job is aborted

I have submitted a Spark Streaming application to YARN.
When one job fails, the following jobs continue to execute.
Is there a way to make the whole application exit when one job fails?
In my case the data must be processed in sequence and we should not skip any data. If we find an error, we need to stop the application and troubleshoot instead of continuing.
First things first, we have to make sure Spark Streaming stops gracefully; for that, set the spark.streaming.stopGracefullyOnShutdown parameter to true (the default is false).
Then you can throw an exception from the code responsible for the failure and let it bubble up to the main/driver: surround the main body with a try/catch and, from inside the catch, call ssc.stop(true, true).
Another way is to wrap the code responsible for the failure in a try/catch, and from inside the catch block create a marker file in persistent storage (HDFS, S3, or whatever Spark is associated with); keep checking for that file from the driver, and whenever the marker file is present, delete it and call ssc.stop(true, true).
An example can be found at
https://github.com/lanjiang/streamingstopgraceful/blob/master/src/main/scala/com/cloudera/ps/GracefulShutdownExample.scala
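Below is a minimal, hedged sketch of the first approach. The socket source, the 30-second batch interval, and the looksCorrupt check are placeholders; the pattern relies on errors from a failed output operation generally surfacing through awaitTermination, so the catch block can stop the context gracefully.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StopOnFailure {
  // hypothetical check standing in for the real validation of a batch
  def looksCorrupt(rdd: RDD[String]): Boolean = false

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("stop-on-failure")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(30))

    try {
      val stream = ssc.socketTextStream("localhost", 9999) // placeholder source
      stream.foreachRDD { rdd =>
        // fail loudly instead of silently skipping data
        if (looksCorrupt(rdd)) throw new IllegalStateException("bad batch, stopping")
      }
      ssc.start()
      ssc.awaitTermination() // rethrows the error from a failed batch
    } catch {
      case e: Exception =>
        // stop the StreamingContext (and SparkContext) gracefully, then re-raise
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        throw e
    }
  }
}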

Best approach to check if Spark streaming jobs are hanging

I have a Spark streaming application which basically gets a trigger message from Kafka that kicks off batch processing which could potentially take up to 2 hours.
There have been incidents where some of the jobs hung indefinitely and didn't complete within the usual time, and currently there is no way to figure out the status of a job without checking the Spark UI manually. I want a way to tell whether the currently running Spark jobs are hanging or not. Basically, if a job has been hanging for more than 30 minutes, I want to notify the users so they can take action. What options do I have?
I see I can use metrics from the driver and executors. If I were to choose the most important one, it would be the last received batch records. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means that the Spark streaming job has been stopped or failed.
But in my scenario, I will receive only 1 streaming trigger event and then it will kick start the processing which may take up to 2 hours so I won't be able to rely on the records received.
Is there a better way? TIA
YARN provides a REST API to check the status of applications as well as cluster resource utilization.
An API call will give a list of running applications along with their start times and other details. You can have a simple REST client that triggers, say, once every 30 minutes, checks whether the job has been running for more than 2 hours, and if so sends a simple mail alert.
Here is the API documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
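As a rough sketch of such a client against the Cluster Applications API (the ResourceManager host, the regex-based parsing, and the println alert are assumptions; a real client would use a JSON library and an actual mail sender):

import scala.io.Source

object YarnLongRunningCheck {
  val TwoHoursMs = 2L * 60 * 60 * 1000
  // assumed ResourceManager address; ?states=RUNNING filters to running apps
  val RmUrl = "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING"

  def main(args: Array[String]): Unit = {
    val json = Source.fromURL(RmUrl).mkString
    // crude extraction of the per-app "elapsedTime" (milliseconds) fields
    val elapsed = """"elapsedTime"\s*:\s*(\d+)""".r
      .findAllMatchIn(json)
      .map(_.group(1).toLong)
    if (elapsed.exists(_ > TwoHoursMs)) {
      // replace with a real mail alert
      println("ALERT: an application has been running for more than 2 hours")
    }
  }
}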
Maybe a simple solution like the following.
At the start of the processing, launch a waiting thread:
val TWO_HOURS = 2 * 60 * 60 * 1000

val t = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      Thread.sleep(TWO_HOURS)
      // still here after 2 hours: send an email that the job didn't end
    } catch {
      case _: InterruptedException => // processing finished in time, do nothing
    }
  }
})
t.start()
And in the place where you can tell that batch processing has ended:
t.interrupt()
If processing is done within 2 hours, the waiting thread is interrupted and the e-mail is not sent. If processing is not done, the e-mail will be sent.
Let me draw your attention to Streaming Query Listeners. These are lightweight hooks that can monitor your streaming query's progress.
In an application that has multiple queries, you can figure out which queries are lagging or have stopped due to some exception.
Please find below sample code to understand the implementation. I hope you can adapt this piece to better suit your needs. Thanks!
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

var recordsReadCount = 0L // running total of input rows for the query of interest

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason for termination
    }
  }
})
I'm using Kubernetes currently with the Google Spark Operator. [1]
Some of my streaming jobs hang while using Spark 2.4.3: a few tasks fail, then the current batch job never progresses.
I have set a timeout using a StreamingProgressListener so that a thread signals when no new batch is submitted for a long time. The signal is then forwarded to a Pushover client that sends a notification to an Android device. Then System.exit(1) is called. The Spark Operator will eventually restart the job.
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
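A rough sketch of that kind of watchdog for a DStream job, using the built-in StreamingListener (the 60-second poll, the staleness threshold, and exiting with status 1 are assumptions; the notification call is left as a comment):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchWatchdog(ssc: StreamingContext, staleAfterMs: Long) extends StreamingListener {
  private val lastCompleted = new AtomicLong(System.currentTimeMillis())

  // remember when the last batch finished
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit =
    lastCompleted.set(System.currentTimeMillis())

  def start(): Unit = {
    ssc.addStreamingListener(this)
    val checker = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          Thread.sleep(60 * 1000)
          if (System.currentTimeMillis() - lastCompleted.get() > staleAfterMs) {
            // send the notification (e-mail, Pushover, ...) here, then let the
            // operator / scheduler restart the application
            System.exit(1)
          }
        }
      }
    })
    checker.setDaemon(true)
    checker.start()
  }
}

Registering it before ssc.start(), e.g. new BatchWatchdog(ssc, 30 * 60 * 1000L).start(), keeps the check independent of the batches themselves.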
One way is to monitor the output of the Spark job that was kick-started. For example:
If it writes to HDFS, monitor the HDFS output directory for the last modified file timestamp or the file count generated (a sketch follows this list).
If it writes to a Database, you could have a query to check the timestamp of the last record inserted into your job output table.
If it writes to Kafka, you could use Kafka GetOffsetShell to get the output topic's current offset.
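For the HDFS case, a minimal sketch, assuming a hypothetical output path and a 30-minute staleness threshold:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OutputFreshnessCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val outputDir = new Path("/data/job-output") // hypothetical output directory
    val files = fs.listStatus(outputDir)
    if (files.nonEmpty) {
      val newestMs = files.map(_.getModificationTime).max
      val staleMs = System.currentTimeMillis() - newestMs
      if (staleMs > 30 * 60 * 1000) {
        // replace with a real notification
        println(s"ALERT: no new output for ${staleMs / 60000} minutes, the job may be hanging")
      }
    }
  }
}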
Utilize TaskContext.
This provides contextual information for a task, and supports adding listeners for task completion/failure (see addTaskCompletionListener).
More detailed information such as the task 'attemptNumber' or 'taskMetrics' is available as well.
This information can be used by your application during runtime to determine if there is a 'hang' (depending on the problem).
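As a rough illustration (rdd here stands for any RDD in the job, the logging is illustrative, and the typed addTaskCompletionListener shown is the Spark 2.4+ signature):

import org.apache.spark.TaskContext

rdd.foreachPartition { _ =>
  val ctx = TaskContext.get()
  val startedMs = System.currentTimeMillis()
  println(s"partition ${ctx.partitionId()}, attempt ${ctx.attemptNumber()}: started")
  ctx.addTaskCompletionListener[Unit] { tc =>
    // fires when the task completes (or fails), so slow tasks show up in executor logs
    val secs = (System.currentTimeMillis() - startedMs) / 1000
    println(s"partition ${tc.partitionId()}: finished after ${secs}s")
  }
  // ... the actual partition processing would go here ...
}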
More information about what is 'hanging' would be useful in providing a more specific solution.
I had a similar scenario to deal with about a year ago and this is what I did -
As soon as Kafka receives a message, the Spark streaming job picks up the event and starts processing.
The Spark streaming job sends an alert email to the support group saying "Event received and Spark transformation STARTED". The start timestamp is stored.
After the Spark processing/transformations are done, it sends an alert email to the support group saying "Spark transformation ENDED successfully". The end timestamp is stored.
The above two steps help the support group track whether the success email arrives after the start email; if it doesn't, they can investigate via the Spark UI for job failures or delayed processing (maybe the job is hung due to resource unavailability for a long time).
Finally, store the event ID or details in an HDFS file along with the start and end timestamps, and save this file to the HDFS path that some Hive log table points to. This is helpful as a future reference for how the Spark code performs over time, and it can be fine-tuned if required.
Hope this is helpful.

How to debug a slow PySpark application

There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use the Spark UI to see the execution plan of your jobs and the time spent in each phase, then optimize your operations using those statistics. Here is a very good presentation about monitoring Spark apps using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski).
Any job troubleshooting should have the below steps.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From the Hadoop admin perspective, basic troubleshooting for a long-running Spark job: go to RM > Application ID.
a) Check for AM & non-AM preemption. This can happen if more memory than required is assigned to either the driver or the executors, which can then get preempted for a higher-priority job/YARN queue.
b) Click on the AppMaster URL. Review the environment variables.
c) Check the Jobs section and review the event timeline. Check whether executors are started immediately after the driver or are taking time.
d) If the driver process is taking time, see if collect()/collectAsList() is running on the driver, as these methods tend to take time since they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If there is no issue in the event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any data-skew issue.
f) If all tasks are complete and the Spark job is still running, go to the Executor page > driver process thread dump > search for the driver, and look at the operation the driver is working on. Below are the NameNode operation methods we may see there (if any):
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()

Apache Spark streaming - Timeout long-running batch

I'm setting up an Apache Spark long-running streaming job to perform (non-parallelized) streaming using InputDStream.
What I'm trying to achieve is that when a batch on the queue takes too long (based on a user defined timeout), I want to be able to skip the batch and abandon it completely - and continue the rest of execution.
I wasn't able to find a solution to this problem within the Spark API or online -- I looked into using StreamingContext's awaitTerminationOrTimeout, but this kills the entire StreamingContext on timeout, whereas all I want to do is skip/kill the current batch.
I also considered using mapWithState, but this doesn't seem to apply to this use case. Finally, I was considering setting up a StreamingListener, starting a timer when the batch starts, and then having the batch stopped/skipped/killed when it reaches a certain timeout threshold, but there still doesn't seem to be a way to kill the batch.
Thanks!
I've seen some docs from Yelp on this, but I haven't done it myself.
Using updateStateByKey(update_func) or mapWithState(stateSpec):
Attach a timeout when events are first seen and the state is initialized.
Drop the state if it expires.
def update_function(new_events, current_state):
    if current_state is None:
        current_state = init_state()
        attach_expire_datetime(new_events)
    ......
    if is_expired(current_state):
        return None  # returning None drops the state
    if new_events:
        apply_business_logic(new_events, current_state)
    return current_state  # keep (or update) the state otherwise
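For the mapWithState variant specifically, the idle timeout can be declared on the StateSpec itself rather than hand-rolled; a minimal Scala sketch (the key/value types, the running count, and the 30-minute timeout are illustrative):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// keep a running count per key; entries idle for 30 minutes are timed out
def mappingFunc(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    None // the state is being removed because the key was idle too long
  } else {
    val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
    state.update(sum)
    Some((key, sum))
  }
}

val spec = StateSpec.function(mappingFunc _).timeout(Minutes(30))
val stateful = pairs.mapWithState(spec) // pairs: an assumed DStream[(String, Long)]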
It looks like Structured Streaming watermarks similarly drop events when they time out, which might apply if your goal is to drop timed-out jobs/stages.

Replaying an RDD in spark streaming to update an accumulator

I am actually running out of options.
In my Spark streaming application, I want to keep state for some keys. I am getting events from Kafka, and I extract a key from each event, say userID. When there are no events coming from Kafka, I still want to keep updating a counter for each userID every 3 seconds, since I configured the batch duration of my StreamingContext to 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my Spark streaming context as follows:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for (i <- 1 to 100) {
  rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
  Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This lets me access my accumulableCollection and update all the counters for all userIDs. Up to now everything works fine, however when I change my loop from:
for (i <- 1 to 100) {} // this is for testing
to:
while (true) {} // this is to let me access and update my accumulator through the whole application life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this? Is there a straightforward way that would allow me to update the values of my userIDs (rather than creating a useless RDD and pushing it periodically to the queue stream)?
The reason why the while (true) ... version does not work is that control never returns to the main execution line, and therefore nothing below that line gets executed. To solve that specific problem, execute the while loop in a separate thread; Future { while (...) ... } should probably work.
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed; Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
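A minimal sketch of that ConstantInputDStream approach, reusing the ssc from the question (the accumulator update is left as a placeholder):

import org.apache.spark.streaming.dstream.ConstantInputDStream

// replay the same one-element RDD on every batch interval as a "tick"
val tickRDD = ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
val ticks = new ConstantInputDStream(ssc, tickRDD)
ticks.foreachRDD { _ =>
  // UPDATE_MY_ACCUMULATOR: update the counters for all userIDs here
}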
That said, it looks to me that the current approach seems fragile and would need revision.
