Replaying an RDD in spark streaming to update an accumulator - apache-spark

I am actually running out of options.
In my spark streaming application. I want to keep a state on some keys. I am getting events from Kafka. Then I extract keys from the event, say userID. When there is no events coming from Kafka I want to keep updating a counter relative to each user ID each 3 seconds, since I configured the batchduration of my StreamingContext with 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my spark streaming context as the following:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for ( i <- 1 to 100) {
rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This would let me access to my accumulatorCollection and update all the counters of all userIDs. Up to now everything works fine, however when I change my loop from:
for ( i <- 1 to 100) {} #This is for test
To:
while (true) {} #This is to let me access and update my accumulator through the whole application life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this ? Is there a pretty straightforward way that would allow me updating the values of my userIDs (rather than creating an unuseful RDD and pushing it periodically to the queuestream)?

The reason why the while (true) ... version does not work is that the control never returns to the main execution line and therefore nothing below that line gets executed. To solve that specific problem, we should execute the while loop in a separate thread. Future { while () ...} should probably work.
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed. Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
That said, it looks to me that the current approach seems fragile and would need revision.

Related

Spark and isolating time taken for tasks

I recently began to use Spark to process huge amount of data (~1TB). And have been able to get the job done too. However I am still trying to understand its working. Consider the following scenario:
Set reference time (say tref)
Do any one of the following two tasks:
a. Read large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs (OR)
b. Read data as above and do additional preprossing work and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable and time difference wrt to tref (ie, t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = time required for preprocessing? Where can I find an reliable reference for the same?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy computation approach and its capability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that can be computed while thousands of input files are being read? The origin of the suspicion is in the unbelievable performance (with results verified okay) I see that look too fantastic to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
val start = System.currentTimeMillis
val res = op
val end = System.currentTimeMillis
(end - start) / 1000f
}
def XYZ = {
val r00 = sc.parallelize(0 to 999999)
val r01 = r00.map(x => (x,(x,x,x,x,x,x,x)))
r01.join(r01).count()
}
val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with Actions that cause actual execution. This has limitations thus. Lazy evaluation and such.
On the other hand, Spark Web UI records every Action, and records Stage duration for the Action.
In general: performance measuring in shared environments is difficult. Dynamic allocation in Spark in a noisy cluster means that you hold on to acquired resources during the Stage, but upon successive runs of the same or next Stage you may get less resources. But this is at least indicative and you can run in a less busy period.

Best approach to check if Spark streaming jobs are hanging

I have Spark streaming application which basically gets a trigger message from Kafka which kick starts the batch processing which could potentially take up to 2 hours.
There were incidents where some of the jobs were hanging indefinitely and didn't get completed within the usual time and currently there is no way we could figure out the status of the job without checking the Spark UI manually. I want to have a way where the currently running spark jobs are hanging or not. So basically if it's hanging for more than 30 minutes I want to notify the users so they can take an action. What all options do I have?
I see I can use metrics from driver and executors. If I were to choose the most important one, it would be the last received batch records. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means that Spark streaming job has been stopped or failed.
But in my scenario, I will receive only 1 streaming trigger event and then it will kick start the processing which may take up to 2 hours so I won't be able to rely on the records received.
Is there a better way? TIA
YARN provides the REST API to check the status of application and status of cluster resource utilization as well.
with API call it will give a list of running applications and their start times and other details. you can have simple REST client that triggers maybe once in every 30 min or so and check if the job is running for more than 2 hours then send a simple mail alert.
Here is the API documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
Maybe a simple solution like.
At the start of the processing - launch a waiting thread.
val TWO_HOURS = 2 * 60 * 60 * 1000
val t = new Thread(new Runnable {
override def run(): Unit = {
try {
Thread.sleep(TWO_HOURS)
// send an email that job didn't end
} catch {
case _: Exception => _
}
}
})
And in the place where you can say that batch processing is ended
t.interrupt()
If processing is done within 2 hours - waiter thread is interrupted and e-mail is not sent. If processing is not done - e-mail will be sent.
Let me draw your attention towards Streaming Query listeners. These are quite amazing lightweight things that can monitor your streaming query progress.
In an application that has multiple queries, you can figure out which queries are lagging or have stopped due to some exception.
Please find below sample code to understand its implementation. I hope that you can use this and convert this piece to better suit your needs. Thanks!
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(event: QueryStartedEvent) {
//logger message to show that the query has started
}
override def onQueryProgress(event: QueryProgressEvent) {
synchronized {
if(event.progress.name.equalsIgnoreCase("QueryName"))
{
recordsReadCount = recordsReadCount + event.progress.numInputRows
//Logger messages to show continuous progress
}
}
}
override def onQueryTerminated(event: QueryTerminatedEvent) {
synchronized {
//logger message to show the reason of termination.
}
}
})
I'm using Kubernetes currently with the Google Spark Operator. [1]
Some of my streaming jobs hang while using Spark 2.4.3: few tasks fail, then the current batch job never progresses.
I have set a timeout using a StreamingProgressListener so that a thread signals when no new batch is submitted for a long time. The signal is then forwarded to a Pushover client that sends a notification to an Android device. Then System.exit(1) is called. The Spark Operator will eventually restart the job.
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
One way is to monitor the output of the spark job that was kick started. Generally, for example,
If it writes to HDFS, monitor the HDFS output directory for last modified file timestamp or file count generated
If it writes to a Database, you could have a query to check the timestamp of the last record inserted into your job output table.
If it writes to Kafka, you could use Kafka GetOffsetShell to get the output topic's current offset.
Utilize
TaskContext
This provides contextual information for a task, and supports adding listeners for task completion/failure (see addTaskCompletionListener).
More detailed information such as the task 'attemptNumber' or 'taskMetrics' is available as well.
This information can be used by your application during runtime to determine if their is a 'hang' (depending on the problem)
More information about what is 'hanging' would be useful in providing a more specific solution.
I had a similar scenario to deal with about a year ago and this is what I did -
As soon as Kafka receive's message, spark streaming job picks up the event and start processing.
Spark streaming job sends an alert email to Support group saying "Event Received and spark transformation STARTED". Start timestamp is stored.
After spark processing/transformations are done - sends an alert email to Support group saying "Spark transformation ENDED Successfully". End timestamp is stored.
Above 2 steps will help support group to track if spark processing success email is not received after it's started and they can investigate by looking at spark UI for job failure or delayed processing (maybe job is hung due to resource unavailability for a long time)
At last - store event id or details in HDFS file along with start and end timestamp. And save this file to the HDFS path where some hive log_table is pointing to. This will be helpful for future reference to how spark code is performing over the period time and can be fine tuned if required.
Hope this is helpful.

Is there a way to dynamically stop Spark Structured Streaming?

In my scenario I have several dataSet that comes every now and then that i need to ingest in our platform. The ingestion processes involves several transformation steps. One of them being Spark. In particular I use spark structured streaming so far. The infrastructure also involve kafka from which spark structured streaming reads data.
I wonder if there is a way to detect when there is nothing else to consume from a topic for a while to decide to stop the job. That is i want to run it for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of spark.
Hence is there any timeout or something that can be used to detect that there is no more data coming it and that everything has be processed.
Thank you
Structured Streaming Monitoring Options
You can use query.lastProgress to get the timestamp and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
Putting together a couple pieces of advice:
As #Michael West pointed out, there are listeners to track progress
From what I gather, Structured Streaming doesn't yet support graceful shutdown
So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):
// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))
// your job code starts here by calling "start()" on the stream...
// periodically await termination, checking for your shutdown state
while(!spark.sparkContext.isStopped) {
if (shutdown) {
println(s"Shutting down since first batch has completed...")
spark.streams.active.foreach(_.stop())
spark.stop()
} else {
// wait 10 seconds before checking again if work is complete
spark.streams.awaitAnyTermination(10000)
}
}
Your listener can dynamically shutdown in a variety of ways. For instance, if you're only waiting on a single batch, then just shutdown after the first update:
var shutdown = false
def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
override def onQueryStarted(_: QueryStartedEvent): Unit = println("Query started: " + queryStarted.id)
override def onQueryTerminated(_: QueryTerminatedEvent): Unit = println("Query terminated! " + queryTerminated.id)
override def onQueryProgress(_: QueryProgressEvent): Unit = shutdown = true
}
Or, if you need to shutdown after more complicated state changes, you could parse the json body of the queryProgress.progress to determine whether or not to shutdown at a given onQueryUpdate event firing.
You can probably use this:-
def stopStreamQuery(query: StreamingQuery, awaitTerminationTimeMs: Long) {
while (query.isActive) {
try{
if(query.lastProgress.numInputRows < 10){
query.awaitTermination(1000)
}
}
catch
{
case e:NullPointerException => println("First Batch")
}
Thread.sleep(500)
}
}
You can make a numInputRows variable.

synchronize(wait/notify) pattern in multiple streams in Spark

I have two Spark streams running in my application. At some point i need to see if first stream has created a table , so that i can use that Table in second stream.
I am using Accumulator as an indicator. So first stream updates the value of this accumulator after doing its job, and then the second stream executes its logic if the accumulator value has changed
dStream1.foreachRdd(rdd -> {
--creates ABC Sql table --
accumulator1.setValue(1);
});
dStream2.foreachRdd(rdd -> {
if(accumulator.value == 1){
--uses ABC Sql table--
}
});
so far it works fine, as dStream2 keeps on running the foreachRdd loop , and when it finds the accumulator value to be 1, it executes the logic.
But i want to find out more efficient way in which dStream2 waits until the value of accumulator changes.
Is it possible to do wait-notify pattern in Spark?
If we are running multiple streams, then the "foreachRDD" is executed one at a time. So they will not conflict with each other if they are sharing any resource or working on a common object.
Even if you want to use "spark.streaming.concurrentJobs" to run the streams or jobs parallely, you can manage the concurrency by using "java.util.concurrent.locks" package.

Using Java 8 parallelStream inside Spark mapParitions

I am trying to understand the behavior of Java 8 parallel stream inside spark parallelism. When I run the below code, I am expecting the output size of listOfThings to be the same as input size. But that's not the case, I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine. Count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
List listOfThings = IteratorUtils.toList(iterator);
return listOfThings.parallelStream.map(
//some stuff here
).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
it's a very good question.
Whats going on here is Race Condition. when you parallelize the stream then stream split the full list into several equal parts [Based on avaliable threads and size of list] then it tries to run subparts independently on each avaliable thread to perform the work.
But you are also using apache spark which is famous for computing the work faster i.e. general purpose computation engine. Spark uses the same approach [parallelize the work] to perform the action.
Now Here in this Scenerio what is happening is Spark already parallelized the whole work then inside this you are again parallelizing the work due to this the race condition starts i.e. spark executor starts processing the work and then you parallelized the work then stream process aquires other thread and start processing IF THE THREAD THAT WAS PROCESSING STREAM WORK FINISHES WORK BEFORE THE SPARK EXECUTOR COMPLETE HIS WORK THEN IT ADD THE RESULT OTHERWISE SPARK EXECUTOR CONTINUES TO REPORT RESULT TO MASTER.
This is not a good approach to re-parallelize the work it will always gives you the pain let the spark do it for you.
Hope you understand whats going on here
Thanks

Resources