How to parallelise Spark to Kafka writes

The Kafka producer returns a Java Future (or you can use the callback). In my Spark job I want to make sure that messages are actually sent, and that sending is fast.
Using:
rdd.foreach { msg =>
  kafkaProducer.send(msg).get() // wait for the future to complete
}
does not perform well at all.
I was thinking of using
rdd.repartition(20).foreachPartition { iterator =>
  iterator.foreach { msg =>
    kafkaProducer.send(msg).get()
  }
}
My question is though: will foreachPartition run in parallel? It's hard to tell from the simple test I've written, since the contents of foreachPartition appear to run on the same thread. But I don't know whether that depends on my test setup...
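One pattern that avoids blocking on every record is to send asynchronously inside foreachPartition and flush once per partition. A minimal sketch, assuming String messages; producerProps and the topic name "myTopic" are placeholders, not from the question:
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

rdd.repartition(20).foreachPartition { iterator =>
  // build the producer on the executor: KafkaProducer is not serializable
  val producer = new KafkaProducer[String, String](producerProps)
  iterator.foreach { msg =>
    // send() is asynchronous; the producer batches records internally
    producer.send(new ProducerRecord[String, String]("myTopic", msg))
  }
  // block once per partition until all buffered records are acknowledged
  producer.flush()
  producer.close()
}
Each partition is processed as a separate task, so with enough executor cores the 20 partitions do run in parallel, while the per-record blocking disappears.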

Related

using Kafka Consumer in Node JS app to indicate computations have been made

So my question may involve some brainstorming based on the nature of the application.
I have a Node JS app that sends messages to Kafka. For example, every single time a user clicks on a page, a Kafka app runs a computation based on the visit. I then, in the same request, want to retrieve this computation after triggering it through my Kafka message. So far, this computation is stored in a Cassandra database. The problem is that if we try to read from Cassandra before the computation is complete, we will query nothing from the database (the key has not been inserted yet) and get nothing back (an error), or possibly the computation is stale. This is my code so far.
router.get('/:slug', async (req, res) => {
  const Producer = kafka.Producer;
  const KeyedMessage = kafka.KeyedMessage;
  const client = new kafka.KafkaClient();
  const producer = new Producer(client);
  const km = new KeyedMessage('key', 'message');
  const kafka_message = JSON.stringify({ id: req.session.session_id.toString(), url: arbitrary_url });
  const payloads = [
    { topic: 'MakeComputationTopic', messages: kafka_message }
  ];
  // send the message that triggers the computation
  producer.send(payloads, (err, data) => {
    if (err) console.log('Could not send message', err);
  });
  const clientCass = new cassandra.Client({
    contactPoints: ['127.0.0.1:9042'],
    localDataCenter: 'datacenter1',
    keyspace: 'computation_space',
    authProvider: new auth.PlainTextAuthProvider('cassandra', 'cassandra')
  });
  const query = 'SELECT * FROM computation WHERE id = ?';
  clientCass.execute(query, [req.session.session_id], { hints: ['int'] })
    .then(result => console.log('User with computation %s', result.rows[0].computations))
    .catch(() => {
      console.log('Could not find key');
    });
});
Firstly, async and await came to mind, but that is ruled out since it does not prevent stale computations.
Secondly, I looked into letting my application sleep, but it seems that this would slow my application down.
I am possibly deciding on using a Kafka consumer (in my Node.js app) to consume a message indicating that it is now safe to look into the Cassandra table.
For example (using kafka-node):
consumer.on('message', function (message) {
  clientCass.execute(query, [req.session.session_id], { hints: ['int'] })
    .then(result => console.log('User with computation %s', result.rows[0].computations))
    .catch(() => {
      console.log('Could not find key');
    });
});
This approach, while better, seems a bit off, since I would have to create a consumer every time a user clicks on a page, and I only care about receiving a single message.
I was wondering how I should deal with this challenge? Am I possibly missing a scenario, or is there a way to use kafka-node to solve this problem? I was also thinking of doing a while loop that waits for the promise to succeed and checks that the computations are not stale (by comparing values in the cache).
This approach, while better, seems a bit off, since I would have to create a consumer every time a user clicks on a page, and I only care about receiving a single message.
I would come to the same conclusion. Cassandra is not designed for this kind of use case. The database is eventually consistent. Your current approach may work at the moment, if you hack something together, but it will definitely result in undefined behaviour once you have a Cassandra cluster, especially when you update an entry.
The id in the computation table is your partition key, so once you have a cluster, Cassandra distributes the data by id. It looks like each partition contains only one row, which is a very inefficient way of modelling your Cassandra tables.
Your use case looks like one for a session storage or cache. Redis or LevelDB are well suited for this kind of use case. Any other key-value store would do the job too.
Why don't you write your result into another topic and have another application read that topic and write the result into the database? That way you don't need to keep any state: the result will be in the topic when it is done. It would look like this:
incoming data -> first kafka topic -> computational app -> second kafka topic -> another app writing it into the database <- another app regularly reading the data.
If the result is not there yet, the computation is simply not done yet.
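The app reading the second topic can be written in any language; here is a minimal JVM-side sketch with the plain Kafka consumer API (the topic name results-topic, the group id, and the localhost broker are assumptions):
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "computation-readers")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("results-topic")) // hypothetical second topic

while (true) {
  // each record is a finished computation; this is where you would write it to the database
  for (record <- consumer.poll(Duration.ofMillis(1000)).asScala) {
    println(s"computation ${record.key} done: ${record.value}")
  }
}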

Returning results from executor to driver

I have a Spark application which basically takes in a big dataset, performs some computations over it, and finally does some IO to store it in a database. All of these stages happen on executors, and the driver gets (collects) a boolean from each task, representing the success/failure status of that task (e.g. computation or IO may fail for some items).
E.g., the following is an over-simplified lineage (in the actual implementation, there are multiple repartitioning and computation steps):
readSomeDataset()
  .repartition()
  .mapPartitions { /* do some calculation */ }
  .mapPartitions { /* do some IO */ }
  .collect()
Problem:
Based on the result of the computations, I would like to do something else on the driver (like publishing a message saying "computation was success"). This needs to be done once for the entire dataset, not per partition, and thus needs to happen on the driver.
However, the IO on executors takes a long time, and I do not want to wait for that to finish before publishing.
Is there a way for the executors to send a 'message' back to the driver while in middle of processing the tasks?
(Something like Accumulators comes to mind; however, afaik they are usable only once the final action finishes on the executors.)
Spark is a lazy framework and needs a complete job (from reading to writing) to execute; it can't execute only part of the lineage.
To avoid reprocessing, you can cache the DataFrame so it can be recovered as quickly as possible, something like this:
val calculatedDF = readSomeDataset()
  .repartition()
  .mapPartitions { /* do some calculation */ }
  .cache() // or persist if it can't fit in the executors' memory

// a condition to check the calculations, and an action (reduce) to launch it
if (calculatedDF.map(checkEachIsOK).reduce(_ && _)) {
  println("correct calculation")
  calculatedDF
    .mapPartitions { /* do some IO */ }
    .collect()
} else {
  println("incorrect calculation")
}
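If the goal is to publish the success message without waiting for the IO stage, one option (my sketch, not part of the answer above; publishMessage and doSomeIO are hypothetical helpers) is to kick off the IO job from a background thread on the driver once the check has passed, since Spark allows job submission from multiple driver threads:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

if (calculatedDF.map(checkEachIsOK).reduce(_ && _)) {
  // the calculation is already verified and cached, so publish right away
  publishMessage("computation was success") // hypothetical helper
  // run the slow IO stage asynchronously; the driver thread is free meanwhile
  Future {
    calculatedDF.mapPartitions(iter => doSomeIO(iter)).collect()
  }
}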

Is there a way to dynamically stop Spark Structured Streaming?

In my scenario I have several datasets that arrive every now and then and that I need to ingest into our platform. The ingestion process involves several transformation steps, one of them being Spark; in particular, I use Spark Structured Streaming so far. The infrastructure also involves Kafka, from which Spark Structured Streaming reads data.
I wonder if there is a way to detect when there has been nothing left to consume from a topic for a while, in order to decide to stop the job. That is, I want to run it for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of Spark.
Hence, is there any timeout or similar mechanism that can be used to detect that no more data is coming in and that everything has been processed?
Thank you
Structured Streaming Monitoring Options
You can use query.lastProgress to get the timestamp and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
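A sketch of that idea (the five-minute idle threshold is an assumption): StreamingQueryProgress exposes an ISO-8601 timestamp and numInputRows, so you can stop the query once it has been idle long enough:
import java.time.{Duration, Instant}

val progress = query.lastProgress // null until the first batch completes
if (progress != null) {
  val idleFor = Duration.between(Instant.parse(progress.timestamp), Instant.now())
  // assume the topic is drained after 5 idle minutes with no input rows
  if (progress.numInputRows == 0 && idleFor.toMinutes >= 5) {
    query.stop()
  }
}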
Putting together a couple of pieces of advice:
As @Michael West pointed out, there are listeners to track progress
From what I gather, Structured Streaming doesn't yet support graceful shutdown
So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):
// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))

// your job code starts here by calling "start()" on the stream...

// periodically await termination, checking for your shutdown state
while (!spark.sparkContext.isStopped) {
  if (shutdown) {
    println(s"Shutting down since first batch has completed...")
    spark.streams.active.foreach(_.stop())
    spark.stop()
  } else {
    // wait 10 seconds before checking again if work is complete
    spark.streams.awaitAnyTermination(10000)
  }
}
Your listener can dynamically shut down in a variety of ways. For instance, if you're only waiting on a single batch, then just shut down after the first update:
var shutdown = false

def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = println("Query started: " + event.id)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = println("Query terminated! " + event.id)
  override def onQueryProgress(event: QueryProgressEvent): Unit = shutdown = true
}
Or, if you need to shut down after more complicated state changes, you could parse the JSON body of queryProgress.progress to determine whether or not to shut down each time onQueryProgress fires.
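For instance (a sketch; stopping on a zero-row batch is just one possible trigger), the progress event exposes numInputRows directly, so no JSON parsing is strictly required:
override def onQueryProgress(event: QueryProgressEvent): Unit = {
  // shut down once a micro-batch processes no new rows
  if (event.progress.numInputRows == 0) shutdown = true
}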
You can probably use this:
def stopStreamQuery(query: StreamingQuery, awaitTerminationTimeMs: Long): Unit = {
  while (query.isActive) {
    try {
      // if the last micro-batch saw (almost) no input rows, assume the topic is drained
      if (query.lastProgress.numInputRows < 10) {
        query.awaitTermination(awaitTerminationTimeMs)
        query.stop()
      }
    } catch {
      // lastProgress is null before the first batch completes
      case e: NullPointerException => println("First Batch")
    }
    Thread.sleep(500)
  }
}
You can turn the numInputRows threshold into a variable.
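One possible way to wire it up (my sketch; the console sink and checkpoint path are just for illustration) is to run the monitor on its own thread while the main thread blocks on awaitTermination:
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint") // placeholder path
  .start()

// the monitor thread calls query.stop() once the topic looks drained
new Thread(() => stopStreamQuery(query, awaitTerminationTimeMs = 1000)).start()

query.awaitTermination()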

Using Java 8 parallelStream inside Spark mapPartitions

I am trying to understand the behavior of a Java 8 parallel stream inside Spark's parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size, but that's not the case; I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
    List listOfThings = IteratorUtils.toList(iterator);
    return listOfThings.parallelStream()
        .map(thing -> {
            // some stuff here
            return thing;
        })
        .collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
It's a very good question.
What's going on here is a race condition. When you parallelise the stream, the stream splits the full list into several roughly equal parts (based on the available threads and the size of the list) and then tries to run each sub-part independently on its own thread.
But you are also using Apache Spark, a general-purpose computation engine famous for doing the work fast, and Spark uses the same approach (parallelising the work) to perform the action.
So in this scenario Spark has already parallelised the whole work, and inside that you are parallelising again; this is where the race condition starts. A Spark executor starts processing its partition, your stream then acquires other threads and starts processing, and if a thread doing stream work finishes before the Spark executor completes its own work, its result is added; otherwise the executor goes ahead and reports its (incomplete) result to the master.
Re-parallelising the work like this is not a good approach; it will always give you pain. Let Spark do it for you.
Hope you understand what's going on here.
Thanks

Replaying an RDD in spark streaming to update an accumulator

I am actually running out of options.
In my Spark Streaming application I want to keep state for some keys. I get events from Kafka, then I extract keys from each event, say a userID. When no events are coming from Kafka, I still want to keep updating a counter for each userID every 3 seconds, since I configured the batch duration of my StreamingContext to 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my Spark streaming context as follows:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for (i <- 1 to 100) {
  rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
  Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This lets me access my accumulableCollection and update all the counters for all userIDs. Up to now everything works fine; however, when I change my loop from:
for (i <- 1 to 100) { } // this is for testing
To:
while (true) { } // this lets me access and update my accumulator throughout the application's life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this? Is there a straightforward way to keep updating the values for my userIDs (rather than creating a useless RDD and pushing it periodically to the queue stream)?
The reason the while (true) ... version does not work is that control never returns to the main execution line, and therefore nothing below that line gets executed. To solve that specific problem, we should execute the while loop in a separate thread; Future { while (...) { ... } } should probably work.
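A minimal sketch of that idea, reusing the rddQueue and ssc from the question:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// feed the queue from a background thread so the main line reaches ssc.start()
Future {
  while (true) {
    rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
    Thread.sleep(3000)
  }
}

ssc.start()
ssc.awaitTermination()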
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed. Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
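A sketch of the ConstantInputDStream variant (UPDATE_MY_ACCUMULATOR stands for the update function from the question):
import org.apache.spark.streaming.dstream.ConstantInputDStream

// the same RDD is replayed on every batch interval, acting as a built-in tick
val tickRDD = ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
val tickStream = new ConstantInputDStream(ssc, tickRDD)
tickStream.foreachRDD(UPDATE_MY_ACCUMULATOR)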
That said, the current approach looks fragile to me and would need revision.
