scala - best way to do parallel constant polling and processing - multithreading

I am trying to figure out the best way to do constant polling in an async, non-blocking way. The entire goal of the application is to start a few threads and have each thread constantly poll an external service (Kafka) for data; each thread can then process that data or hand it over to another thread. I don't see a way to do this with just a Scala Future, as waiting on it requires a timeout value. I can set it to a year, but that still doesn't seem like a good solution, e.g. Await.result(future, 365 days). Any pointers?

There are a couple of async, non-blocking Kafka libraries. You can write a consumer with either of them to pull data from Kafka topics.
https://github.com/cakesolutions/scala-kafka-client
https://github.com/akka/reactive-kafka
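If a plain polling loop is enough, the idea from the question (a few threads, each polling forever and handing records on) can be sketched with the standard library alone, with no Await and no timeout. This is only a sketch under assumptions: `PollingSketch`, `poll` and `sink` are hypothetical names, and `poll` stands in for whatever call fetches records from Kafka; it is not the API of either library above.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object PollingSketch {
  /** Starts `workers` threads. Each one calls `poll()` in a loop and hands
    * every returned record to `sink`. The returned function stops the loops.
    * `poll` is assumed to block briefly when there is no data (as a consumer
    * poll with a timeout would), otherwise the loop spins. An exception thrown
    * by `poll` or `sink` ends that worker; real code would handle it. */
  def start[A](workers: Int, poll: () => Seq[A], sink: A => Unit): () => Unit = {
    val running = new AtomicBoolean(true)
    val pool = Executors.newFixedThreadPool(workers)
    (1 to workers).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit =
          while (running.get()) poll().foreach(sink)
      })
    }
    () => {
      running.set(false)                         // loops exit after their current poll
      pool.shutdown()
      pool.awaitTermination(5, TimeUnit.SECONDS) // give in-flight polls time to finish
      ()
    }
  }
}
```

The caller keeps the returned stop function and invokes it on shutdown; `sink` can process the record directly or put it on a queue for another thread.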

Related

Spark Streaming - Poison Pill?

I'm trying to decide how best to design a data pipeline that will involve Spark Streaming.
The essential process I imagine is:
1. Set up a streaming job that watches a fileStream (this is the consumer)
2. Do a bunch of computation elsewhere, which populates that file (this is the producer)
3. The streaming job consumes the data as it comes in, performing various actions
4. When the producer is done, wait for all the streaming computations to finish, and tear down the streaming job.
It's step (4) that has me confused. I'm not sure how to shut it down gracefully. The recommendations I've found generally suggest "Ctrl-C" on the driver, along with the spark.streaming.stopGracefullyOnShutdown config setting.
I don't like that approach since it requires the producing code to somehow access the consumer's driver and send it a signal. These two systems could be completely unrelated; this is not necessarily easy to do.
Plus, there is already a communication channel — the fileStream — can't I use that?
In a traditional threaded producer/consumer situation, one common technique is to use a "poison pill". The producer sends a special piece of data indicating "no more data", then you wait for your consumers to exit.
Is there a reason this can't be done in Spark?
Surely there is a way for the stream processing code, upon seeing some special data, to send a message back to its driver?
The Spark docs have an example of listening to a socket, with socketTextStream, and it somehow is able to terminate when the producer is done. I haven't dived into that code yet, but this seems like it should be possible.
Any advice?
Is this fundamentally wrong-headed?
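For reference, the traditional threaded "poison pill" technique mentioned in the question looks roughly like the sketch below. It uses plain JVM threads and a queue, not Spark, so it illustrates the pattern only and does not show how to signal a Spark driver. `PoisonPillSketch`, `Data`, `Done` and `demo` are hypothetical names.

```scala
import java.util.concurrent.LinkedBlockingQueue

object PoisonPillSketch {
  sealed trait Msg
  final case class Data(value: String) extends Msg
  case object Done extends Msg // the "poison pill": no more data will follow

  /** Processes messages until Done arrives, then returns the results. */
  def consume(queue: LinkedBlockingQueue[Msg]): Vector[String] = {
    @annotation.tailrec
    def loop(acc: Vector[String]): Vector[String] =
      queue.take() match {
        case Data(v) => loop(acc :+ v.toUpperCase) // stand-in for real processing
        case Done    => acc
      }
    loop(Vector.empty)
  }

  /** Runs one producer thread and consumes on the calling thread. */
  def demo(items: Seq[String]): Vector[String] = {
    val queue = new LinkedBlockingQueue[Msg]()
    val producer = new Thread(new Runnable {
      def run(): Unit = {
        items.foreach(i => queue.put(Data(i)))
        queue.put(Done) // sent last, through the same channel as the data
      }
    })
    producer.start()
    val out = consume(queue)
    producer.join()
    out
  }
}
```

With several consumers, the producer would need to send one `Done` per consumer.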

How to properly construct a Twitter Future

We're using Finatra and have services return a Twitter Future.
Currently we use either Future { ... } or Future.value(..) to construct Future instances, but looking at the source this does not seem correct.
In Future.apply source doc it says: "that a is executed in the calling thread and as such some care must be taken with blocking code."
So, how to create a Future which executes the function on a separate thread, just like the Scala Future does?
You need a FuturePool for that. Something like val future = FuturePool.defaultPool { doStuff() }
Both Future.value and Future.apply are immediate. They are more or less equivalent to scala.concurrent.Future.successful.
+1 to Dima's answer, but...
Doing things in a background thread (FuturePool) because your server is struggling to keep up with request load isn't usually the correct solution. Assuming you are just processing a CPU-intensive task for 100ms, it's probably better to keep it on the same thread and adjust the number of servers you have and the number of threads servicing requests.
But if you are doing something like querying a database or remote service, that call would ideally return a truly asynchronous Future that isn't blocking any finagle threads.
If you have a sync API wrapping a network service, then FuturePool is probably the correct thing to work around it.
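The distinction between an "immediate" future and one whose body runs on a pool can be seen with the standard library alone. The sketch below is an analogy using scala.concurrent, not Twitter's API: `Future.successful` plays the role the answer above assigns to `Future.value`, and a future started on a dedicated executor plays the role of a `FuturePool`. `FutureThreadSketch` and `threadNames` are hypothetical names.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FutureThreadSketch {
  /** Returns (thread that evaluated the immediate value,
    *          thread that ran the pooled body). */
  def threadNames(): (String, String) = {
    val executor = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
    try {
      // The argument is evaluated right here, on the caller's thread.
      val immediate = Future.successful(Thread.currentThread().getName)
      // The body is submitted to the executor and runs on its worker thread.
      val pooled = Future(Thread.currentThread().getName)
      (Await.result(immediate, 5.seconds), Await.result(pooled, 5.seconds))
    } finally executor.shutdown()
  }
}
```

The first name is the calling thread and the second is the executor's worker, which is why blocking code inside an immediate future stalls the caller.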

synchronous vs asynchronous write/delete in Cassandra

What is the difference between synchronous and asynchronous write/delete in Cassandra?
If I use the executeAsynchronously() method instead of the execute() method of the CqlOperation class (datastax driver), will it improve performance in terms of throughput (TPS)? In my application I am doing single inserts/deletes as well as batch inserts.
Until now I have used only the execute method (synchronous), and I am thinking of using asynchronous execution to improve the application's TPS.
Async writes offer better performance per worker, but they add the overhead of callbacks and error handling.
I recently ran a test to measure the performance benefit, including a callback implementation with error handling. Using a single worker writing 1M records, async writes were found to be 4 times as fast as synchronous ones. In-flight queries were limited to 1000; this number can be tuned to the environment (the number of connections you want to put on the wire). For example, with 200ms network latency and 1ms server response time, one may put 200 queries in flight, so that the server is processing at least one query almost all the time, whereas a sync call would leave the server free for 199ms out of every 200ms. Without a limit, however, async writes will congest the network and may degrade performance.
In some cases a synchronous query may be more suitable, especially if the result of the query is needed before the program can move ahead. But in most cases async suffices.
In short, the answer to your question is yes: I have measured a TPS increase of 4x.
Reference - performance evaluation using async writes
A sync write (or delete) to Cassandra will block code execution until the client receives a confirmation that the operation has been completed, based on the consistency level.
An async write (or delete), on the other hand, will send the query to Cassandra and then proceed with code execution (it will not block). You then have to register some kind of callback that will inform you (asynchronously) that the write operation has completed.
All of the blocking adds up and can slow down your application. Because async queries proceed immediately, they allow you to send more async queries right after, instead of waiting for the first one to finish. This is where the performance increase occurs, especially if you are sending a lot of queries to Cassandra.
It will definitely increase the performance.
I have not tried it, but the link below says the same.
Read the question
http://www.datastax.com/dev/blog/java-driver-async-queries
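The in-flight limit described in the answers above can be sketched with a semaphore. This is a minimal illustration under assumptions, not the driver's API: `writeAsync` is a hypothetical stand-in for whatever async call the driver offers (wrapped as a Scala Future), and `InFlightLimitSketch` and `writeAll` are made-up names.

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object InFlightLimitSketch {
  /** Issues `writeAsync` for every record while never allowing more than
    * `maxInFlight` writes to be outstanding. Completes with the number of
    * records written, or fails if any write fails. */
  def writeAll[A](records: Seq[A], maxInFlight: Int)(writeAsync: A => Future[Unit])
                 (implicit ec: ExecutionContext): Future[Int] = {
    val permits = new Semaphore(maxInFlight)
    val writes = records.map { r =>
      permits.acquire() // blocks the submitting thread once the limit is reached
      val f =
        try writeAsync(r)
        catch { case NonFatal(e) => Future.failed[Unit](e) }
      f.onComplete(_ => permits.release()) // free the slot on success or failure
      f
    }
    Future.sequence(writes).map(_.size)
  }
}
```

The submitting thread should not itself be a thread of `ec`, since it blocks in `acquire()` while the callbacks that release permits run on `ec`. With the figures from the answer (200ms latency, 1ms server time), `maxInFlight` would be around 200.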

Esper UpdateListener's concurrency

My boss wants me to learn Esper, the open source library for CEP, so I need some help.
I want many UpdateListeners subscribed to one event stream, running concurrently. That is, if one listener has a long, heavy process, the other listeners should keep running concurrently. We receive many events in a short time, so I need faster processing.
The UpdateListener code can simply use a Java threadpool to do its work. For an example there is http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html.
In Esper you can also configure threading.
http://esper.codehaus.org/esper-5.1.0/doc/reference/en-US/html_single/index.html#api-threading-advanced

How to have many consumer threads using BlockingCollection

I am using a producer / consumer pattern backed with a BlockingCollection to read data off a file, parse/convert and then insert into a database. The code I have is very similar to what can be found here: http://dhruba.name/2012/10/09/concurrent-producer-consumer-pattern-using-csharp-4-0-blockingcollection-tasks/
However, the main difference is that my consumer threads not only parse the data but also insert into a database. This bit is slow, and I think is causing the threads to block.
In the example, there are two consumer threads. I am wondering if there is a way to have the number of threads increase in a somewhat intelligent way? I had thought a threadpool would do this, but can't seem to grasp how that would be done.
Alternatively, how would you go about choosing the number of consumer threads? 2 does not seem correct for me, but I'm not sure what the best # would be. Thoughts on the best way to choose # of consumer threads?
The best way to choose the number of consumer threads is math: figure out how many packets per minute are coming in from the producers, divide that by how many packets per minute a single consumer can handle, and you have a pretty good idea of how many consumers you need.
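As a worked example of that arithmetic, here is a small sketch. Rounding up and using at least one consumer are assumptions added here, and `ConsumerCountSketch` and `consumersNeeded` are hypothetical names.

```scala
object ConsumerCountSketch {
  /** Incoming rate divided by the rate one consumer can sustain,
    * rounded up, with a minimum of one consumer. */
  def consumersNeeded(incomingPerMinute: Double, perConsumerPerMinute: Double): Int = {
    require(incomingPerMinute >= 0 && perConsumerPerMinute > 0, "rates must be positive")
    math.max(1, math.ceil(incomingPerMinute / perConsumerPerMinute).toInt)
  }
}
```

For instance, 6000 packets per minute arriving with each consumer handling 1500 per minute gives 4 consumers; 6100 per minute would need 5.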
I solved the blocking output problem (consumers blocking when trying to update the database) by adding another BlockingCollection that the consumers put their completed packets in. A separate thread reads that queue and updates the database. So it looks something like:
input thread(s) => input queue => consumer(s) => output queue => output thread
This has the added benefit of divorcing the consumers from the output, meaning that you can optimize the output or completely change the output method without affecting the consumer. That might allow you, for example, to batch the database updates so that rather than making one database call per record, you could update a dozen or a hundred (or more) records with a single call.
I show a very simple example of this (using a single consumer) in my article Simple Multithreading, Part 2. That works with a text file filter, but the concepts are the same.
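The input queue => consumers => output queue => output thread layout, with batched output, might look like the sketch below. It is a JVM analogue that uses LinkedBlockingQueue in place of the C# BlockingCollection, and it is not the code from the article mentioned above. `PipelineSketch`, `process`, `parse` and `writeBatch` are hypothetical names; `writeBatch` stands in for the database call.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  /** None is used as the end-of-stream marker on both queues. */
  def process(lines: Seq[String], consumers: Int, batchSize: Int)
             (parse: String => String, writeBatch: Seq[String] => Unit): Unit = {
    val input  = new LinkedBlockingQueue[Option[String]]()
    val output = new LinkedBlockingQueue[Option[String]]()

    def thread(body: => Unit): Thread = {
      val t = new Thread(new Runnable { def run(): Unit = body })
      t.start()
      t
    }

    // Consumers only parse; they never touch the slow output.
    val workers = (1 to consumers).map { _ =>
      thread {
        var next = input.take()
        while (next.isDefined) {
          output.put(next.map(parse))
          next = input.take()
        }
      }
    }

    // A single output thread collects results and writes them in batches.
    val writer = thread {
      val batch = ArrayBuffer.empty[String]
      var next = output.take()
      while (next.isDefined) {
        batch += next.get
        if (batch.size >= batchSize) { writeBatch(batch.toList); batch.clear() }
        next = output.take()
      }
      if (batch.nonEmpty) writeBatch(batch.toList) // flush the final partial batch
    }

    lines.foreach(l => input.put(Some(l)))
    (1 to consumers).foreach(_ => input.put(None)) // one end marker per consumer
    workers.foreach(_.join())
    output.put(None)                               // all consumers done: stop the writer
    writer.join()
  }
}
```

Because the consumers only see the output queue, the batch size or the write strategy can change without touching them.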
