assume having a List where results of jobs that are computed distributed are stored.
Now I have a main thread that is waiting for all jobs finished.
I know the size of the List needs to have until all jobs are finished.
What is the most elegant way in scala let the main thread (while(true) loop) sleep and getting it awake when the jobs are finished?
thanks for your answers
EDIT: ok after trying the concept from #Stefan-Kunze without success (guess I didnt got the point...) I give an example with some code:
The first node:
class PingPlugin extends SmasPlugin
{
val messages = new ListBuffer[BaseMessage]()
val sum = 5
def onStop = true
def onStart =
{
log.info("Ping Plugin created!")
true
}
def handleInit(msg: Init)
{
log.info("Init received")
for( a <- 1 to sum)
{
msg.pingTarget ! Ping() // Ping extends BaseMessage
}
// block here until all messages are received
// wait for messages.length == sum
log.info("handleInit - messages received: %d/%d ".format(messages.length, sum))
}
/**
* This method handles incoming Pong messages
* #param msg Pong extends BaseMessage
*/
def handlePong(msg: Pong)
{
log.info("Pong received from: " + msg.sender)
messages += msg
log.info("handlePong - messages received: %d/%d ".format(messages.length, sum))
}
}
a second node:
class PongPlugin extends SmasPlugin
{
def onStop = true
def onStart =
{
log.info("Pong Plugin created!")
true
}
/**
* This method receives Ping messages and send a Pong message back after a random time
* #param msg Ping extends BaseMessage
*/
def handlePing(msg: Ping)
{
log.info("Ping received from: " + msg.sender)
val sleep: Int = math.round(5000 * Random.nextFloat())
log.info("sleep: " + sleep)
Thread.sleep(sleep)
msg.sender ! Pong()
}
}
I guess the solution is possible with futures...
Picking up #jilen 's approach: (this code is assuming your results are of a type result)
//just like lists futures can be yielded
val tasks: Seq[Future[Result]] = for (i <- 1 to results.size) yield future {
//results.size is the number of //results you are expecting
println("Executing task " + i)
Thread.sleep(i * 1000L)
val result = ??? //your code goes here
result
}
//merge all future results into a future of a sequence of results
val aggregated: Future[Seq[Result]] = Future.sequence(tasks)
//awaits for your results to be computed
val squares: Seq[Int] = Await.result(aggregated, Duration.Inf)
println("Squares: " + squares)
It's hard to test the code here, since I don't have the rest of this system, but I'll try. I'm assuming that somewhere underneath all of this is Akka.
First, blocking like this suggests a real design problem. In an actor system, you should send your messages and move on. Your log command should be in handlePong when the correct number of pings have returned. Blocking init hangs the entire actor. You really should never do that.
But, ok, what if you absolutely have to do that? Then a good tool here would be the ask pattern. Something like this (I can't check that this compiles without more of your code):
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._
...
implicit val timeout = Timeout(5 seconds)
var pendingPongs = List.empty[Future[Pong]]
for( a <- 1 to sum)
{
// Ask each target for Ping. Append the returned Future to pendingPongs
pendingPongs += msg.pingTarget ? Ping() // Ping extends BaseMessage
}
// pendingPongs is a list of futures. We want a future of a list.
// sequence() does that for us. We then block using Await until the future completes.
val pongs = Await.result(Future.sequence(pendingPongs), 5 seconds)
log.info(s"handlePong - messages received: ${pongs.length}/$sum")
Related
In my existing Scala code I replaced Thread.sleep(10000) with ZIO.sleep(Duration.fromScala(10.seconds)) with the understanding that it won't block thread from the thread pool (performance issue). When program runs it does not wait at this line (whereas of course in first case it does). Do I need to add any extra code for ZIO method to work ?
Adding code section from Play+Scala code:
def sendMultipartEmail = Action.async(parse.multipartFormData) { request =>
.....
//inside this controller below method is called
def retryEmailOnFail(pList: ListBuffer[JsObject], content: String) = {
if (!sendAndGetStatus(pList, content)) {
println("<--- email sending failed - retry once after a delay")
ZIO.sleep(Duration.fromScala(10.seconds))
println("<--- retrying email sending after a delay")
finalStatus = finalStatus && sendAndGetStatus(pList, content)
} else {
finalStatus = finalStatus && true
}
}
.....
}
As you said, ZIO.sleep will only suspend the fiber that is running, not the operating system thread.
If you want to start something after sleeping, you should just chain it after the sleep:
// value 42 will only be computed after waiting for 10s
val io = ZIO.sleep(Duration.fromScala(10.seconds)).map(_ => 42)
I'm new to Groovy and am a bit lost on how to batch up requests so they can be submitted to a server as a batch, instead of individually, as I currently have:
class Handler {
private String jobId
// [...]
void submit() {
// [...]
// client is a single instance of Client used by all Handlers
jobId = client.add(args)
}
}
class Client {
//...
String add(String args) {
response = postJson(args)
return parseIdFromJson(response)
}
}
As it is now, something calls Client.add(), which POSTs to a REST API and returns a parsed result.
The issue I have is that the add() method is called maybe thousands of times in quick succession, and it would be much more efficient to collect all the args passed in to add(), wait until there's a moment when the add() calls stop coming in, and then POST to the REST API a single time for that batch, sending all the args in one go.
Is this possible? Potentially, add() can return a fake id immediately, as long as the batching occurs, the submit happens, and Client can later know the lookup between fake id and the ID coming from the REST API (which will return IDs in the order corresponding to the args sent to it).
As mentioned in the comments, this might be a good case for gpars which is excellent at these kinds of scenarios.
This really is less about groovy and more about asynchronous programming in java and on the jvm in general.
If you want to stick with the java concurrent idioms I threw together a code snippet you could use as a potential starting point. This has not been tested and edge cases have not been considered. I wrote this up for fun and since this is asynchronous programming and I haven't spent the appropriate time thinking about it, I suspect there are holes in there big enough to drive a tank through.
That being said, here is some code which makes an attempt at batching up the requests:
import java.util.concurrent.*
import java.util.concurrent.locks.*
// test code
def client = new Client()
client.start()
def futureResponses = []
1000.times {
futureResponses << client.add(it as String)
}
client.stop()
futureResponses.each { futureResponse ->
// resolve future...will wait if the batch has not completed yet
def response = futureResponse.get()
println "received response with index ${response.responseIndex}"
}
// end of test code
class FutureResponse extends CompletableFuture<String> {
String args
}
class Client {
int minMillisLullToSubmitBatch = 100
int maxBatchSizeBeforeSubmit = 100
int millisBetweenChecks = 10
long lastAddTime = Long.MAX_VALUE
def batch = []
def lock = new ReentrantLock()
boolean running = true
def start() {
running = true
Thread.start {
while (running) {
checkForSubmission()
sleep millisBetweenChecks
}
}
}
def stop() {
running = false
checkForSubmission()
}
def withLock(Closure c) {
try {
lock.lock()
c.call()
} finally {
lock.unlock()
}
}
FutureResponse add(String args) {
def future = new FutureResponse(args: args)
withLock {
batch << future
lastAddTime = System.currentTimeMillis()
}
future
}
def checkForSubmission() {
withLock {
if (System.currentTimeMillis() - lastAddTime > minMillisLullToSubmitBatch ||
batch.size() > maxBatchSizeBeforeSubmit) {
submitBatch()
}
}
}
def submitBatch() {
// here you would need to put the combined args on a format
// suitable for the endpoint you are calling. In this
// example we are just creating a list containing the args
def combinedArgs = batch.collect { it.args }
// further there needs to be a way to map one specific set of
// args in the combined args to a specific response. If the
// endpoint responds with the same order as the args we submitted
// were in, then that can be used otherwise something else like
// an id in the response etc would need to be figured out. Here
// we just assume responses are returned in the order args were submitted
List<String> combinedResponses = postJson(combinedArgs)
combinedResponses.indexed().each { index, response ->
// here the FutureResponse gets a value, can be retrieved with
// futureResponse.get()
batch[index].complete(response)
}
// clear the batch
batch = []
}
// bogus method to fake post
def postJson(combinedArgs) {
println "posting json with batch size: ${combinedArgs.size()}"
combinedArgs.collect { [responseIndex: it] }
}
}
A few notes:
something needs to be able to react to the fact that there were no calls to add for a while. This implies a separate monitoring thread and is what the start and stop methods manage.
if we have an infinite sequence of adds without pauses, you might run out of resources. Therefore the code has a max batch size where it will submit the batch even if there is no lull in the calls to add.
the code uses a lock to make sure (or try to, as mentioned above, I have not considered all potential issues here) we stay thread safe during batch submissions etc
assuming the general idea here is sound, you are left with implementing the logic in submitBatch where the main problem is dealing with mapping specific args to specific responses
CompletableFuture is a java 8 class. This can be solved using other constructs in earlier releases, but I happened to be on java 8.
I more or less wrote this without executing or testing, I'm sure there are some mistakes in there.
as can be seen in the printout below, the "maxBatchSizeBeforeSubmit" setting is more a recommendation that an actual max. Since the monitoring thread sleeps for some time and then wakes up to check how we are doing, the threads calling the add method might have accumulated any number of requests in the batch. All we are guaranteed is that every millisBetweenChecks we will wake up and check how we are doing and if the criteria for submitting a batch has been reached, then the batch will be submitted.
If you are unfamiliar with java Futures and locks, I would recommend you read up on them.
If you save the above code in a groovy script code.groovy and run it:
~> groovy code.groovy
posting json with batch size: 153
posting json with batch size: 234
posting json with batch size: 243
posting json with batch size: 370
received response with index 0
received response with index 1
received response with index 2
...
received response with index 998
received response with index 999
~>
it should work and print out the "responses" received from our fake json submissions.
I'd like to retrieve multiple "costly" results using parallel processing but within a specific timeout.
I'm using GPars Dataflow.task but it looks like I'm missing something as the process returns only when all dataflow variable are bound.
def timeout = 500
def mapResults = []
GParsPool.withPool(3) {
def taskWeb1 = Dataflow.task {
mapResults.web1 = new URL('http://web1.com').getText()
}.join(timeout, TimeUnit.MILLISECONDS)
def taskWeb2 = Dataflow.task {
mapResults.web2 = new URL('http://web2.com').getText()
}.join(timeout, TimeUnit.MILLISECONDS)
def taskWeb3 = Dataflow.task {
mapResults.web3 = new URL('http://web3.com').getText()
}.join(timeout, TimeUnit.MILLISECONDS)
}
I did see in the GPars Timeouts doc a way to use Select to get the fastest result within the timeout.
But I'm looking for a way to retrieve as much as possible results in the given time frame.
Is there a better "GPars" way to achieve this?
Or with Java 8 Future/Callable ?
Since you're interested in Java 8 based solutions too, here's a way to do it:
int timeout = 250;
ExecutorService executorService = Executors.newFixedThreadPool(3);
try {
Map<String, CompletableFuture<String>> map =
Stream.of("http://google.com", "http://yahoo.com", "http://bing.com")
.collect(
Collectors.toMap(
// the key will be the URL
Function.identity(),
// the value will be the CompletableFuture text fetched from the url
(url) -> CompletableFuture.supplyAsync(
() -> readUrl(url, timeout),
executorService
)
)
);
executorService.awaitTermination(timeout, TimeUnit.MILLISECONDS);
//print the resulting map, cutting the text at 100 chars
map.entrySet().stream().forEach(entry -> {
CompletableFuture<String> future = entry.getValue();
boolean completed = future.isDone()
&& !future.isCompletedExceptionally()
&& !future.isCancelled();
System.out.printf("url %s completed: %s, error: %s, result: %.100s\n",
entry.getKey(),
completed,
future.isCompletedExceptionally(),
completed ? future.getNow(null) : null);
});
} catch (InterruptedException e) {
//rethrow
} finally {
executorService.shutdownNow();
}
This will give you as many Futures as URLs you have, but gives you an opportunity to see if any of the tasks failed with an exception. The code could be simplified if you're not interested in these exceptions, only the contents of successful retrievals:
int timeout = 250;
ExecutorService executorService = Executors.newFixedThreadPool(3);
try {
Map<String, String> map = Collections.synchronizedMap(new HashMap<>());
Stream.of("http://google.com", "http://yahoo.com", "http://bing.com")
.forEach(url -> {
CompletableFuture
.supplyAsync(
() -> readUrl(url, timeout),
executorService
).thenAccept(content -> map.put(url, content));
});
executorService.awaitTermination(timeout, TimeUnit.MILLISECONDS);
//print the resulting map, cutting the text at 100 chars
map.entrySet().stream().forEach(entry -> {
System.out.printf("url %s completed, result: %.100s\n",
entry.getKey(), entry.getValue() );
});
} catch (InterruptedException e) {
//rethrow
} finally {
executorService.shutdownNow();
}
Both of the codes will wait for about 250 milliseconds (it will take only a tiny bit more because of the submissions of the tasks to the executor service) before printing the results. I found about 250 milliseconds is the threshold where some of these url-s can be fetched on my network, but not necessarily all. Feel free to adjust the timeout to experiment.
For the readUrl(url, timeout) method you could use a utility library like Apache Commons IO. The tasks submitted to the executor service will get an interrupt signal even if you don't explicitely take into account the timeout parameter. I could provide an implementation for that but I believe it's out of scope for the main issue in your question.
I wanted to stream the data returned from a slick 3.0.0 query via db.stream(yourquery) through scalaz-stream.
It looks like reactive-streams.org is used an API and dataflow model that different libraries implement.
How do you do that with the back pressure flowing back from the scalaz-stream process to the slick publisher?
Take a look at https://github.com/krasserm/streamz
Streamz is a resource combinator library for scalaz-stream. It allows Process instances to consume from and produce to:
Apache Camel endpoints
Akka Persistence journals and snapshot stores and
Akka Stream flows (reactive streams) with full back-pressure support
I finally did answer my own question. If you are willing to use scalaz-streams queues to queue up streaming results.
def getData[T](publisher: slick.backend.DatabasePublisher[T],
queue: scalaz.stream.async.mutable.Queue[T], batchRequest: Int = 1): Task[scala.concurrent.Future[Long]] =
Task {
val p = scala.concurrent.Promise[Unit]()
var counter: Long = 0
val s = new org.reactivestreams.Subscriber[T] {
var sub: Subscription = _
def onSubscribe(s: Subscription): Unit = {
sub = s
sub.request(batchRequest)
}
def onComplete(): Unit = {
sub.cancel()
p.success(counter)
}
def onError(t: Throwable): Unit = p.failure(t)
def onNext(e: T): Unit = {
counter += 1
queue.enqueueOne(e).run
sub.request(batchRequest)
}
}
publisher.subscribe(s)
p.future
}
When you run this using run you obtain a future that when finished, means the query finished streaming. You can compose on this future if you wanted your computation to wait for all the data to arrive. You could also add use an Await in the Task in getData then compose your computation on the returned Task object if you need all the data to run before continuing. For what I do, I compose on the future's completion and shutdown the queue so that my scalaz-stream knows to terminate cleanly.
Here is a slightly different implementation (than the one posted by user1763729) which returns a Process:
def getData[T](publisher: DatabasePublisher[T], batchSize: Long = 1L): Process[Task, T] = {
val q = async.boundedQueue[T](10)
val subscribe = Task.delay {
publisher.subscribe(new Subscriber[T] {
#volatile var subscription: Subscription = _
override def onSubscribe(s: Subscription) {
subscription = s
subscription.request(batchSize)
}
override def onNext(next: T) = {
q.enqueueOne(next).attemptRun
subscription.request(batchSize)
}
override def onError(t: Throwable) = q.fail(t).attemptRun
override def onComplete() = q.close.attemptRun
})
}
Process.eval(subscribe).flatMap(_ => q.dequeue)
}
We've got some code in Java using ThreadPoolExecutor and CompletionService. Tasks are submitted in large batches to the pool; results go to the completion service where we collect completed tasks when available without waiting for the entire batch to complete:
ThreadPoolExecutor _executorService =
new ThreadPoolExecutor(MAX_NUMBER_OF_WORKERS, new LinkedBlockingQueue(20));
CompletionService _completionService =
new ExecutorCompletionService<Callable>(_executorService)
//submit tasks
_completionService.submit( some task);
//get results
while(...){
Future result = _completionService.poll(timeout);
if(result)
//process result
}
The total number of workers in the pool is MAX_NUMBER_OF_WORKERS; tasks submitted without an available worker are queued; up to 20 tasks may be queued, after which, tasks are rejected.
What is the Gpars counterpart to this approach?
Reading the documentation on gpars parallelism, I found many potential options: collectManyParallel(), anyParallel(), fork/join, etc., and I'm not sure which ones to even test. I was hoping to find some mention of "completion" or "completion service" as a comparison in the docs, but found nothing. I'm looking for some direction/pointers on where to start from those experienced with gpars.
Collecting results on-the-fly, throttling producers - this calls for a dataflow solution. Please find a sample runnable demo below:
import groovyx.gpars.dataflow.DataflowQueue
import groovyx.gpars.group.DefaultPGroup
import groovyx.gpars.scheduler.DefaultPool
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit
int MAX_NUMBER_OF_WORKERS = 10
ThreadPoolExecutor _executorService =
new ThreadPoolExecutor(MAX_NUMBER_OF_WORKERS, MAX_NUMBER_OF_WORKERS, 1000, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(200));
final group = new DefaultPGroup(new DefaultPool(_executorService))
final results = new DataflowQueue()
//submit tasks
30.times {value ->
group.task(new Runnable() {
#Override
void run() {
println 'Starting ' + Thread.currentThread()
sleep 5000
println 'Finished ' + Thread.currentThread()
results.bind(value)
}
});
}
group.task {
results << -1 //stop the consumer eventually
}
//get results
while (true) {
def result = results.val
println result
if (result == -1) break
//process result
}
group.shutdown()