GPars report status on large number of async functions and wait for completion - groovy

I have a parser, and after gathering the data for a row, I want to fire an async function and let it process the row, while the main thread continues on and gets the next row.
I've seen this post: How do I execute two tasks simultaneously and wait for the results in Groovy? but I'm not sure it is the best solution for my situation.
What I want to do is, after all the rows are read, wait for all the async functions to finish before I go on. One concern with using a collection of Promises is that the list could be large (100,000+).
Also, I want to report status as we go. Finally, I'm not sure I want to wait with a timeout (as with get()), because the file could be huge. I do, however, want to allow the user to kill the process for various reasons.
So what I've done for now is record the number of rows parsed (as they occur via rowsRead), then use a callback from the Promise to record another row being finished processing, like this:
def promise = processRow(row)
promise.whenBound {
rowsProcessed.incrementAndGet()
}
Where rowsProcessed is an AtomicInteger.
Then in the code invoked at the end of the sheet, after all parsing is done and I'm waiting for the processing to finish, I'm doing this:
boolean test = true
while (test) {
Thread.sleep(1000) // No need to pound the CPU with this check
println "read: ${sheet.rowsRead}, processed: ${sheet.rowsProcessed.get()}"
if (sheet.rowsProcessed.get() == sheet.rowsRead) {
test = false
}
}
The nice thing is, I don't have an explosion of Promise objects here - just a simple count to check. But I'm not sure sleeping every so often is as efficient as checking the get() on each Promise() object.
So, my questions are:
1. If I used the collection of Promises instead, would a get() react and return if the thread executing the while loop above was interrupted with Thread.interrupt()?
2. Would using the collection of Promises and calling get() on each be more efficient than trying to sleep and check every so often?
3. Is there another, better approach that I haven't considered?
Thanks!

1. A call to allPromises*.get() will throw an InterruptedException if the waiting (main) thread gets interrupted.
2. Yes. The promises have been created anyway, so grouping them in a list should not impose additional memory requirements, in my opinion.
3. The suggested solutions with a CountDownLatch or a Phaser are, IMO, much more suitable than busy waiting.
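The interruption behaviour described in the first point can be illustrated with plain java.util.concurrent. This is only a sketch: it uses a CompletableFuture as a stand-in for a GPars Promise (an assumption, not GPars code), and the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptibleGetDemo {

    /** Returns true if a thread blocked in get() was released by an interrupt. */
    public static boolean getIsInterruptible() {
        // A future that is never completed, standing in for a row that takes forever.
        CompletableFuture<String> never = new CompletableFuture<>();
        AtomicBoolean interrupted = new AtomicBoolean(false);

        Thread waiter = new Thread(() -> {
            try {
                never.get();                 // would block indefinitely
            } catch (InterruptedException e) {
                interrupted.set(true);       // released by waiter.interrupt()
            } catch (ExecutionException e) {
                // not expected here: the future never completes exceptionally
            }
        });
        waiter.start();
        waiter.interrupt();                  // e.g. the user cancels the import
        try {
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return interrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("get() interrupted: " + getIsInterruptible());
    }
}
```

The same idea applies to a loop over a list of promises: the interrupt surfaces as an exception from whichever get() the waiting thread is currently blocked in.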

An alternative to an AtomicInteger is to use a CountDownLatch. It avoids both the sleep and the large collection of Promise objects. You could use it like this:
latch = new CountDownLatch(sheet.rowsRead)
...
def promise = processRow(row)
promise.whenBound {
latch.countDown()
}
...
while (!latch.await(1, TimeUnit.SECONDS)) {
println "read: ${sheet.rowsRead}, processed: ${sheet.rowsRead - latch.count}"
}
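A CountDownLatch needs its count up front, which is awkward when the number of rows is only known once parsing has finished. A Phaser, mentioned in the other answer, lets parties register as rows are read. The following is a rough Java sketch under several assumptions: the executor, the dummy row work, and the class name are all hypothetical, and a Phaser allows at most 65,535 registered parties at once, so with 100,000+ rows this only works if rows complete roughly as fast as they are read (or if in-flight work is bounded some other way).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class PhaserProgressDemo {

    /** Processes rowCount dummy rows and returns how many were processed. */
    public static int run(int rowCount) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Phaser phaser = new Phaser(1);                 // 1 party: the parsing thread
        AtomicInteger rowsProcessed = new AtomicInteger();
        int rowsRead = 0;
        try {
            for (int i = 0; i < rowCount; i++) {       // stands in for the parser loop
                phaser.register();                     // one more row in flight
                rowsRead++;
                pool.execute(() -> {
                    try {
                        rowsProcessed.incrementAndGet();   // hypothetical row work
                    } finally {
                        phaser.arriveAndDeregister();      // this row is finished
                    }
                });
            }
            int phase = phaser.arrive();               // parsing is done
            while (true) {
                try {
                    // Wait up to one second for the remaining rows, then report.
                    phaser.awaitAdvanceInterruptibly(phase, 1, TimeUnit.SECONDS);
                    break;                             // every row has arrived
                } catch (TimeoutException e) {
                    System.out.println("read: " + rowsRead
                            + ", processed: " + rowsProcessed.get());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // user cancelled: stop waiting
        } finally {
            pool.shutdownNow();
        }
        return rowsProcessed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed: " + run(1000));
    }
}
```

As with the latch, the waiting thread wakes once a second only to print status, and an interrupt ends the wait immediately.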

Related

Using Promises to make In Memory Processing Concurrent

We have a project where we need to process ~5,000 objects and each object takes 200-500 milliseconds each to process. A developer on my team suggested using promises to try to process each object concurrently. So basically something like this:
let result = await Promise.all(objects.map(o => process(o)));
The process() code might look like this:
async function process(theObject) {
    return new Promise(resolve => {
        const sum = 1 + 1; // placeholder for the real work on theObject
        resolve();
    });
}
While it seems like a fair pattern, it seems like an anti-pattern, or a code smell. There also seems to be something about how Node/V8 handles promises that might introduce major issues later. Anyone have any thoughts on this pattern and whether it might be use-ful/less?
One caveat of using Promise.all() is how it handles errors. From the MDN:
It rejects with the reason of the first promise that rejects.
So if it is acceptable for a single processing error among the ~5,000 objects to fail the whole batch, then it seems like a decent tool. I would recommend setting up a queue, both to separate the processing from the orchestration of the messages and to provide scalability advantages.

await Task.WhenAll() vs Task.WhenAll().Wait()

I have a method that produces an array of tasks (See my previous post about threading) and at the end of this method I have the following options:
await Task.WhenAll(tasks); // done in a method marked with async
Task.WhenAll(tasks).Wait(); // done in any type of method
Task.WaitAll(tasks);
Basically, I want to know what the difference between the two WhenAll calls is, as the first one doesn't seem to wait until the tasks are completed whereas the second one does, but I don't want to use the second one if it's not asynchronous.
I have included the third option as I understand that it will block the current thread until all the tasks have completed processing (seemingly synchronously instead of asynchronously). Please correct me if I am wrong about this one.
Example function with await:
public async void RunSearchAsync()
{
_tasks = new List<Task>();
Task<List<SearchResult>> products = SearchProductsAsync(CoreCache.AllProducts);
Task<List<SearchResult>> brochures = SearchProductsAsync(CoreCache.AllBrochures);
_tasks.Add(products);
_tasks.Add(brochures);
await Task.WhenAll(_tasks.ToArray());
//code here hit before all _tasks completed but if I take off the async and change the above line to:
// Task.WhenAll(_tasks.ToArray()).Wait();
// code here hit after _tasks are completed
}
await will return to the caller, and resume method execution when the awaited task completes.
WhenAll creates a task that completes when all of the given tasks are complete.
WaitAll blocks the calling thread (for example, the main thread) until all the tasks are complete.
As for await Task.WhenAll(tasks) vs Task.WhenAll(tasks).Wait(): if execution is in an async context, always try to avoid .Wait() and .Result because they break the async paradigm.
Both block the thread, so no other operation can use it. That may not be a big problem in small apps, but in high-demand services it is bad: it can lead to thread starvation.
On the other hand, await waits for the task to complete in the background without blocking the thread, which leaves the framework free to use that thread for other work.

Java: ordering results retrieved from asynchronous tasks

I've got a computation (CTR encryption) that requires results in a precise order.
For this I created a multithreaded design that calculates said results, in this case the result is a ByteBuffer. The calculation itself of course runs asynchronous, so the results may become available at any time and in any order. The "user" is a single-threaded application that uses the results by calling a method, after which the ByteBuffers are returned to the pool of resources by said method - the management of resources is already handled (using a thread safe stack).
Now the question: I need something that aggregates the results and makes them available in the right order. If the next result is not available, the method that the user called should block until it is. Does anyone know a good strategy or class in java.util.concurrent that can return asynchronously calculated results in order?
The solution must be thread safe. I would like to avoid third-party libraries, Thread.sleep() / Thread.wait(), and threading-related keywords other than "synchronized". Furthermore, the tasks may be given to e.g. an Executor in the correct order if that is required. This is for research, so feel free to use Java 1.6 or even 1.7 constructs.
Note: I've tagged these questions [jre] as I want to keep within the classes defined in the JRE and [encryption] as somebody may already have had to deal with it, but the question itself is purely about Java and multi-threading.
Use the executors framework:
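ExecutorService executorService = Executors.newFixedThreadPool(5);
// invokeAll returns the futures in the same order as listOfCallables
List<Future<ByteBuffer>> futures = executorService.invokeAll(listOfCallables);
for (Future<ByteBuffer> future : futures) {
    ByteBuffer result = future.get();
    // do something with result
}
executorService.shutdown();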
The listOfCallables will be a List<Callable<ByteBuffer>> that you have constructed to operate on the data. For example:
list.add(new SubTaskCalculator(1, 20));
list.add(new SubTaskCalculator(21, 40));
list.add(new SubTaskCalculator(41, 60));
(arbitrary ranges of numbers, adjust that to your task at hand)
.get() blocks until the result is complete, but at the same time other tasks are also running, so when you reach them, their .get() will be ready.
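One detail worth noting: invokeAll itself only returns once every task has completed, so with the snippet above no result can be consumed while later ones are still being computed. If results should be handed out as soon as the next one in order is ready, one option is to submit tasks individually and keep their futures in a FIFO queue. The sketch below assumes a single consumer thread, as in the question; the class name and the dummy computation (writing a counter into a 4-byte buffer) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedFutures {
    private final ExecutorService pool = Executors.newFixedThreadPool(5);
    // Only touched by the single consumer thread, so it needs no extra locking.
    private final Queue<Future<ByteBuffer>> pending = new ArrayDeque<>();

    /** Submit tasks in the order in which their results will be needed. */
    public void submit(Callable<ByteBuffer> task) {
        pending.add(pool.submit(task));
    }

    /** Blocks until the oldest submitted task is done, then returns its result. */
    public ByteBuffer next() {
        try {
            // remove() throws NoSuchElementException if nothing was submitted
            return pending.remove().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public void shutdown() {
        pool.shutdown();
    }

    /** Computes 'blocks' dummy buffers in parallel and reads them back in order. */
    public static int[] demo(int blocks) {
        OrderedFutures results = new OrderedFutures();
        for (int i = 0; i < blocks; i++) {
            final int counter = i;
            results.submit(() -> ByteBuffer.allocate(4).putInt(0, counter));
        }
        int[] seen = new int[blocks];
        for (int i = 0; i < blocks; i++) {
            seen[i] = results.next().getInt(0);
        }
        results.shutdown();
        return seen;
    }
}
```

Because the futures are dequeued in submission order, the consumer sees results in that order no matter which worker finishes first.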
Returning results in the right order is trivial. As each result arrives, store it in an arraylist, and once you have ALL the results, just sort the arraylist. You could use a PriorityQueue to keep the results sorted at all times as they arrive, but there is no point in doing this, since you will not be making any use of the results before all of them have arrived anyway.
So, what you could do is this:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
In your work threads, do something like this:
...do work and produce a work_item...
synchronized( LockObject )
{
    ResultList.add( work_item );
    number_of_results++;
    LockObject.notifyAll();
}
In your main thread, do something like this:
synchronized( LockObject )
{
    while( number_of_results != number_of_items )
        LockObject.wait();
}
Collections.sort( ResultList ); // WorkItem must be Comparable by ordinal number
...go ahead and use the results...
My new answer after gaining a better understanding of what you want to do:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
Make use of a java.util.PriorityQueue which is kept sorted by ordinal number. Essentially, all we care is that the first item in the priority queue at any given time will be the next item to process.
Each work thread stores its result in the PriorityQueue and calls notifyAll() on some locking object.
The main thread waits on the locking object, and then if there are items in the queue, and if the ordinal of the (peeked, not dequeued) first item in the queue is equal to the number of items processed so far, then it dequeues the item and processes it. If not, it keeps waiting. If all of the items have been produced and processed, it is done.
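The steps above can be sketched as a small class. This is one possible reading of the description rather than the answerer's actual code: the class name, the generic payload type, and the demo method are hypothetical, and the buffer object itself serves as the lock.

```java
import java.util.PriorityQueue;

public class OrderedResultBuffer<T> {

    /** A result tagged with its ordinal number, ordered by that number. */
    private static final class Item<V> implements Comparable<Item<V>> {
        final long ordinal;
        final V value;

        Item(long ordinal, V value) {
            this.ordinal = ordinal;
            this.value = value;
        }

        @Override
        public int compareTo(Item<V> other) {
            return Long.compare(ordinal, other.ordinal);
        }
    }

    private final PriorityQueue<Item<T>> queue = new PriorityQueue<>();
    private long nextOrdinal = 0;   // ordinal the consumer needs next

    /** Called by worker threads, in any order. */
    public synchronized void put(long ordinal, T value) {
        queue.add(new Item<>(ordinal, value));
        notifyAll();                // wake the consumer so it can re-check the head
    }

    /** Called by the consumer; blocks until the next result in sequence is available. */
    public synchronized T take() throws InterruptedException {
        while (queue.isEmpty() || queue.peek().ordinal != nextOrdinal) {
            wait();
        }
        nextOrdinal++;
        return queue.poll().value;
    }

    /** Small demonstration: results arrive as 2, 0, 1 but are taken as 0, 1, 2. */
    public static String demo() {
        OrderedResultBuffer<String> buffer = new OrderedResultBuffer<>();
        Thread producer = new Thread(() -> {
            buffer.put(2, "c");
            buffer.put(0, "a");
            buffer.put(1, "b");
        });
        producer.start();
        try {
            return buffer.take() + buffer.take() + buffer.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Only synchronized, wait() and notifyAll() are used, which keeps the sketch within the constraints stated in the question.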

Locking on an object?

I'm very new to Node.js and I'm sure there's an easy answer to this, I just can't find it :(
I'm using the filesystem to hold 'packages' (folders with a status extension, e.g. 'mypackage.idle'). Users can perform actions on these which would cause the status to go to something like 'qa' or 'deploying', etc. If the server is accepting lots of requests and multiple requests come in for the same package, how would I check the status and then perform an action, which would change the status, guaranteeing that another request didn't alter it before or while the action took place?
So in C#, something like this:
lock (someLock) { checkStatus(); performAction(); }
Thanks :)
If checkStatus() and performAction() are synchronous functions called one after another, then as others mentioned earlier, their execution will run uninterrupted until completion.
However, I suspect that in reality both of these functions are asynchronous, and the realistic case of composing them is something like:
function checkStatus(callback){
doSomeIOStuff(function(something){
callback(something == ok);
});
}
checkStatus(function(status){
if(status == true){
performAction();
}
});
The above code is subject to race conditions: while doSomeIOStuff is being performed, Node does not wait for it, so a new request can be served in the meantime.
You may want to check https://www.npmjs.com/package/rwlock library.
This is a bit misleading. There are many scripting languages that are supposed to be single threaded, but sharing data from the same source still creates a problem. Node.js might be single threaded when you are running a single request, but when you have multiple requests trying to access the same data, you get much the same problem as in a multithreaded language.
There is already an answer about this here : Locking on an object?
WATCH sentinel_key
GET value_of_interest
if (value_of_interest = FULL)
    MULTI
    SET sentinel_key = foo
    EXEC
    if (EXEC returned 1, i.e. succeeded)
        do_something();
    else
        do_nothing();
else
    UNWATCH
One thing you can do is lock on an external object, for instance a sequence in a database such as Oracle, or a key in Redis.
http://redis.io/commands
For example, I am using cluster with node.js (I have 4 cores) and I have a node.js function and each time I run through it, I increment a variable. I basically need to lock on that variable so no two threads use the same value of that variable.
check this out How to create a distributed lock with Redis?
and this https://engineering.gosquared.com/distributed-locks-using-redis
I think you can run with this idea if you know what you are doing.
If you are making asynchronous calls with callbacks, this means multiple clients could potentially make the same, or related requests, and receive responses in different orders. This is definitely a case where locking is useful. You won't be 'locking a thread' in the traditional sense, but merely ensuring asynchronous calls, and their callbacks are made in a predictable order. The async-lock package looks like it handles this scenario.
https://www.npmjs.com/package/async-lock
Warning: Node.js semantics change if you add a log entry, because logging is I/O bound.
If you change from
qa_action_performed = false
function handle_request() {
if (check_status() == STATUS_QA && !qa_action_performed) {
qa_action_performed = true
perform_action()
}
}
to
qa_action_performed = false
function handle_request() {
if (check_status() == STATUS_QA && !qa_action_performed) {
console.log("my log stuff");
qa_action_performed = true
perform_action()
}
}
more than one thread can execute perform_action().
You don't have to worry about synchronization with Node.js since it's single threaded with an event loop. This is one of the advantages of the architecture that Node.js uses.
Nothing will be executed between checkStatus() and performAction().
There are no locks in node.js -- because you shouldn't need them. There's only one thread (the event loop) and your code is never interrupted unless you perform an asynchronous action like I/O. Hence your code should never block. You can't do any parallel code execution.
That said, your code could look something like this:
qa_action_performed = false
function handle_request() {
if (check_status() == STATUS_QA && !qa_action_performed) {
qa_action_performed = true
perform_action()
}
}
Between check_status() and perform_action() no other thread can interrupt because there is no I/O. As soon as you enter the if clause and set qa_action_performed = true, no other code will enter the if block and hence perform_action() is never executed twice, even if perform_action() takes time performing I/O.

How to specify a timeout value on HttpWebRequest.BeginGetResponse without blocking the thread

I’m trying to issue web requests asynchronously. I have my code working fine except for one thing: there doesn’t seem to be a built-in way to specify a timeout on BeginGetResponse. The MSDN examples clearly show working code, but the downside is that they all end up with a
SomeObject.WaitOne()
Which again clearly states it blocks the thread. I will be in a high load environment and can’t have blocking but I also need to timeout a request if it takes more than 2 seconds. Short of creating and managing a separate thread pool, is there something already present in the framework that can help me?
Starting examples:
http://msdn.microsoft.com/en-us/library/ms227433(VS.100).aspx
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.begingetresponse.aspx
What I would like is a way for the async callback on BeginGetResponse() to be invoked after my timeout parameter expires, with some indication that a timeout occurred.
The seemingly obvious Timeout property is not honored on async calls.
The ReadWriteTimeout parameter doesn't come into play until the response returns.
A non-proprietary solution would be preferable.
EDIT:
Here's what I came up with: after calling BeginGetResponse, I create a Timer with my duration and that's the end of the "begin" phase of processing. Now either the request will complete and my "end" phase will be called OR the timeout period will expire.
To detect the race and have a single winner, I increment a "completed" counter in a thread-safe manner. If "timeout" is the first event to come back, I abort the request and stop the timer. In this situation, when "end" is called, EndGetResponse throws an error. If the "end" phase happens first, it increments the counter and the "timeout" forgoes aborting the request.
This seems to work like I want while also providing a configurable timeout. The downside is the extra timer object and the callbacks, which I make no effort to avoid. I see 1-3 threads processing various portions (begin, timed out, end), so it seems like this is working. And I don't have any "wait" calls.
Have I missed too much sleep or have I found a way to service my requests without blocking?
int completed = 0;
this.Request.BeginGetResponse(GotResponse, this.Request);
this.timer = new Timer(Timedout, this, TimeOutDuration, Timeout.Infinite);
private void Timedout(object state)
{
if (Interlocked.Increment(ref completed) == 1)
{
this.Request.Abort();
}
this.timer.Change(Timeout.Infinite, Timeout.Infinite);
this.timer.Dispose();
}
private void GotResponse(IAsyncResult result)
{
Interlocked.Increment(ref completed);
}
You can use a BackgroundWorker to run your HttpWebRequest on a separate thread, so your main thread stays alive. The background thread will be blocked, but the main one won't.
In this context, you can use ManualResetEvent.WaitOne() just like in the sample for the HttpWebRequest.BeginGetResponse() method.
What kind of an application is this? Is it a service process, a web application, or a console app?
How are you creating your work load (i.e requests)? If you have a queue of work that needs to be done, you can start off 'N' number of async requests (with the framework for timeouts that you have built) and then, once each request completes (either with timeout or success) you can grab the next request from the queue.
This will thus become a Producer/consumer pattern.
So, if you configure your application to have a maximum of 'N' requests outstanding, you can maintain a pool of 'N' timers that you reuse (without disposing) between the requests.
Or, alternately, you can use ThreadPool.SetTimerQueueTimer() to manage your timers. The threadpool will manage the timers for you and reuse the timer between requests.
Hope this helps.
Seems like my original approach is the best thing available.
If you can use async/await, then:
private async Task<WebResponse> getResponseAsync(HttpWebRequest request)
{
var responseTask = Task.Factory.FromAsync(request.BeginGetResponse, ar => (HttpWebResponse)request.EndGetResponse(ar), null);
var winner = await (Task.WhenAny(responseTask, Task.Delay(new TimeSpan(0, 0, 20))));
if (winner != responseTask)
{
throw new TimeoutException();
}
return await responseTask;
}
