Collecting results asynchronously from GPars parallel executor - Groovy

We've got some code in Java using ThreadPoolExecutor and CompletionService. Tasks are submitted in large batches to the pool; results go to the completion service, where we collect completed tasks as they become available without waiting for the entire batch to complete:
ThreadPoolExecutor _executorService =
    new ThreadPoolExecutor(MAX_NUMBER_OF_WORKERS, MAX_NUMBER_OF_WORKERS,
        0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(20));
CompletionService _completionService =
    new ExecutorCompletionService(_executorService);
//submit tasks
_completionService.submit(someTask);
//get results
while (...) {
    Future result = _completionService.poll(timeout, TimeUnit.MILLISECONDS);
    if (result != null) {
        //process result
    }
}
The total number of workers in the pool is MAX_NUMBER_OF_WORKERS; tasks submitted without an available worker are queued; up to 20 tasks may be queued, after which tasks are rejected.
What is the GPars counterpart to this approach?
Reading the documentation on GPars parallelism, I found many potential options: collectManyParallel(), anyParallel(), fork/join, etc., and I'm not sure which ones to even test. I was hoping the docs would mention "completion" or a "completion service" as a comparison, but I found nothing. I'm looking for some direction/pointers on where to start from those experienced with GPars.

Collecting results on the fly while throttling the producers - this calls for a dataflow solution. Please find a sample runnable demo below:
import groovyx.gpars.dataflow.DataflowQueue
import groovyx.gpars.group.DefaultPGroup
import groovyx.gpars.scheduler.DefaultPool

import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

int MAX_NUMBER_OF_WORKERS = 10

ThreadPoolExecutor _executorService =
        new ThreadPoolExecutor(MAX_NUMBER_OF_WORKERS, MAX_NUMBER_OF_WORKERS,
                1000, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(200))

final group = new DefaultPGroup(new DefaultPool(_executorService))
final results = new DataflowQueue()

//submit tasks
30.times { value ->
    group.task(new Runnable() {
        @Override
        void run() {
            println 'Starting ' + Thread.currentThread()
            sleep 5000
            println 'Finished ' + Thread.currentThread()
            results.bind(value)  //publish the result as soon as this task is done
        }
    })
}
group.task {
    results << -1  //poison pill to stop the consumer eventually
}

//get results
while (true) {
    def result = results.val  //blocks until the next value is available
    println result
    if (result == -1) break
    //process result
}
group.shutdown()

Related

How to reject new tasks in newSingleThreadExecutor when another task is running

The actual scenario is: an Executors.newSingleThreadExecutor instance is executing a long-running task, and I need to reject new requests for this task until the existing one completes. While rejecting new requests, I simply need to send a message: "Thread is already running a task! Please wait until it completes".
Is it possible to implement this using newSingleThreadExecutor? Can anyone please help me?
The factory Executors.newSingleThreadExecutor() returns the equivalent of new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()), except that it is wrapped in another ExecutorService to prevent other code from casting it to ThreadPoolExecutor and changing the configuration. The wrapper also adds finalization support, which you should not rely on anyway.
So you can construct a similar executor and alter its setup to your needs.
ExecutorService es = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
    new SynchronousQueue<>());
A SynchronousQueue has no capacity; it can only hand elements over to already-waiting consumers, i.e. it will accept a Runnable only when there is already an idle worker thread. When the queue rejects the new job, the ThreadPoolExecutor checks its configured thread count and either starts a new worker thread (at most one here) or calls into a RejectedExecutionHandler. The default handler throws a RejectedExecutionException, so we're basically done with your requirements here.
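To see the hand-over behavior in isolation, here is a tiny illustration (my own sketch, not part of the original answer):
import java.util.concurrent.SynchronousQueue;

SynchronousQueue<Runnable> queue = new SynchronousQueue<>();
// No consumer is currently waiting to take an element, so the hand-over
// fails immediately instead of buffering the Runnable:
System.out.println(queue.offer(() -> {}));   // prints: false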
The fine-tuning we can do is to change the message of the RejectedExecutionException by providing our own RejectedExecutionHandler:
ExecutorService es = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
    new SynchronousQueue<>(),
    (runnable, executor) -> {
        throw new RejectedExecutionException(
            "Thread is already running a task! Please wait until it completes");
    });
When the amount of code with access to the ExecutorService is rather small and you can trust it not to do things like casting es back to ThreadPoolExecutor and messing around, you can keep it this way. Otherwise, you can protect it against such modifications by wrapping it:
ExecutorService es = Executors.unconfigurableExecutorService(
    new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>(),
        (runnable, executor) -> {
            throw new RejectedExecutionException(
                "Thread is already running a task! Please wait until it completes");
        }));
Now we are as close as we can get to the behavior of Executors.newSingleThreadExecutor() plus your deviations, without implementing our own executor. The only thing this executor doesn't have is finalization support, but as this bug report suggests, it's not a good idea to have it anyway. Take care to invoke shutdown() on it at the end of its lifetime (unless it overlaps with a call to System.exit(…) anyway).
The behavior can be tested with a program like this:
Future<?> previous = null;
for (int i = 0; i < 100; i++) {
    int jobID = i;
    System.out.println(jobID + " Trying to submit job");
    try {
        Future<?> next = es.submit(() -> {
            Thread.sleep(200);
            System.out.println("job " + jobID);
            return null;
        });
        if (previous != null && !previous.isDone()) {
            throw new AssertionError("new job accepted before previous completed");
        }
        previous = next;
    } catch (RejectedExecutionException ex) {
        System.out.println("rejected: " + ex);
    }
    Thread.sleep(ThreadLocalRandom.current().nextInt(90, 190));
}
This program attempts to submit multiple jobs and will throw an AssertionError if a job is accepted while the previous is not completed.
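As a follow-up to the shutdown() advice above, one common shutdown pattern looks like this (a sketch, assuming es is the executor built above and a five-second grace period; both are illustrative choices):
es.shutdown();                                    // stop accepting new tasks
if (!es.awaitTermination(5, TimeUnit.SECONDS)) {  // give running tasks time to finish
    es.shutdownNow();                             // then interrupt whatever is left
}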

Batch up requests in Groovy?

I'm new to Groovy and am a bit lost on how to batch up requests so they can be submitted to a server as a batch, instead of individually, as I currently have:
class Handler {
    private String jobId
    // [...]
    void submit() {
        // [...]
        // client is a single instance of Client used by all Handlers
        jobId = client.add(args)
    }
}

class Client {
    //...
    String add(String args) {
        response = postJson(args)
        return parseIdFromJson(response)
    }
}
As it is now, something calls Client.add(), which POSTs to a REST API and returns a parsed result.
The issue I have is that the add() method is called maybe thousands of times in quick succession, and it would be much more efficient to collect all the args passed in to add(), wait until there's a moment when the add() calls stop coming in, and then POST to the REST API a single time for that batch, sending all the args in one go.
Is this possible? Potentially, add() can return a fake id immediately, as long as the batching occurs, the submit happens, and Client can later map each fake id to the real ID coming from the REST API (which will return IDs in the order corresponding to the args sent to it).
As mentioned in the comments, this might be a good case for GPars, which is excellent at these kinds of scenarios.
This really is less about Groovy and more about asynchronous programming in Java and on the JVM in general.
If you want to stick with the Java concurrency idioms, I threw together a code snippet you could use as a potential starting point. It has not been tested and edge cases have not been considered. I wrote it up for fun, and since this is asynchronous programming and I haven't spent the appropriate time thinking about it, I suspect there are holes in there big enough to drive a tank through.
That being said, here is some code which makes an attempt at batching up the requests:
import java.util.concurrent.*
import java.util.concurrent.locks.*

// test code
def client = new Client()
client.start()

def futureResponses = []
1000.times {
    futureResponses << client.add(it as String)
}

client.stop()

futureResponses.each { futureResponse ->
    // resolve the future...will wait if the batch has not completed yet
    def response = futureResponse.get()
    println "received response with index ${response.responseIndex}"
}
// end of test code

class FutureResponse extends CompletableFuture<String> {
    String args
}

class Client {
    int minMillisLullToSubmitBatch = 100
    int maxBatchSizeBeforeSubmit = 100
    int millisBetweenChecks = 10
    long lastAddTime = Long.MAX_VALUE

    def batch = []
    def lock = new ReentrantLock()
    boolean running = true

    def start() {
        running = true
        Thread.start {
            while (running) {
                checkForSubmission()
                sleep millisBetweenChecks
            }
        }
    }

    def stop() {
        running = false
        // flush anything still pending so no future is left incomplete
        withLock {
            if (batch) submitBatch()
        }
    }

    def withLock(Closure c) {
        try {
            lock.lock()
            c.call()
        } finally {
            lock.unlock()
        }
    }

    FutureResponse add(String args) {
        def future = new FutureResponse(args: args)
        withLock {
            batch << future
            lastAddTime = System.currentTimeMillis()
        }
        future
    }

    def checkForSubmission() {
        withLock {
            // only submit when there is actually something batched
            if (batch && (System.currentTimeMillis() - lastAddTime > minMillisLullToSubmitBatch ||
                    batch.size() > maxBatchSizeBeforeSubmit)) {
                submitBatch()
            }
        }
    }

    def submitBatch() {
        // here you would need to put the combined args in a format
        // suitable for the endpoint you are calling. In this
        // example we are just creating a list containing the args
        def combinedArgs = batch.collect { it.args }

        // further, there needs to be a way to map one specific set of
        // args in the combined args to a specific response. If the
        // endpoint responds in the same order as the args were submitted,
        // then that can be used; otherwise something else, like an id
        // in the response, would need to be figured out. Here we just
        // assume responses are returned in the order args were submitted
        List<String> combinedResponses = postJson(combinedArgs)

        combinedResponses.indexed().each { index, response ->
            // here the FutureResponse gets a value, which can be retrieved
            // with futureResponse.get()
            batch[index].complete(response)
        }

        // clear the batch
        batch = []
    }

    // bogus method to fake a post
    def postJson(combinedArgs) {
        println "posting json with batch size: ${combinedArgs.size()}"
        combinedArgs.collect { [responseIndex: it] }
    }
}
A few notes:
- Something needs to be able to react to the fact that there were no calls to add for a while. This implies a separate monitoring thread, which is what the start and stop methods manage.
- If we have an infinite sequence of adds without pauses, we might run out of resources. Therefore the code has a max batch size: it will submit the batch even if there is no lull in the calls to add.
- The code uses a lock to try to make sure (as mentioned above, I have not considered all potential issues here) that we stay thread-safe during batch submissions etc.
- Assuming the general idea here is sound, you are left with implementing the logic in submitBatch, where the main problem is mapping specific args to specific responses.
- CompletableFuture is a Java 8 class. This can be solved using other constructs in earlier releases, but I happened to be on Java 8.
- I more or less wrote this without executing or testing it; I'm sure there are some mistakes in there.
- As can be seen in the printout below, the "maxBatchSizeBeforeSubmit" setting is more a recommendation than an actual max. Since the monitoring thread sleeps for some time and then wakes up to check how we are doing, the threads calling the add method might have accumulated any number of requests in the batch. All we are guaranteed is that every millisBetweenChecks we will wake up, check how we are doing, and submit the batch if the criteria for submitting have been reached.
- If you are unfamiliar with Java Futures and locks, I would recommend you read up on them.
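If Futures are new to you, here is a minimal illustration of the pattern used above (my own example, plain Java): one thread completes a CompletableFuture while another blocks in get():
import java.util.concurrent.CompletableFuture;

CompletableFuture<String> response = new CompletableFuture<>();
// another thread completes the future later, just like submitBatch() does above
new Thread(() -> response.complete("hello")).start();
System.out.println(response.get()); // blocks until complete() is called, then prints "hello"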
If you save the full Client code above in a Groovy script code.groovy and run it:
~> groovy code.groovy
posting json with batch size: 153
posting json with batch size: 234
posting json with batch size: 243
posting json with batch size: 370
received response with index 0
received response with index 1
received response with index 2
...
received response with index 998
received response with index 999
~>
it should work and print out the "responses" received from our fake json submissions.
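One detail left open is the question's "fake id" idea: handing back a placeholder immediately and resolving it to the real REST id later. A minimal sketch of that mapping (illustrative only; the class and method names are hypothetical, not part of the answer above):
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class IdMapper {
    // placeholder id -> future that will eventually hold the real id from the REST API
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // called from add(): returns a fake id immediately
    String register(CompletableFuture<String> realId) {
        String fakeId = UUID.randomUUID().toString();
        pending.put(fakeId, realId);
        return fakeId;
    }

    // later lookup: blocks until the batch containing this entry has been submitted
    String resolve(String fakeId) throws Exception {
        return pending.get(fakeId).get();
    }
}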

How to get multiple async results within a given timeout with GPars?

I'd like to retrieve multiple "costly" results using parallel processing but within a specific timeout.
I'm using GPars Dataflow.task but it looks like I'm missing something, as the process returns only when all dataflow variables are bound.
def timeout = 500
def mapResults = [:]
GParsPool.withPool(3) {
    def taskWeb1 = Dataflow.task {
        mapResults.web1 = new URL('http://web1.com').getText()
    }.join(timeout, TimeUnit.MILLISECONDS)
    def taskWeb2 = Dataflow.task {
        mapResults.web2 = new URL('http://web2.com').getText()
    }.join(timeout, TimeUnit.MILLISECONDS)
    def taskWeb3 = Dataflow.task {
        mapResults.web3 = new URL('http://web3.com').getText()
    }.join(timeout, TimeUnit.MILLISECONDS)
}
I did see in the GPars Timeouts doc a way to use Select to get the fastest result within the timeout.
But I'm looking for a way to retrieve as many results as possible in the given time frame.
Is there a better "GPars" way to achieve this?
Or with Java 8 Future/Callable ?
Since you're interested in Java 8 based solutions too, here's a way to do it:
int timeout = 250;
ExecutorService executorService = Executors.newFixedThreadPool(3);
try {
    Map<String, CompletableFuture<String>> map =
        Stream.of("http://google.com", "http://yahoo.com", "http://bing.com")
            .collect(
                Collectors.toMap(
                    // the key will be the URL
                    Function.identity(),
                    // the value will be the CompletableFuture with the text fetched from the url
                    (url) -> CompletableFuture.supplyAsync(
                        () -> readUrl(url, timeout),
                        executorService
                    )
                )
            );

    // note: without a preceding shutdown() this simply waits out the timeout
    // and returns false - it is used here as a bounded sleep
    executorService.awaitTermination(timeout, TimeUnit.MILLISECONDS);

    //print the resulting map, cutting the text at 100 chars
    map.entrySet().stream().forEach(entry -> {
        CompletableFuture<String> future = entry.getValue();
        boolean completed = future.isDone()
            && !future.isCompletedExceptionally()
            && !future.isCancelled();
        System.out.printf("url %s completed: %s, error: %s, result: %.100s\n",
            entry.getKey(),
            completed,
            future.isCompletedExceptionally(),
            completed ? future.getNow(null) : null);
    });
} catch (InterruptedException e) {
    //rethrow
} finally {
    executorService.shutdownNow();
}
This will give you as many Futures as you have URLs, and it gives you the opportunity to see whether any of the tasks failed with an exception. The code can be simplified if you're not interested in these exceptions, only in the contents of the successful retrievals:
int timeout = 250;
ExecutorService executorService = Executors.newFixedThreadPool(3);
try {
    Map<String, String> map = Collections.synchronizedMap(new HashMap<>());
    Stream.of("http://google.com", "http://yahoo.com", "http://bing.com")
        .forEach(url -> {
            CompletableFuture
                .supplyAsync(
                    () -> readUrl(url, timeout),
                    executorService
                ).thenAccept(content -> map.put(url, content));
        });
    executorService.awaitTermination(timeout, TimeUnit.MILLISECONDS);

    //print the resulting map, cutting the text at 100 chars
    map.entrySet().stream().forEach(entry -> {
        System.out.printf("url %s completed, result: %.100s\n",
            entry.getKey(), entry.getValue());
    });
} catch (InterruptedException e) {
    //rethrow
} finally {
    executorService.shutdownNow();
}
Both snippets will wait for about 250 milliseconds (only a tiny bit more, because of the submission of the tasks to the executor service) before printing the results. I found that about 250 milliseconds is the threshold at which some of these URLs can be fetched on my network, but not necessarily all of them. Feel free to adjust the timeout to experiment.
For the readUrl(url, timeout) method you could use a utility library like Apache Commons IO. The tasks submitted to the executor service will get an interrupt signal even if you don't explicitly take the timeout parameter into account. I could provide an implementation for that, but I believe it's out of scope for the main issue in your question.
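For reference, here is one possible readUrl sketch; this is an assumption on my part, since the answer deliberately leaves the implementation out of scope. It uses only the JDK and applies the timeout to both the connect and the read phase:
static String readUrl(String url, int timeoutMillis) {
    try {
        java.net.URLConnection connection = new java.net.URL(url).openConnection();
        connection.setConnectTimeout(timeoutMillis); // fail fast when the host is unreachable
        connection.setReadTimeout(timeoutMillis);    // fail when the body stalls mid-transfer
        try (java.util.Scanner scanner =
                 new java.util.Scanner(connection.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    } catch (java.io.IOException e) {
        throw new java.io.UncheckedIOException(e);
    }
}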

Task is ignoring Thread.Sleep

I'm trying to grasp the TPL.
Just for fun I tried to create some Tasks with a random sleep to see how they were processed. I was targeting a fire-and-forget pattern.
static void Main(string[] args)
{
    Console.WriteLine("Demonstrating a successful transaction");
    Random d = new Random();
    for (int i = 0; i < 10; i++)
    {
        var sleep = d.Next(100, 2000);
        Action<int> succes = (int x) =>
        {
            Thread.Sleep(x);
            Console.WriteLine("sleep={2}, Task={0}, Thread={1}: Begin successful transaction",
                Task.CurrentId, Thread.CurrentThread.ManagedThreadId, x);
        };
        Task t1 = Task.Factory.StartNew(() => succes(sleep));
    }
    Console.ReadLine();
}
But I don't understand why it outputs all the lines to the console, seemingly ignoring the Sleep(random).
Can someone explain that to me?
Important:
The TPL default TaskScheduler does not guarantee a thread per task - one thread can be used to process several tasks.
Calling Thread.Sleep might impact other tasks' performance.
You can construct your task with the TaskCreationOptions.LongRunning hint; this way the TaskScheduler will assign a dedicated thread to the task and it will be safe to block on it.
If your code passes the loop variable i to Thread.Sleep instead of the generated random number, it does not ignore the sleep but rather sleeps between 0 and 10 ms on each iteration.
Try:
Thread.Sleep(sleep);
The statement
Task t1 = Task.Factory.StartNew(() => succes(sleep));
creates the Task and automatically starts it, and the for loop then iterates again without waiting for the task to finish. So by the time the second task is created and started, the first one may or may not have finished; in other words, you are not waiting for the tasks to end.
You should try
Task t1 = Task.Factory.StartNew(() => succes(sleep));
t1.Wait();

Parallel Task advice

I am trying to use the Task Parallel Library to kick off a number of tasks like this:
var workTasks = _schedules.Where(x => x.Task.Enabled);
_tasks = new Task[workTasks.Count()];
_cancellationTokenSource = new CancellationTokenSource();
_cancellationTokenSource.Token.ThrowIfCancellationRequested();

int i = 0;
foreach (var schedule in _schedules.Where(x => x.Task.Enabled))
{
    _log.InfoFormat("Reading task information for task {0}", schedule.Task.Name);
    if (!schedule.Task.Enabled)
    {
        _log.InfoFormat("task {0} disabled.", schedule.Task.Name);
        i++;
        continue;
    }
    schedule.Task.ServiceStarted = true;
    _tasks[i] = Task.Factory.StartNew(() =>
        schedule.Task.Run()
        , _cancellationTokenSource.Token);
    i++;
    _log.InfoFormat("task {0} has been added to the worker threads and has been started.", schedule.Task.Name);
}
I want these tasks to sleep and then wake up every 5 minutes and do their stuff. At the moment I am using Thread.Sleep in the Schedule object, whose Run method is the Action passed into StartNew as an argument, like this:
_tasks[i] = Task.Factory.StartNew(() =>
    schedule.Task.Run()
    , _cancellationTokenSource.Token);
I read somewhere that Thread.Sleep is a bad solution for this. Can anyone recommend a better approach?
By my understanding, Thread.Sleep is generally bad because it blocks the thread outright even when that isn't necessary. It won't be a big deal in most cases, but it could be a performance issue.
I'm in the habit of using this snippet instead:
new System.Threading.EventWaitHandle(false, EventResetMode.ManualReset).WaitOne(1000);
It fits on one line and isn't overly complicated: it creates an event handle that will never be set, and then waits for the full timeout period before continuing.
Anyway, if you're just trying to have something repeat every 5 minutes, a better approach would probably be to use a Timer. You could even make a class to neatly wrap everything if your repeated work methods are already factored out:
using System;
using System.Timers;

public class WorkRepeater
{
    readonly Timer m_Timer;

    public WorkRepeater(Action workToRepeat, TimeSpan interval)
    {
        m_Timer = new Timer(interval.TotalMilliseconds);
        // run the supplied work on every timer tick
        m_Timer.Elapsed += new ElapsedEventHandler((o, ea) => workToRepeat());
    }

    public void Start()
    {
        m_Timer.Start();
    }

    public void Stop()
    {
        m_Timer.Stop();
    }
}
Tasks are a bad solution here. A Task should be used for short-lived operations, like asynchronous IO. If you want to control the lifetime of the work yourself, you should use a Thread and sleep as much as you like, because the Thread is yours alone, while Tasks are rotated on a thread pool which is shared.
