Number of threads decreases as Parallel.Foreach loop goes on - multithreading

I have a Parallel Foreach loop which loops through a list of items, and performs some actions against them. Some of these actions take longer than others, depending on the item.
Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = 5 }, item =>
{
var subItems = item.subItems;
foreach (var subItem in subItems)
{
//do some actions for subItem
}
Console.WriteLine("Action Complete for {0}", item);
});
After a while, when there are only about 5-10 items left in the list to run, it seems that there is only 1 thread left running. This is not ideal, because some items will then be stuck behind another one to finish.
If I stop the script, and then start it again, with only the leftover 5-10 items in the list, it spins up multiple threads to do each of the items in parallel again.
How can I ensure that the other threads will keep being used, without me needing to restart the script?

The problem here is that the default partitioner is blocking the work per task up into blocks of N items. It assumes that the number of items is large and each item takes the same amount of time then you would expect that the several threads will run the last ~ N * 5 items and all finish at the same time.
However in your case this is not true. You could write your own Partitioner to use a smaller number of items per block, See Partitioner Class. This may improve performance but it the work done per item is very small then you will increase the ratio of useful work to work done managing the tasks and possibly degrade performance.
You could also write a dynamic partitioner that decreases the partition size so that the last few items are in smaller partitions, thus ensuring that you are still using all the available threads. This MSDN article covers writing custom partitioners, Custom Partitioners for PLINQ and TPL.

Related

How does Hazelcast Jet assign task-to-CPU priority?

If I have the following code and let's say I'm running on 10 nodes of 32 cores each:
IList<...> ds = ....; //large collection, eg 1e6 elements
ds
.map() //expensive computation
.flatMap()//generates 10,000x more elements for every 1 incoming element
.rebalance()
.map() //expensive computation
....//other transformations (ie can be a sink, keyby, flatmap, map etc)
What will Hazelcast do with respect to task-to-CPU assignment priority when the SECOND map operation wants to process 10,000 elements that was generated from the 1st original element? Will it devote the 320 CPU cores (from 10 nodes) to processing the 1st original element's 10,000 generated elements? If so, will it "boot off" already running tasks? Or, will it wait for already running tasks to complete, and then give priority to the 10,000 elements resulting from the output of the flatmap-rebalance operations? Or, would the 10,000 elements be forced to run on a single core, since the remaining 319 cores are already being consumed by the output of the ds operation (ie the input of the 1st map). Or, is there some random competition for who gets access to the CPU cores?
What I would ideally like to have happen is that a) Hazelcast does NOT boot off running tasks (it lets them complete), but when deciding which tasks gets priority to run on a core, it chooses the path that would lead to the lowest latency, ie it would process all 10,000 elements which result from the output of the flatmap-rebalance operation on all 320 cores.
Note: I asked a virtually identical question to Flink a few weeks ago, but have since switched to trying out Hazelcast: How does Flink (in streaming mode) assign task-to-CPU priority?
First, IList is a non-distributed data structure, all its data are stored on a single node. The IList source therefore produces all data on that node. So the 1st expensive map will be all done on that member, but map is backed, by default by as many workers as there are cores, so 32 workers in your case.
The rebalance stage will cause that the 2nd map is run on all members. Each of the 10,000 elements produced in the 1st map is handled separately, so if you have 1 element in your IList, the 10k elements produced from it will be processed concurrently by 320 workers.
The workers backing different stages of the pipeline compete for cores normally. There will be total 96 workers for the 1st map, 2nd map and for the flatMap together. Jet uses cooperative scheduling for these workers, which means it cannot preempt the computation if it's taking too long. This means that one item taking a long time to process will block other workers.
Also keep in mind that the map and flatMap functions must be cooperative, that means they must not block (by waiting on IO, sleeping, or by waiting for monitors). If they block, you'll see less than 100% CPU utilization. Check out the documentation for more information.

Using Timer for batch operations

I'm new to using Micrometer and am trying to see if there's a way to use a Timer that would also include a count of the number of items in a batch processing scenario. Since I'm processing the batch with Java streams, I didn't see an obvious way to record the timer for each item processed, so I was looking for a way to set a batch size attribute. One way I think that could work is to use the FunctionTimer from https://micrometer.io/docs/concepts#_function_tracking_timers, but I believe that requires the app to maintain a persistent monotonically increasing set of values for the total count and total time.
Is there a simpler way this can be done? Ultimately this data will be fed to New Relic. I've also tried setting tags for the batch size, but those seem to be reported as strings so I can't do any type of aggregation on the values.
Thanks!
A timer is intended for measuring an action and at a minimum results in two measurements: a count and a duration.
So a timer will work perfectly for your batch processing. In the the java stream, a peek operation might be a good place to put a timer.
If you were about to process 20 elements and you were just measuring the time for all 20 elements, you would need to create a new Counter for measuring the batch size. You could them divide the timer's total duration against your counter to get a per-item duration or divide it against the timer's total count to get a per-batch duration.
Feel free to add code snippets if you would like feedback for those.

LMAX Distruptor Partition and join batch

So currently I have a Executor implementation with blocking queue and the implementation specific is like, I have list of items per request and I divide them into partitions each partition is then computed and finally they are joined to have the final list.
How do I go about implementing it in LMAX? I see that once I have partition and push them into RingBuffer, each partition is treated as separate item so I am custom joining them.
something like,
ConcurrentHashMap<Long, LongAdder> map = new ConcurrentHashMap<>();
#Override
public List<SomeTask> score(final List<SomeTask> tasks) {
long id = tasks.get(0).id;
map.put(id, new LongAdder());
for (SomeTask task : tasks) {
producer.onData(task);
}
while (map.get(id).intValue() != tasks.size()) ;
map.remove(id);
return tasks;
}
Is there a clean way to do it ? I looked at https://github.com/LMAX-Exchange/disruptor/tree/master/src/test/java/com/lmax/disruptor/example and KeyedBatching specifically but they seem to batch and execute on one thread.
Currently for me each partition takes up around 200ms and I wanted to parallel execute them.
Any help is greatly appreciated.
I think you should take a look at the worker-pool options and followed by a final event processor which re-combines the shards.

Task server on ML

I have a query that may return up to 2000 documents.
Within these documents I need six pcdata items return as string values.
There is a possiblity, since the documents size range from small to very large,
exp tree cache error.
I am looking at spawn-function to break up my result set.
I will pass wildcard values, based on known "unique key structure", and will know the max number of results possible;each wildcard values will return 100 documents max.
Note: The pcdata for the unique key structure does have a range index on it.
Am I on the right track with below?
The task server will create three tasks.
The task server will allow multiple queries to run, but what stops them all running simultaneously and blowing out the exp tree cache?
i.e. What, if anything, forces one thread to wait for another? Or one task to wait for another so they all do not blow out the exp tree cache together?
xquery version "1.0-ml";
let $messages :=
(:each wildcard values will return 100 documents max:)
for $message in ("WILDCARDVAL1","WILDCARDVAL2", "WILDCARDVAL3")
let $_ := xdmp:log("Starting")
return
xdmp:spawn-function(function() {
let $_ := xdmp:sleep(5000)
let $_ := xdmp:log(concat("Searching on wildcard val=", $message))
return concat("100 pcdata items from the matched documents for ", $message) },
<options xmlns="xdmp:eval">
<result>true</result>
<transaction-mode>update-auto-commit</transaction-mode>
</options>)
return $messages
The Task Server configuration listed in the Admin UI defines the maximum number of simultaneous threads. If more tasks are spawned than there are threads, they are queued (FIFO I think, although ML9 has task priority options that modify that behavior), and the first queued task takes the next available thread.
The <result>true</result> option will force the spawning query to block until the tasks return. The tasks themselves are run independently and in parallel, and they don't wait on each other to finish. You may still run into problems with the expanded tree cache, but by splitting up the query into smaller ones, it could be less likely.
For a better understanding of why you are blowing out the cache, take a look at the functions xdmp:query-trace() and xdmp:query-meters(). Using the Task Server is more of a brute force solution, and you will probably get better results by optimizing your queries using information from those functions.
If you can't make your query more selective than 2000 documents, but you only need a few string values, consider creating range indexes on those values and using cts:values to select only those values directly from the index, filtered by the query. That method would avoid forcing the database to load documents into the cache.
It might be more efficient to use MarkLogic's capability to return co-occurrences, or even 3+ tuples of value combinations from within documents using functions like cts:values. You can blend in a (cts:uri-reference](http://docs.marklogic.com/cts:uri-reference) to get the document uri returned as part of the tuples.
It requires having range indexes on all those values though..
HTH!

Parallel.ForEach in C# when number of iterations is unknown

I have TPL (Task Parallel Library) code for executing a loop in parallel in C# in a class library project using .Net 4.0. I am new to TPL in C# and had following questions .
CODE Background:
In the code that appears just after the questions, I am getting all unprocessed batches and then processing each batch one at a time. Each batch can be processed independently since there are no dependencies between batches, but for each batch the sequence of steps is very important when processing it.
My questions are:
Will using Parallel.ForEach be advisable in this scenario where the number of batches and therefore the number of iterations could be very small or very large like 10,000 batches? I am afraid that with too many batches, using parallelism might cause more harm than good in this case.
When using Parallel.ForEach is the sequence of steps in ProcessBatch method guaranteed to execute in the same order as step1, step2, step3 and then step4?
public void ProcessBatches() {
List < Batch > batches = ABC.Data.GetUnprocessesBatches();
Parallel.ForEach(batches, batch = > {
ProcessBatch(batch);
});
}
public void ProcessBatch(Batch batch) {
//step 1
ABC.Data.UpdateHistory(batch);
//step2
ABC.Data.AssignNewRegions(batch);
//step3
UpdateStatus(batch);
//step4
RemoveBatchFromQueue(batch);
}
UPDATE 1:
From the accepted answer, the number of iterations is not an issue even when its large. In fact according to an article at this url: Potential Pitfalls in Data and Task Parallelism, performance improvements with parallelism will likely occur when there are many iterations and for fewer iterations parallel loop is not going to provide any benefits over a sequential/synchronous loop.
So it seems having a large number of iterations in the loop is the best situation for using Parallel.ForEach.
The basic rule of thumb is that parallel loops that have few iterations and fast user delegates are unlikely to speedup much.
Parallel foreach will us the appropriate number of threads for the hardware you are running on. So you don't need to worry about too many batches causing harm
The steps will run in order for each batch. ProcessBatch will get called on different threads for different batches but for each batch the steps will get executed in the order they are defined in that method

Resources