Parallel.ForEach in C# when number of iterations is unknown - c#-4.0

I have TPL (Task Parallel Library) code for executing a loop in parallel in C#, in a class library project using .NET 4.0. I am new to TPL in C# and had the following questions.
Code background:
In the code that appears just after the questions, I am getting all unprocessed batches and then processing each batch one at a time. Each batch can be processed independently since there are no dependencies between batches, but for each batch the sequence of steps is very important when processing it.
My questions are:
Will using Parallel.ForEach be advisable in this scenario, where the number of batches and therefore the number of iterations could be very small or very large, like 10,000 batches? I am afraid that with too many batches, using parallelism might cause more harm than good.
When using Parallel.ForEach, is the sequence of steps in the ProcessBatch method guaranteed to execute in the same order: step 1, step 2, step 3, and then step 4?
public void ProcessBatches() {
    List<Batch> batches = ABC.Data.GetUnprocessesBatches();
    Parallel.ForEach(batches, batch => {
        ProcessBatch(batch);
    });
}

public void ProcessBatch(Batch batch) {
    // step 1
    ABC.Data.UpdateHistory(batch);
    // step 2
    ABC.Data.AssignNewRegions(batch);
    // step 3
    UpdateStatus(batch);
    // step 4
    RemoveBatchFromQueue(batch);
}
UPDATE 1:
From the accepted answer, the number of iterations is not an issue even when it is large. In fact, according to the article Potential Pitfalls in Data and Task Parallelism, performance improvements from parallelism are most likely when there are many iterations; for only a few iterations, a parallel loop is not going to provide any benefit over a sequential/synchronous loop.
So it seems that a large number of iterations in the loop is actually the best situation for using Parallel.ForEach.
The basic rule of thumb is that parallel loops that have few iterations and fast user delegates are unlikely to speed up much.

Parallel.ForEach will use an appropriate number of threads for the hardware you are running on, so you don't need to worry about a large number of batches causing harm.
The steps will run in order for each batch. ProcessBatch will be called on different threads for different batches, but within each batch the steps will execute in the order they are defined in that method.
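If the real concern is resource pressure (for example, every batch hitting the same database through ABC.Data) rather than the raw batch count, one common option is to cap the concurrency explicitly. The sketch below is only illustrative and builds on the question's own code; ParallelOptions.MaxDegreeOfParallelism is a standard TPL option, but the limit of 4 is an arbitrary example value, not a recommendation.

// Optional: cap how many batches are processed concurrently, e.g. to
// avoid overloading the database. The value 4 is only an example.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(batches, options, batch => {
    // Steps inside ProcessBatch still run sequentially for each batch.
    ProcessBatch(batch);
});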

Related

How to reduce white space in the task stream?

I have obtained the task stream using distributed computing in Dask for different numbers of workers. I can see that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computation per worker), I see the same trend. Can anyone suggest how to reduce the white space?
PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.
(Task stream screenshots attached for 16, 32, and 64 workers.)
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can happen for many reasons. I'll list a few below:
Very short tasks (sub millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images, I don't think any of these apply to you.
Instead, I see periods of inactivity followed by periods of activity. My guess is that this is caused by some code that you are running locally, and that your code looks something like the following:
for i in ...:
    results = dask.compute(...)  # do some dask work
    next_inputs = ...            # do some local work
So you're being blocked by doing some local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or maybe it's the fault of your code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html

Controlling the Max Degree of Parallelism in the Fan-Out/Fan-In pattern in Durable Functions

Is there a way to control the maximum degree of parallelism when implementing the fan out/fan in pattern on Azure Durable Functions?
I'm currently implementing this pattern to perform a data loading process, but I'm hitting database limits on DTUs because the number of operations is too high for the database to handle.
The solution I'm thinking about involves using the following properties of the host.json file:
maxConcurrentActivityFunctions
maxConcurrentOrchestratorFunctions
in conjunction with:
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT.
The first 2 properties should limit the number of parallel function executions per host, and the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT should limit the number of hosts.
Would this be a correct approach for limiting the max degree of parallelism?
The approach you describe is the best way to limit max parallelism globally. Just be careful that, at the time of writing, WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT is not always reliable; later in the year the team plans to make it a fully reliable setting.
This may not apply to your case, but for the benefit of other readers who find this question I've added one more thing to consider. If you have a single orchestration, one simple C# technique you can use to limit max parallelism is to do something like this:
static async Task FanOutFanInAsync(
    DurableOrchestrationContext ctx,
    string functionName,
    object[] workItems,
    int maxParallelism)
{
    var inFlight = new HashSet<Task>();
    foreach (var item in workItems)
    {
        // When we reach the limit, wait for one in-flight activity to
        // finish before scheduling the next one.
        if (inFlight.Count >= maxParallelism)
        {
            Task finished = await Task.WhenAny(inFlight);
            inFlight.Remove(finished);
        }
        inFlight.Add(ctx.CallActivityAsync(functionName, item));
    }

    // Wait for the remaining activities to complete.
    await Task.WhenAll(inFlight);
}
This will allow you to limit how many activity functions you fan out to at a single time for a single orchestration instance.
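For readers who want to see where a helper like this would be called from, here is a hedged sketch of a Durable Functions 1.x orchestrator using it. The function names "LoadDataOrchestration" and "ProcessItem", the input shape, and the limit of 10 are assumptions for illustration, not part of the original answer; the helper above is assumed to live in the same class.

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class LoadDataOrchestration
{
    [FunctionName("LoadDataOrchestration")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext ctx)
    {
        // Hypothetical input: the items to load into the database.
        object[] workItems = ctx.GetInput<object[]>();

        // Fan out to the (hypothetical) "ProcessItem" activity, never
        // running more than 10 activities at once for this instance.
        await FanOutFanInAsync(ctx, "ProcessItem", workItems, maxParallelism: 10);
    }
}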

Using Java 8 parallelStream inside Spark mapParitions

I am trying to understand the behavior of Java 8 parallel streams inside Spark parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size, but that's not the case; I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
    List listOfThings = IteratorUtils.toList(iterator);
    return listOfThings.parallelStream().map(
        // some stuff here
    ).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
It's a very good question.
What's going on here is a race condition. When you parallelize the stream, it splits the full list into several roughly equal parts (based on the available threads and the size of the list) and then tries to process each part independently on its own thread.
But you are also using Apache Spark, a general-purpose computation engine known for doing this work fast, and Spark uses the same approach (parallelizing the work) to perform the action.
In this scenario, Spark has already parallelized the whole job, and inside each partition you are parallelizing again; that is where the race condition starts. The Spark executor begins processing the work, you parallelize it further, and the stream's work is picked up by other threads. If a thread processing stream work finishes before the Spark executor completes its own work, its result is added; otherwise the Spark executor reports its result to the master without it.
Re-parallelizing the work like this is not a good approach and will keep causing you pain; let Spark handle the parallelism for you.
Hope this makes clear what is going on here.
Thanks

Conditional iteration in a single job in Apache Spark

I am working on an iterative algorithm using Apache Spark, which claims to be perfect for just that. The examples I have found so far create a single job with a hardcoded number of iterations. I need the algorithm to run until a certain condition is met.
My current implementation launches a new job for each iteration something like this:
var data = sc.textFile(...).map(...).cache()
while (data.filter(...).isEmpty()) {
  // Run the algorithm (also handles caching)
  data = performStep(data)
}
This is pretty inefficient. Between each iteration I wait a long time for the next job to start. With four servers I wait around 10 seconds between jobs; with 32 servers it is almost 100 seconds. In total I end up spending at least half of the runtime waiting between jobs.
I find conditional iteration quite common in certain types of algorithms, for example early stopping criteria in machine learning, so I am hoping this can be improved.
Is there a more efficient way of doing this? For example, a way to run this conditional repetition within a single job? Thanks!

Number of threads decreases as Parallel.Foreach loop goes on

I have a Parallel Foreach loop which loops through a list of items, and performs some actions against them. Some of these actions take longer than others, depending on the item.
Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = 5 }, item =>
{
    var subItems = item.subItems;
    foreach (var subItem in subItems)
    {
        // do some actions for subItem
    }
    Console.WriteLine("Action Complete for {0}", item);
});
After a while, when there are only about 5-10 items left in the list to run, it seems that there is only 1 thread left running. This is not ideal, because some items will then be stuck behind another one to finish.
If I stop the script, and then start it again, with only the leftover 5-10 items in the list, it spins up multiple threads to do each of the items in parallel again.
How can I ensure that the other threads will keep being used, without me needing to restart the script?
The problem here is that the default partitioner chunks the work into blocks of N items per task. It assumes that the number of items is large and that each item takes about the same amount of time; under that assumption, the threads would each pick up one of the final blocks (the last ~N * 5 items) and all finish at roughly the same time.
In your case, however, that is not true. You could write your own Partitioner to use a smaller number of items per block; see the Partitioner class. This may improve throughput, but if the work done per item is very small you will increase the proportion of time spent managing tasks relative to doing useful work, and possibly degrade performance.
You could also write a dynamic partitioner that decreases the partition size so that the last few items are in smaller partitions, thus ensuring that you are still using all the available threads. This MSDN article covers writing custom partitioners, Custom Partitioners for PLINQ and TPL.
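As a rough illustration (not from the original answer), the built-in load-balancing partitioner for lists hands out items in small chunks on demand, which often avoids the tail-end starvation described above without writing a custom partitioner. The list and the work inside the loop are taken from the question; everything else is a sketch.

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Wrap the list in a load-balancing partitioner so idle threads keep
// pulling small chunks of work instead of being handed large fixed
// ranges up front.
var partitioner = Partitioner.Create(list, loadBalance: true);

Parallel.ForEach(
    partitioner,
    new ParallelOptions { MaxDegreeOfParallelism = 5 },
    item =>
    {
        foreach (var subItem in item.subItems)
        {
            // do some actions for subItem
        }
    });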
