Application design for parallel collection processing - multithreading

I'm experimenting with the System.Collections.Concurrent namespace but I have a problem implementing my design.
My input queue (ConcurrentQueue) is getting populated fine from a Thread which is doing some I/O at startup to read and parse.
Next I kick off a Parallel.ForEach() on the input queue. I'm doing some I/O bound work on each item.
A log item is created for each item processed in the ForEach() and is dropped into a result queue.
What I would like to do is kick off the logging as soon as I start reading the input, because I may not be able to fit all of the log items in memory. What is the best way to wait for items to land in the result queue? Are there design patterns or examples that I should be looking at?

I think the pattern you're looking for is the producer/consumer pattern. More specifically, you can have a producer/consumer implementation built around TPL and BlockingCollection.
The main concepts you want to read about are:
Task,
BlockingCollection,
TaskFactory.ContinueWhenAll (lets you perform some action when a set of tasks has finished running).
Bounding and Blocking in BlockingCollection. This allows you to set a maximum size for your output collection (for memory reasons) and producer thread(s) will wait for consumers to pick up elements in case the maximum size you specify is reached.
BlockingCollection.CompleteAdding and BlockingCollection.IsCompleted, which can be used to synchronize producers and consumers (a producer can signal when it's finished, and a consumer can check for that and keep running until the producers are finished).
A more complete sample is in the second article I linked.
In your case I think you want the consumer to just pick up things from the result queue and dispose of them as soon as possible (write them to a logging store, or similar).
So your final collection, where you dump log items should be a BlockingCollection, not a ConcurrentQueue.
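As a rough illustration of the bounded producer/consumer shape described above, here is a minimal sketch in Python rather than C#. It assumes `queue.Queue(maxsize=...)` as a stand-in for a bounded BlockingCollection and a sentinel object as a stand-in for CompleteAdding; `run_pipeline`, `max_pending` and the in-memory `written` list are hypothetical names used only for this example.

```python
import queue
import threading

_DONE = object()  # sentinel: plays the role of CompleteAdding in this sketch


def run_pipeline(items, max_pending=4):
    """Process items while a logger thread drains a bounded result queue."""
    results = queue.Queue(maxsize=max_pending)  # put() blocks when the queue is full
    written = []  # hypothetical stand-in for a real log store (file, database, ...)

    def logger():
        while True:
            entry = results.get()  # blocks until an entry is available
            if entry is _DONE:
                break
            written.append(entry)

    consumer = threading.Thread(target=logger)
    consumer.start()  # logging starts before any item has been processed

    for item in items:
        # The producer waits here whenever the logger falls behind,
        # so at most max_pending log entries are held in memory.
        results.put(f"processed {item}")

    results.put(_DONE)  # tell the logger that no more entries are coming
    consumer.join()
    return written
```

Calling `run_pipeline(range(3))` returns the three log entries in production order, because a single producer feeds a FIFO queue. With several producers, each would need to finish before the sentinel is queued.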

Related

Is my design for sending data to clients at various intervals correct?

The code should be written in C++. I'm mentioning this in case someone suggests a solution that won't work efficiently when implemented in C++.
Objective:
A Producer that runs on thread t1 passes images to a Consumer that runs on thread t2. The Consumer has a list of clients to which it should send the images at various intervals. E.g. client1 requires images every 1 sec, client2 requires images every 5 sec, etc.
Suggested implementation:
There is one main queue, imagesQ, in the Consumer, to which the Producer enqueues images. In addition to the main queue, the Consumer manages a vector of queues, clientImageQs, with one queue per client. The Consumer creates a sub-consumer, which runs on its own thread, for each client. Each sub-consumer dequeues images from its own queue in clientImageQs and sends them to its client at that client's interval.
Every time a new image arrives in imagesQ, the Consumer duplicates it and enqueues a copy to each queue in clientImageQs. Thus, each sub-consumer is able to send images to its client at its own frequency.
Potential problem and solution:
If the Producer enqueues images at a much higher rate than one of the sub-consumers dequeues them, that queue will grow without bound. However, the Consumer can check the size of each queue in clientImageQs before enqueuing and, if needed, dequeue a few old images before enqueuing new ones.
Question
Is this a good design, or is there a better one?
You describe the problem within a set of already determined solution limitations. Your description is complex, confusing, and I dare say, confused.
Why have a consumer that only distributes images out of a shared buffer? Why not allow each "client", as you call it, to read from the buffer as it needs to?
Why not implement the shared buffer as a single-image buffer? The producer writes at its own rate. The clients perform non-destructive reads of the buffer at their own rates. Each client is guaranteed to get the most recent image whenever it reads the buffer. The producer simply overwrites the buffer with each write.
A multi-element queue offers no benefit in this application. In fact, as you have described, it greatly complicates the solution.
See http://sworthodoxy.blogspot.com/2015/05/shared-resource-design-patterns.html Look for the heading "unconditional buffer".
The examples in the posting listed above are all implemented using Ada, but the concepts related to concurrent design patterns are applicable to all programming languages supporting concurrency.
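A minimal sketch of such a single-image buffer, written in Python for brevity (the question asks for C++, where the same shape maps onto a mutex-protected member). The class name `LatestImageBuffer` and the version counter are assumptions added for illustration; the counter only lets a client notice that nothing new has arrived since its last read.

```python
import threading


class LatestImageBuffer:
    """Holds only the most recent image; writes overwrite, reads do not consume."""

    def __init__(self):
        self._lock = threading.Lock()
        self._image = None
        self._version = 0  # hypothetical extra: counts writes so far

    def write(self, image):
        # Producer side: never blocks on readers, just replaces the image.
        with self._lock:
            self._image = image
            self._version += 1

    def read(self):
        # Client side: non-destructive, returns whatever is newest right now.
        with self._lock:
            return self._version, self._image


buf = LatestImageBuffer()
buf.write("frame-1")
buf.write("frame-2")          # overwrites frame-1
version, image = buf.read()   # any client, at any interval, sees frame-2
```

Each client thread would call `read()` on its own timer (every 1 s, every 5 s, and so on), so no per-client queues and no queue trimming are needed.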

Do DirectShow filters ever return S_FALSE from ReceiveCanBlock?

All of the standard Microsoft system filters appear to return S_OK from IMemInputPin::ReceiveCanBlock to say that they can block. Even the system Null renderer filter returns S_OK to signify that it can block - of all filters surely this is the least likely to block since it just discards samples?
Do any filters not block on their input pins? What does "blocking" in this context really mean?
That a filter might hold onto a sample indefinitely until the sample's presentation time (which might be a long time if playback is paused)?
That a filter might take a non-negligible amount of time to process a sample before returning to Receive?
That a filter will process the sample on the same thread as it was received?
Even if a filter processes input samples on a worker thread, most will use some sort of queueing mechanism with a finite capacity, which could end up blocked if the downstream filter blocks.
The default behaviour in the baseclasses seems to be that a filter blocks unless it has at least one connected output pin and all of the connected output pins are connected to non-blocking IMemInputPin pins. If there are no non-blocking renderers then how can any other filter be non-blocking?
The whole idea, as documented, is the ability to report upstream that the next sample will be accepted without blocking. That in turn is the case when the filter queues samples and fetches them asynchronously. The BaseClasses use this in the COutputQueue class to decide whether the queue should skip queuing and deliver directly (when non-blocking behavior is guaranteed). In that case a filter would not start a worker thread, which saves some resources.
I suppose filters don't use this, with possibly rare exceptions. A filter that does not block, as far as I can think of one, is not a renderer (even though Dump/Null filters do not block). It is rather a transformation filter which, for its own reasons, processes samples on worker threads: for example, it queues data on the input because processing is done by a thread pool, and a thread picks an item from the queue once it is idle and ready for the next piece of data.

How to have many consumer threads using BlockingCollection

I am using a producer / consumer pattern backed with a BlockingCollection to read data off a file, parse/convert and then insert into a database. The code I have is very similar to what can be found here: http://dhruba.name/2012/10/09/concurrent-producer-consumer-pattern-using-csharp-4-0-blockingcollection-tasks/
However, the main difference is that my consumer threads not only parse the data but also insert into a database. This bit is slow, and I think is causing the threads to block.
In the example, there are two consumer threads. I am wondering if there is a way to have the number of threads increase in a somewhat intelligent way? I had thought a threadpool would do this, but can't seem to grasp how that would be done.
Alternatively, how would you go about choosing the number of consumer threads? 2 does not seem correct for me, but I'm not sure what the best # would be. Thoughts on the best way to choose # of consumer threads?
The best way to choose the number of consumer threads is math: figure out how many packets per minute are coming in from the producers, divide that by how many packets per minute a single consumer can handle, and you have a pretty good idea of how many consumers you need.
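The arithmetic above can be sketched as follows. The function name and the rates in the example are hypothetical; real rates would have to be measured, and rounding up is an assumption made so that the consumers can keep pace with the producers.

```python
import math


def consumers_needed(incoming_per_minute, handled_per_minute_per_consumer):
    """Estimate the consumer count as incoming rate / per-consumer rate, rounded up."""
    if handled_per_minute_per_consumer <= 0:
        raise ValueError("per-consumer throughput must be positive")
    # Always keep at least one consumer, even when nothing is arriving yet.
    return max(1, math.ceil(incoming_per_minute / handled_per_minute_per_consumer))


# Hypothetical figures: 1200 packets/min arriving, 250 packets/min per consumer.
estimate = consumers_needed(1200, 250)  # 1200 / 250 = 4.8, rounded up to 5
```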
I solved the blocking output problem (consumers blocking when trying to update the database) by adding another BlockingCollection that the consumers put their completed packets in. A separate thread reads that queue and updates the database. So it looks something like:
input thread(s) => input queue => consumer(s) => output queue => output thread
This has the added benefit of divorcing the consumers from the output, meaning that you can optimize the output or completely change the output method without affecting the consumer. That might allow you, for example, to batch the database updates so that rather than making one database call per record, you could update a dozen or a hundred (or more) records with a single call.
I show a very simple example of this (using a single consumer) in my article Simple Multithreading, Part 2. That works with a text file filter, but the concepts are the same.
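A minimal sketch of the input queue => consumers => output queue => output thread layout, in Python rather than C#. It assumes `queue.Queue` in place of BlockingCollection and sentinel objects in place of CompleteAdding; `run`, `batch_size` and the `batches` list (where each entry stands in for one batched database call) are hypothetical names for illustration only.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a queue


def run(records, n_consumers=3, batch_size=4):
    in_q, out_q = queue.Queue(), queue.Queue()
    batches = []  # each appended list stands in for one batched database call

    def consumer():
        while True:
            rec = in_q.get()
            if rec is _DONE:
                break
            out_q.put(rec.upper())  # stand-in for the parse/convert step

    def writer():
        batch = []
        while True:
            row = out_q.get()
            if row is _DONE:
                break
            batch.append(row)
            if len(batch) == batch_size:
                batches.append(batch)  # "write" one full batch
                batch = []
        if batch:
            batches.append(batch)  # flush the final partial batch

    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    out_thread = threading.Thread(target=writer)
    for t in consumers:
        t.start()
    out_thread.start()

    for rec in records:
        in_q.put(rec)
    for _ in consumers:
        in_q.put(_DONE)  # one sentinel per consumer
    for t in consumers:
        t.join()
    out_q.put(_DONE)  # only after every consumer has finished
    out_thread.join()
    return batches
```

Because only the writer thread touches the output store, the batch size or the storage method can change without any change to the consumers.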

How does one determine if all messages in an Azure Queue have been processed?

I've just begun tinkering with Windows Azure and would appreciate help with a question.
How does one determine if a Windows Azure Queue is empty and that all work-items in it have been processed? If I have multiple worker processes querying a work-item queue, GetMessage(s) returns no messages if the queue is empty. But there is no guarantee that a currently invisible message will not be pushed back into the queue.
I need this functionality since follow-up behavior of my workflow depends on completion of all work-items in that particular queue. A possible way of tackling this problem would be to count the number of puts and deletes. But this will again require synchronization at a shared storage level and I would like to avoid it if possible.
Any ideas?
Take a look at the ApproximateMessageCount method. This should return the number of messages on the queue, including invisible messages (e.g. the ones being processed).
Mike Wood blogged about this subtlety, along with a tidbit about the queue's Clear method, here.
That said: you might want to choose a different mechanism for workflow management. Maybe a table row, where the rowkey equals some multi-queue-item transaction id and the individual properties are status flags. This allows you to track failed parts of the transaction (say, 9 out of 10 queue items process OK and the 10th fails; you can still delete the 10th queue item but set its status flag to failed, which lets you deal with this scenario accordingly). Also: let's say you use the same queue to process another 'transaction' (meaning the queue is again non-zero in length). By using a separate object like a Table Row, you can still determine that your 'transaction' is complete even though there are additional queue messages.
The best way is to have another queue, call it the termination indicator queue, and put a message in that queue for every message you process from your main queue. That is how it is done in research projects too. Check this out: http://www.cs.gsu.edu/dimos/content/gis-vector-data-overlay-processing-azure-platform.html
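The status-flag idea can be sketched in memory as below. This is not Azure code: `TransactionTracker` is a hypothetical class that only mimics a row keyed by a transaction id with one status property per queue item. A real version would persist the row in table storage so that every worker process can see it.

```python
import threading


class TransactionTracker:
    """In-memory stand-in for a table row: one status flag per queue item."""

    def __init__(self, transaction_id, item_ids):
        self.transaction_id = transaction_id  # plays the role of the rowkey
        self._lock = threading.Lock()
        self._status = {item_id: "pending" for item_id in item_ids}

    def mark(self, item_id, status):
        if status not in ("done", "failed"):
            raise ValueError("status must be 'done' or 'failed'")
        with self._lock:
            self._status[item_id] = status

    def is_complete(self):
        # Complete means every item was handled, whether it succeeded or failed.
        with self._lock:
            return all(s != "pending" for s in self._status.values())

    def failed_items(self):
        with self._lock:
            return sorted(k for k, s in self._status.items() if s == "failed")


tracker = TransactionTracker("txn-1", range(10))
for i in range(9):
    tracker.mark(i, "done")
tracker.mark(9, "failed")  # the 10th item failed but is still accounted for
```

Completion is decided from the row alone, so further messages arriving on the same queue for another transaction do not affect the answer.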

How to approach parallel processing of messages?

I am redesigning the messaging system for my app to use Intel Threading Building Blocks and am stumped trying to decide between two possible approaches.
Basically, I have a sequence of message objects and, for each message type, a sequence of handlers. For each message object, I apply each handler registered for that message object's type.
The sequential version would be something like this (pseudocode):
for each message in message_sequence <- SEQUENTIAL
    for each handler in (handler_table for message.type)
        apply handler to message <- SEQUENTIAL
The first approach which I am considering processes the message objects in turn (sequentially) and applies the handlers concurrently.
Pros:
predictable ordering of messages (ie, we are guaranteed a FIFO processing order)
(potentially) lower latency of processing each message
Cons:
more processing resources available than handlers for a single message type (bad parallelization)
bad use of processor cache since message objects need to be copied for each handler to use
large overhead for small handlers
The pseudocode of this approach would be as follows:
for each message in message_sequence <- SEQUENTIAL
    parallel_for each handler in (handler_table for message.type)
        apply handler to message <- PARALLEL
The second approach is to process the messages in parallel and apply the handlers to each message sequentially.
Pros:
better use of processor cache (keeps the message object local to all handlers which will use it)
small handlers don't impose as much overhead (as long as there are other handlers also to be run)
more messages are expected than there are handlers, so the potential for parallelism is greater
Cons:
Unpredictable ordering - if message A is sent before message B, they may both be processed at the same time, or B may finish processing before all of A's handlers are finished (order is non-deterministic)
The pseudocode is as follows:
parallel_for each message in message_sequence <- PARALLEL
    for each handler in (handler_table for message.type)
        apply handler to message <- SEQUENTIAL
The second approach has more advantages than the first, but non-deterministic ordering is a big disadvantage.
Which approach would you choose and why? Are there any other approaches I should consider (besides the obvious third approach: parallel messages and parallel handlers, which has the disadvantages of both and no real redeeming factors as far as I can tell)?
Thanks!
EDIT:
I think what I'll do is use #2 by default, but allow a "conversation tag" to be attached to each message. Any messages with the same tag are ordered and handled sequentially within their conversation. Handlers are passed the conversation tag alongside the message, so they may continue the conversation if required. Something like this:
Conversation c = new_conversation()
send_message(a, c)
...
send_message(b, c)
...
send_message(x)
handler foo (msg, conv)
    send_message(z, c)
...
register_handler(foo, a.type)
a is handled before b, which is handled before z. x can be handled in parallel to a, b and z. Once all messages in a conversation have been handled, the conversation is destroyed.
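One possible shape for the conversation-tag idea, sketched in Python with a thread pool rather than with TBB. The `dispatch` function and the `(payload, tag)` message format are assumptions for illustration. This sketch handles a fixed batch of messages and does not cover handlers that send follow-up messages into a conversation, which the scheme above allows.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(messages, handle, max_workers=4):
    """messages: iterable of (payload, tag) pairs; a tag of None means 'unordered'."""
    conversations = {}  # tag -> payloads in send order (dicts keep insertion order)
    standalone = []
    for payload, tag in messages:
        if tag is None:
            standalone.append(payload)
        else:
            conversations.setdefault(tag, []).append(payload)

    def run_conversation(payloads):
        for payload in payloads:  # sequential within one conversation
            handle(payload)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each conversation is one task; untagged messages are independent tasks.
        futures = [pool.submit(run_conversation, ps) for ps in conversations.values()]
        futures += [pool.submit(handle, p) for p in standalone]
        for future in futures:
            future.result()  # re-raise any exception thrown by a handler
```

With messages `[("a", "c"), ("x", None), ("b", "c")]`, `a` is always handled before `b`, while `x` may run at any point relative to them.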
I'd say do something even different. Don't send work to the threads. Have the threads pull work when they finish previous work.
Maintain a fixed number of worker threads (ideally equal to the number of CPU cores in the system) and have each of them pull the next task from a global queue as soon as it finishes the previous one. Obviously, you would need to keep track of dependencies between messages so that a message is deferred until its dependencies are fully handled.
This could be done with very small synchronization overhead - possibly only with atomic operations, no heavy primitives like mutexes or semaphores.
Also, if you pass a message to each handler by reference, instead of making a copy, having the same message handled simultaneously by different handlers on different CPU cores can actually improve cache performance, as higher levels of cache (usually from L2 upwards) are often shared between CPU cores - so when one handler reads a message into the cache, the other handler on the second core will have this message already in L2. So think carefully - do you really need to copy the messages?
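A minimal sketch of the pull model from this answer: a fixed set of workers, each taking the next task from one global queue when it becomes free. It is written in Python, where `queue.Queue` uses a lock internally, so it does not demonstrate the atomics-only synchronization mentioned above, and dependency tracking between messages is left out. `run_workers` is a hypothetical name.

```python
import os
import queue
import threading


def run_workers(tasks, n_workers=None):
    """Run zero-argument callables on a fixed pool of workers that pull their own work."""
    n_workers = n_workers or os.cpu_count() or 1
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()  # pull the next task when free
            except queue.Empty:
                return  # queue drained: this worker is done
            value = task()
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # completion order, not submission order
```

Because every task is queued before the workers start, an empty queue reliably means there is no more work. A long-running system would block on `get()` and use a shutdown signal instead.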
If possible I would go for number two with some tweaks. Do you really need every message to be in order? I find that to be an unusual case. Some messages just need to be handled as soon as possible, and some messages need to be processed before another message but not before every message.
If there are some messages that have to be in order, then mark them in some way. You can mark them with a conversation code that lets the processor know that they must be processed in order relative to the other messages in that conversation. Then you can process all conversation-less messages and one message from each conversation concurrently.
Give your design a good look and make sure that only messages that need to be in order are.
I suppose it comes down to whether or not the order is important. If the order is unimportant you can go for method 2. If the order is important you go for method 1. Depending on what your application is supposed to do, you can still go for method 2, but use a sequence number so all the messages are processed in the correct order (unless, of course, it is the processing part you are trying to optimize).
The first method also has unpredictable ordering. The processing of message 1 on thread 1 could take very long, making it possible that messages 2, 3 and 4 have long since been processed.
This would tip the balance to method 2.
Edit:
I see what you mean.
However, why would you run the handlers sequentially in method 2? In method 1 the ordering of handlers doesn't matter and you're fine with that.
E.g. Method 3: both handle the messages and the handlers in parallel.
Of course, here also, the ordering is unguaranteed.
Given that there is some result of the handlers, you might just store the results in an ordered list, this way restoring ordering eventually.
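One way to restore ordering after parallel handling, as suggested above, is to remember each message's position and write its result into that slot. This is a Python sketch with a thread pool; `process_in_order` is a hypothetical name, and it assumes each handler returns a result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_order(messages, handler, max_workers=4):
    """Handle messages in parallel, then return results in the original message order."""
    results = [None] * len(messages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the message it came from.
        futures = {pool.submit(handler, m): i for i, m in enumerate(messages)}
        for future in as_completed(futures):  # arrives in completion order
            results[futures[future]] = future.result()
    return results


ordered = process_in_order(["a", "b", "c"], str.upper)
```

Only the results are reordered here. Side effects inside the handlers still happen in a non-deterministic order.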
