I am using a producer / consumer pattern backed with a BlockingCollection to read data off a file, parse/convert and then insert into a database. The code I have is very similar to what can be found here: http://dhruba.name/2012/10/09/concurrent-producer-consumer-pattern-using-csharp-4-0-blockingcollection-tasks/
However, the main difference is that my consumer threads not only parse the data but also insert into a database. This bit is slow, and I think is causing the threads to block.
In the example, there are two consumer threads. I am wondering if there is a way to have the number of threads increase in a somewhat intelligent way? I had thought a threadpool would do this, but can't seem to grasp how that would be done.
Alternatively, how would you go about choosing the number of consumer threads? 2 does not seem correct for me, but I'm not sure what the best # would be. Thoughts on the best way to choose # of consumer threads?
The best way to choose the number of consumer threads is math: figure out how many packets per minute are coming in from the producers, divide that by how many packets per minute a single consumer can handle, and you have a pretty good idea of how many consumers you need.
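As a rough worked example of that arithmetic (the rates are made-up numbers, and the headroom factor is an added assumption rather than part of the rule above), something like this would do:

```python
import math

def consumers_needed(incoming_per_minute, per_consumer_per_minute, headroom=1.0):
    """Return the smallest consumer count that keeps up with the producers.

    headroom > 1.0 adds spare capacity for bursts (an assumption, tune as needed).
    """
    if per_consumer_per_minute <= 0:
        raise ValueError("per-consumer throughput must be positive")
    return max(1, math.ceil(incoming_per_minute * headroom / per_consumer_per_minute))

# Hypothetical measurements: 6000 packets/min arriving, 450 packets/min per consumer.
print(consumers_needed(6000, 450))
```

Both rates have to be measured on the real workload; if the database insert dominates, the per-consumer rate is mostly the insert rate.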
I solved the blocking output problem (consumers blocking when trying to update the database) by adding another BlockingCollection that the consumers put their completed packets in. A separate thread reads that queue and updates the database. So it looks something like:
input thread(s) => input queue => consumer(s) => output queue => output thread
This has the added benefit of divorcing the consumers from the output, meaning that you can optimize the output or completely change the output method without affecting the consumer. That might allow you, for example, to batch the database updates so that rather than making one database call per record, you could update a dozen or a hundred (or more) records with a single call.
I show a very simple example of this (using a single consumer) in my article Simple Multithreading, Part 2. That works with a text file filter, but the concepts are the same.
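A minimal sketch of the input queue => consumer(s) => output queue => output thread layout, written in Python with queue.Queue standing in for BlockingCollection. The parse step and the write_batch function are placeholders (assumptions), not the code from the article:

```python
import queue
import threading

SENTINEL = object()  # marks the end of a stream

def consumer(in_q, out_q):
    """Parse items from in_q and hand results to out_q (no database work here)."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        out_q.put(item.strip().upper())  # stand-in for the real parse/convert step

def writer(out_q, write_batch, batch_size):
    """Drain out_q and write records in batches instead of one call per record."""
    batch = []
    while True:
        record = out_q.get()
        if record is SENTINEL:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            write_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        write_batch(batch)

def run_pipeline(lines, write_batch, n_consumers=2, batch_size=100):
    in_q, out_q = queue.Queue(), queue.Queue()
    consumers = [threading.Thread(target=consumer, args=(in_q, out_q))
                 for _ in range(n_consumers)]
    out_thread = threading.Thread(target=writer, args=(out_q, write_batch, batch_size))
    for t in consumers:
        t.start()
    out_thread.start()
    for line in lines:       # the input thread's job, done inline here
        in_q.put(line)
    for _ in consumers:      # one sentinel per consumer
        in_q.put(SENTINEL)
    for t in consumers:
        t.join()
    out_q.put(SENTINEL)      # all consumers are done, so the writer can finish
    out_thread.join()
```

Here write_batch would be the one place that talks to the database, so switching from per-record inserts to a bulk insert only touches that function.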
Related
I'm looking for the best way to perform ETL using Python.
I have a channel in RabbitMQ that sends events (possibly as often as every second).
I want to process them in batches of 1000.
The main problem is that the RabbitMQ interface (I'm using pika) invokes a callback for every message.
I looked at the Celery framework, but its batch feature was deprecated in version 3.
What is the best way to do it? I'm thinking about saving my events in a list and, when it reaches 1000, copying it to another list and performing my processing. However, how do I make it thread-safe? I don't want to lose events, and I'm afraid of losing events while synchronising the list.
It sounds like a very simple use case, but I haven't found a good best practice for it.
How do I make it thread-safe?
How about setting the consumer prefetch count to 1000? Once a consumer's unacknowledged messages reach its prefetch limit, RabbitMQ will not deliver any more messages to it.
Don't ACK received messages until you have 1000 of them; then copy them to another list and perform your processing. When your job is done, ACK the last message with the multiple flag set, and the RabbitMQ server will treat every earlier message on that channel as acknowledged too.
But I am not sure whether large prefetch is the best practice.
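A sketch of that idea as a pika-style callback, assuming pika 1.x and its BlockingConnection, where callbacks run on the thread that calls start_consuming (so the list needs no lock in that setup). The process_batch function, the queue name "events" and the handle function in the commented wiring are hypothetical:

```python
class BatchingConsumer:
    """Collects message bodies and acknowledges them only after a full batch is processed."""

    def __init__(self, process_batch, batch_size=1000):
        self.process_batch = process_batch  # hypothetical user-supplied function
        self.batch_size = batch_size
        self.pending = []

    def on_message(self, channel, method, properties, body):
        # Same parameter order as pika's on_message_callback.
        self.pending.append(body)
        if len(self.pending) >= self.batch_size:
            batch, self.pending = self.pending, []
            self.process_batch(batch)
            # multiple=True acknowledges every delivery up to and including this tag.
            channel.basic_ack(delivery_tag=method.delivery_tag, multiple=True)

# Wiring with pika (needs a running broker, so it is left commented out):
# import pika
# connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
# channel = connection.channel()
# channel.basic_qos(prefetch_count=1000)  # must be at least batch_size
# channel.basic_consume(queue="events",
#                       on_message_callback=BatchingConsumer(handle).on_message)
# channel.start_consuming()
```

If the process dies before the ACK, the broker redelivers the unacknowledged messages, so process_batch should tolerate seeing a message twice.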
First of all, you should not "batch" messages from RabbitMQ unless you really have to. The most efficient way to work with messaging is to process each message independently.
If you need to combine messages in a batch, I would use a separate data store to temporarily store the messages, and then process them when they reach a certain condition. Each time you add an item to the batch, you check that condition (for example, you reached 1000 messages) and trigger the processing of the batch.
This is better than keeping a list in memory, because if your service dies, the messages will still be persisted in the database.
Note: If you have a single processor per queue, this can work without any synchronization mechanism. If you have multiple processors, you will need to implement some sort of locking mechanism.
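A minimal sketch of the separate-store approach, using SQLite purely as an example store and assuming a single processor per queue, as in the note above. The table name, the helper names and process_batch are all made up for illustration:

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the store; pass a file path instead of ':memory:' to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn

def add_to_batch(conn, payload, batch_size, process_batch):
    """Persist one message; process and clear the batch once it reaches batch_size."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))
    (count,) = conn.execute("SELECT COUNT(*) FROM pending").fetchone()
    if count < batch_size:
        return False
    rows = conn.execute(
        "SELECT id, payload FROM pending ORDER BY id LIMIT ?", (batch_size,)).fetchall()
    process_batch([p for _, p in rows])
    with conn:  # delete only after processing succeeded
        conn.execute("DELETE FROM pending WHERE id <= ?", (rows[-1][0],))
    return True
```

Because the rows are deleted only after process_batch returns, a crash in between leaves the batch in the table to be picked up again.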
I am trying to work out how to load bulk records into Elasticsearch using the bulk function, and I need to use threads to get some performance out of it. But I am stuck trying to work out how to limit the threads to 5 concurrent so it's not too heavy on Elasticsearch.
I was thinking of just looping over the db and filling a list, then when it hits a limit (e.g. 50), pushing it to a thread for processing and continuing. But this method will spawn too many threads, and I cannot see an obvious way to limit the threads without waiting for all of them to finish before adding another thread.
I have done this in golang before, where you can just add threads and when it hits the limit it will wait before adding more to the queue, but it seems a little more elusive in Python so far.
I am open to alternatives. This seems like the cleanest way to go so far, but there might be better methods, like db -> queue with a limit, then just threads to consume from the queue?
I look forward to some responses.
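One possible shape for this, as a sketch only: a fixed pool of 5 workers plus a semaphore, so the loop over the db pauses while all workers are busy instead of piling up batches. send_batch is a hypothetical function that would wrap the Elasticsearch bulk call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def bulk_index(rows, send_batch, batch_size=50, max_workers=5):
    """Group rows into batches and send at most max_workers batches at a time."""
    slots = threading.BoundedSemaphore(max_workers)

    def run(batch):
        try:
            send_batch(batch)
        finally:
            slots.release()  # free the slot even if the send fails

    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                slots.acquire()  # blocks the reading loop while all workers are busy
                futures.append(pool.submit(run, batch))
                batch = []
        if batch:  # last partial batch
            slots.acquire()
            futures.append(pool.submit(run, batch))
    for f in futures:
        f.result()  # re-raise any exception from a worker
```

The semaphore is what gives the "wait before adding more" behaviour; without it the pool would still run only 5 threads, but the pending batches would queue up in memory.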
The code should be written in C++. I'm mentioning this in case someone suggests a solution that won't work efficiently when implemented in C++.
Objective:
A Producer that runs on thread t1 passes images to a Consumer that runs on thread t2. The Consumer has a list of clients that it should send the images to at various intervals. E.g. client1 requires images every 1 sec, client2 requires images every 5 sec, etc.
Suggested implementation:
There is one main queue, imagesQ, in the Consumer, to which the Producer enqueues images. In addition to the main queue, the Consumer manages a vector of queues, clientImageQs, whose size is the number of clients. The Consumer creates a sub-consumer, which runs on its own thread, for each client. Each sub-consumer dequeues images from its own queue in clientImageQs and sends them to its client at that client's interval.
Every time a new image arrives in imagesQ, the Consumer duplicates it and enqueues it to each queue in clientImageQs. Thus, each sub-consumer will be able to send the images to its client at its own frequency.
Potential problem and solution:
If the Producer enqueues images at a much higher rate than one of the sub-consumers dequeues them, that queue will grow without bound. But the Consumer can check the size of each queue in clientImageQs before enqueuing and, if needed, dequeue a few old images before enqueuing new ones.
Question
Is this a good design, or is there a better one?
You describe the problem within a set of already determined solution limitations. Your description is complex, confusing, and I dare say, confused.
Why have a consumer that only distributes images out of a shared buffer? Why not allow each "client", as you call it, to read from the buffer as it needs to?
Why not implement the shared buffer as a single-image buffer? The producer writes at its own rate. The clients perform non-destructive reads of the buffer at their own rates. Each client is guaranteed to read the most recent image whenever it reads the buffer. The producer simply overwrites the buffer with each write.
A multi-element queue offers no benefit in this application. In fact, as you have described, it greatly complicates the solution.
See http://sworthodoxy.blogspot.com/2015/05/shared-resource-design-patterns.html and look for the heading "unconditional buffer".
The examples in the posting listed above are all implemented using Ada, but the concepts related to concurrent design patterns are applicable to all programming languages supporting concurrency.
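To make the single-image buffer concrete, here is a small sketch in Python (not C++ and not the Ada code from the posting). The class name and the version counter are additions for illustration; the counter only lets a client notice that it has already sent the current image:

```python
import threading

class LatestValueBuffer:
    """Holds only the most recent item; writes overwrite, reads do not consume."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = 0  # counts writes so readers can detect a repeat

    def write(self, value):
        """Called by the producer at its own rate; replaces the previous item."""
        with self._lock:
            self._value = value
            self._version += 1

    def read(self):
        """Return (version, value); version 0 means nothing was written yet."""
        with self._lock:
            return self._version, self._value
```

Each client thread would simply call read() on its own timer (every 1 sec, every 5 sec, and so on), so no per-client queue and no queue trimming is needed. In C++ the same shape would presumably be a mutex-protected shared pointer to the latest image.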
I have an application A which calls another application B, which does some calculation and writes to a file, File.txt.
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same file, File.txt.
Here is the actual problem:
Since multiple threads try to access the same file, the file access throws an exception, which is to be expected.
I tried using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class takes care of dequeuing the items and writing them to File.txt. The queue is read sequentially, so the write operations succeed. This works fine.
With many threads and many items in the queue the file writing works, but if for some reason my queue crashes or stops abruptly, all the information that was supposed to be written to the file is lost.
If I make the file writing synchronous from B without using the queue, it will be slow because it needs to check for file locking, but there is less chance of data going missing because B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed. I can't make B wait for the file writing to complete.
Would async/await file writing be of any use here?
I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers are born to solve this problem. Why continue with a file based solution? Is it possible to explore an alternative?
Is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move to new files, and an aggregation thread wakes up, reads the previous files (of each producer) and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery since the rolling files exist until you process and delete them.
This also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
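A rough sketch of the rolling-file idea in Python. The file naming scheme (producer id, then period number) and the function names are assumptions made for the example; in real use the period would be something like the current time divided by X seconds:

```python
import os
import tempfile

def append_record(spool_dir, producer_id, period, line):
    """Called by a producer: append one line to its own file for the current period."""
    path = os.path.join(spool_dir, "%s.%d.log" % (producer_id, period))
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")

def aggregate(spool_dir, current_period, output_path):
    """Append every file from a finished period to the output, then delete it."""
    for name in sorted(os.listdir(spool_dir)):
        parts = name.rsplit(".", 2)
        if len(parts) != 3 or parts[2] != "log" or not parts[1].isdigit():
            continue  # not one of our spool files
        if int(parts[1]) >= current_period:
            continue  # may still be written to by its producer
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as src, \
                open(output_path, "a", encoding="utf-8") as dst:
            dst.write(src.read())
            dst.flush()
            os.fsync(dst.fileno())
        os.remove(path)  # only after the data is in the output file

# Small demonstration with two producers and two periods.
with tempfile.TemporaryDirectory() as tmp:
    spool = os.path.join(tmp, "spool")
    os.mkdir(spool)
    out = os.path.join(tmp, "File.txt")
    append_record(spool, "b1", 1, "first")
    append_record(spool, "b2", 1, "second")
    append_record(spool, "b1", 2, "still open")
    aggregate(spool, current_period=2, output_path=out)
    with open(out, encoding="utf-8") as f:
        merged = f.read()
    leftover = os.listdir(spool)
```

Note that a crash between the write and the delete would make the aggregator copy that file again on the next run, so this sketch trades possible duplicates for never losing a record.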
Would async/await file writing be of any use here?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) shared resources (the output file) and 2) lack of consistency (when the queue crashes), neither of which async programming addresses.
The reason async is in the picture is that I don't want to delay B's existing work because of this file-writing operation.
async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async by using the asynchronous IO APIs.
I'm experimenting with the System.Collections.Concurrent namespace but I have a problem implementing my design.
My input queue (ConcurrentQueue) is getting populated fine from a Thread which is doing some I/O at startup to read and parse.
Next I kick off a Parallel.ForEach() on the input queue. I'm doing some I/O bound work on each item.
A log item is created for each item processed in the ForEach() and is dropped into a result queue.
What I would like to do is kick off the logging as soon as I start reading the input, because I may not be able to fit all of the log items in memory. What is the best way to wait for items to land in the result queue? Are there design patterns or examples that I should be looking at?
I think the pattern you're looking for is the producer/consumer pattern. More specifically, you can have a producer/consumer implementation built around TPL and BlockingCollection.
The main concepts you want to read about are:
Task,
BlockingCollection,
TaskFactory.ContinueWhenAll (will allow you to perform some action when a set of tasks/threads has finished running).
Bounding and Blocking in BlockingCollection. This allows you to set a maximum size for your output collection (for memory reasons) and producer thread(s) will wait for consumers to pick up elements in case the maximum size you specify is reached.
BlockingCollection.CompleteAdding and BlockingCollection.IsCompleted, which can be used to synchronize producers and consumers (a producer can say when it's finished, and a consumer can check for that and keep running until the producer(s) are finished).
A more complete sample is in the second article I linked.
In your case I think you want the consumer to just pick up things from the result queue and dispose of them as soon as possible (write them to a logging store, or similar).
So your final collection, where you dump log items, should be a BlockingCollection, not a ConcurrentQueue.
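The same shape can be sketched in Python, with a bounded queue.Queue playing the role of the bounded BlockingCollection and a sentinel object standing in for CompleteAdding. This is an analogy, not the TPL API, and work and write_log are hypothetical callables:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # plays the role of CompleteAdding in this sketch

def process_all(items, work, write_log, max_pending=100, workers=4):
    """Run work() over items in parallel while a logger thread drains the results."""
    results = queue.Queue(maxsize=max_pending)  # put() blocks while the queue is full

    def logger():
        while True:
            entry = results.get()
            if entry is DONE:
                return
            write_log(entry)

    def handle(item):
        results.put(work(item))  # waits here if the logger has fallen behind

    log_thread = threading.Thread(target=logger)
    log_thread.start()  # logging starts before any item is processed
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(handle, items))  # list() surfaces worker exceptions
    finally:
        results.put(DONE)  # tell the logger that no more entries will arrive
        log_thread.join()
```

The bound is what keeps memory in check: at most max_pending log items are waiting at any time, because the workers stall until the logger catches up.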