Is Parallel.ForEach in ConcurrentBag<T> thread safe - multithreading

Description of ConcurrentBag on MSDN is not clear:
Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.
My question is it thread safe and if this is a good practice to use ConcurrentBag in Parallel.ForEach.
For Instance:
private List<XYZ> MyMethod(List<MyData> myData)
{
var data = new ConcurrentBag<XYZ>();
Parallel.ForEach(myData, item =>
{
// Some data manipulation
data.Add(new XYZ(/* constructor parameters */);
});
return data.ToList();
}
This way I don't have to use synchronization locking in Parallel.ForEach using regular List.
Thanks a lot.

That looks fine to me. The way you're using it is thread-safe.
If you could return an IEnumerable<XYZ>, it could be made more efficient by not copying to a List<T> when you're done.

ConcurrentBag and Parallel.ForEach seems to me, no problem. If you uses this types in scenarios that has a large volume multi-user access, these classes in your implementation could rise up cpu process to levels that can crash your web-server. Furthermore, this implementation starts N tasks (threads) to execute each one of iterations, so be careful when choice this classes ans implementations. I recently spent in this situation and I had to extract memory dump to analyze whats happens into my web application core. So, be careful 'cause Concurrentbag is ThreadSafe and in web scenarios it is no better way.

Related

How do I Yield() to another thread in a Win8 C++/Xaml app?

Note: I'm using C++, not C#.
I have a bit of code that does some computation, and several bits of code that use the result. The bits that use the result are already in tasks, but the original computation is not -- it's actually in the callstack of the main thread's App::App() initialization.
Back in the olden days, I'd use:
while (!computationIsFinished())
std::this_thread::yield(); // or the like, depending on API
Yet this doesn't seem to exist for Windows Store apps (aka WinRT, pka Metro-style). I can't use a continuation because the bits that use the results are unconnected to where the original computation takes place -- in addition to that computation not being a task anyway.
Searching found Concurrency::Context::Yield(), but Context appears not to exist for Windows Store apps.
So... say I'm in a task on the background thread. How do I yield? Especially, how do I yield in a while loop?
First of all, doing expensive computations in a constructor is not usually a good idea. Even less so when it's the "App" class. Also, doing heavy work in the main (ASTA) thread is pretty much forbidden in the WinRT model.
You can use concurrency::task_completion_event<T> to interface code that isn't task-oriented with other pieces of dependent work.
E.g. in the long serial piece of code:
...
task_completion_event<ComputationResult> tce;
task<ComputationResult> computationTask(tce);
// This task is now tied to the completion event.
// Pass it along to interested parties.
try
{
auto result = DoExpensiveComputations();
// Successfully complete the task.
tce.set(result);
}
catch(...)
{
// On failure, propagate the exception to continuations.
tce.set_exception(std::current_exception());
}
...
Should work well, but again, I recommend breaking out the computation into a task of its own, and would probably start by not doing it during construction... surely an anti-pattern for a responsive UI. :)
Qt simply uses Sleep(0) in their WinRT yield implementation.

Effects of swapping buffers on concurrent access

Consider an application with two threads, Producer and Consumer.
Both threads are running approximately equally frequent, multiple times in a second.
Both threads access the same memory region, where Producer writes to the memory, and Consumer reads the current chunk of data and does something with it, without invalidating the data.
A classical approach is this one:
int[] sharedData;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(sharedData)
{
Array.Copy(data, sharedData, LENGTH);
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
lock(sharedData)
{
Array.Copy(sharedData, data, LENGTH);
}
DoSomething(data);
}
If we assume that the Array.Copy takes time, this code would run slow, since Producer always has to wait for Consumer during copying and vice versa.
An approach to this problem would be to create two buffers, one which is accessed by the Consumer, and one which is written to by the Producer, and swap the buffers, as soon as writing has finished.
int[] frontBuffer;
int[] backBuffer;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(backBuffer)
{
Array.Copy(data, backBuffer, LENGTH);
int[] temp = frontBuffer;
frontBuffer = backBuffer;
backBuffer = temp;
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
int[] currentFrontBuffer = frontBuffer;
lock(currentForntBuffer)
{
Array.Copy(currentFrontBuffer , data, LENGTH);
}
DoSomething(currentForntBuffer );
}
Now, my questions:
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
Will the code in the 2nd example execute faster than the code in the 1st example?
Are there any better methods to efficiently solve the problem described above?
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
Note: this is no classical producer/consumer problem: It is possible for Consumer to read the values multiple times before Producer writes it again - the old data stays valid until Producer writes new data.
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
As far as I can tell, because reference assignment is atomic, this may be safe but not ideal. Because the WriteValues() method reads from frontBuffer without a lock or memory barrier forcing a cache refresh, there no guarantee that the variable will ever be updated with new values from main memory. There is then a potential to continuously read the stale, cached values of that instance from the local register or CPU cache. I'm unsure of whether the compiler/JIT might infer a cache refresh anyway based on the local variable, maybe somebody with more specific knowledge can speak to this area.
Even if the values aren't stale, you may also run into more contention than you would like. For example...
Thread A calls WriteValues()
Thread A takes a lock on the instance in frontBuffer and starts copying.
Thread B calls WriteValues(int[])
Thread B writes its data, moves the currently locked frontBuffer instance into backBuffer.
Thread B calls WriteValues(int[])
Thread B waits on the lock for backBuffer because Thread A still has it.
Will the code in the 2nd example execute faster than the code in the 1st example?
I suggest that you profile it and find out. X being faster than Y only matters if Y is too slow for your particular needs, and you are the only one who knows what those are.
Are there any better methods to efficiently solve the problem described above?
Yes. If you are using .Net 4 and above, there is a BlockingCollection type in System.Collections.Concurrent that models the Producer/Consumer pattern well. If you consistently read more than you write, or have multiple readers to very few writers, you may also want to consider the ReaderWriterLockSlim class. As a general rule of thumb, you should do as little within a lock as you can, which will also help to alleviate your time issue.
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
You might be able to, but I wouldn't suggest trying that unless you are extremely familiar with multi-threading, cache coherency, and potential compiler/JIT optimizations. Locking will most likely be fine for your situation and it will be much easier for you (and others reading your code) to reason about and maintain.

Is there such a thing as a lockless queue for multiple read or write threads?

I was thinking, is it possible to have a lockless queue when more than one thread is reading or writing? I've seen an implementation with a lockless queue that worked with one read and one write thread but never more than one for either. Is it possible? I don't think it is. Can/does anyone want to prove it?
There are multiple algorithms available, I ended up implementing the An Optimistic Approach to Lock-Free FIFO Queues, which avoids the ABA problem via pointer-tagging (needs the CMPXCHG8B instruction on x86), and it runs fine in a production app (written in Delphi). (Another version, with Java code)
Nevertheless, to be really-really lockless, you would also need a lock-free memory allocator - see Scalable Lock-Free Dynamic Memory Allocation (implemented in Concurrent Building Block) or NBMalloc (but so far, I didn't get to use one of these).
You may also want to look at answers for optimistic lock-free FIFO queues impl?
Java's implementation of a Lockless Queue allows both reads and writes. This work is done with a compare and set operation (which is a single CPU instruction).
The ConcurrentLinkedQueue uses a method in which threads help each other read (or poll) objects from the queue. Since it is linked, the head of the queue can accept writes while the tail of the queue can accept reads (assuming enough room). All of this can be done in parallel and is completely thread safe.
With .NET 4.0, there is ConcurrentQueue(T) Class.
According to C# 4.0 in a nutshell, this is a lock free implementation. See also this blog entry.
You don't specifically need a lock, but an atomic way of deleting things from the queue. This is also possible without a lock and with an atomic test-and-set instruction.
There is a dynamic lock free queue in the OmniThreadLibrary by Primoz Gabrijelcic (the Delphi Geek): http://www.thedelphigeek.com/2010/02/omnithreadlibrary-105.html
With .NET 4.0, there is ConcurrentQueue Class.
Sample
https://dotnetfiddle.net/ehLZCm
public static void Main()
{
PopulateQueueParallel(new ConcurrentQueue<string>(), 500);
}
static void PopulateQueueParallel(ConcurrentQueue<string> queue, int queueSize)
{
Parallel.For(0, queueSize, (i) => queue.Enqueue(string.Format("my message {0}", i)));
Parallel.For(0, queueSize,
(i) =>
{
string message;
bool success = queue.TryDequeue(out message);
if (!success)
throw new Exception("Error!");
Console.WriteLine(message);
});
}

Is this a safe version of double-checked locking?

Slightly modified version of canonical broken double-checked locking from Wikipedia:
class Foo {
private Helper helper = null;
public Helper getHelper() {
if (helper == null) {
synchronized(this) {
if (helper == null) {
// Create new Helper instance and store reference on
// stack so other threads can't see it.
Helper myHelper = new Helper();
// Atomically publish this instance.
atomicSet(helper, myHelper);
}
}
}
return helper;
}
}
Does simply making the publishing of the newly created Helper instance atomic make this double checked locking idiom safe, assuming that the underlying atomic ops library works properly? I realize that in Java, one could just use volatile, but even though the example is in pseudo-Java, this is supposed to be a language-agnostic question.
See also:
Double checked locking Article
It entirely depends on the exact memory model of your platform/language.
My rule of thumb: just don't do it. Lock-free (or reduced lock, in this case) programming is hard and shouldn't be attempted unless you're a threading ninja. You should only even contemplate it when you've got profiling proof that you really need it, and in that case you get the absolute best and most recent book on threading for that particular platform and see if it can help you.
I don't think you can answer the question in a language-agnostic fashion without getting away from code completely. It all depends on how synchronized and atomicSet work in your pseudocode.
The answer is language dependent - it comes down to the guarantees provided by atomicSet().
If the construction of myHelper can be spread out after the atomicSet() then it doesn't matter how the variable is assigned to the shared state.
i.e.
// Create new Helper instance and store reference on
// stack so other threads can't see it.
Helper myHelper = new Helper(); // ALLOCATE MEMORY HERE BUT DON'T INITIALISE
// Atomically publish this instance.
atomicSet(helper, myHelper); // ATOMICALLY POINT UNINITIALISED MEMORY from helper
// other thread gets run at this time and tries to use helper object
// AT THE PROGRAMS LEISURE INITIALISE Helper object.
If this is allowed by the language then the double checking will not work.
Using volatile would not prevent a multiple instantiations - however using the synchronize will prevent multiple instances being created. However with your code it is possible that helper is returned before it has been setup (thread 'A' instantiates it, but before it is setup thread 'B' comes along, helper is non-null and so returns it straight away. To fix that problem, remove the first if (helper == null).
Most likely it is broken, because the problem of a partially constructed object is not addressed.
To all the people worried about a partially constructed object:
As far as I understand, the problem of partially constructed objects is only a problem within constructors. In other words, within a constructor, if an object references itself (including it's subclass) or it's members, then there are possible issues with partial construction. Otherwise, when a constructor returns, the class is fully constructed.
I think you are confusing partial construction with the different problem of how the compiler optimizes the writes. The compiler can choose to A) allocate the memory for the new Helper object, B) write the address to myHelper (the local stack variable), and then C) invoke any constructor initialization. Anytime after point B and before point C, accessing myHelper would be a problem.
It is this compiler optimization of the writes, not partial construction that the cited papers are concerned with. In the original single-check lock solution, optimized writes can allow multiple threads to see the member variable between points B and C. This implementation avoids the write optimization issue by using a local stack variable.
The main scope of the cited papers is to describe the various problems with the double-check lock solution. However, unless the atomicSet method is also synchronizing against the Foo class, this solution is not a double-check lock solution. It is using multiple locks.
I would say this all comes down to the implementation of the atomic assignment function. The function needs to be truly atomic, it needs to guarantee that processor local memory caches are synchronized, and it needs to do all this at a lower cost than simply always synchronizing the getHelper method.
Based on the cited paper, in Java, it is unlikely to meet all these requirements. Also, something that should be very clear from the paper is that Java's memory model changes frequently. It adapts as better understanding of caching, garbage collection, etc. evolve, as well as adapting to changes in the underlying real processor architecture that the VM runs on.
As a rule of thumb, if you optimize your Java code in a way that depends on the underlying implementation, as opposed to the API, you run the risk of having broken code in the next release of the JVM. (Although, sometimes you will have no choice.)
dsimcha:
If your atomicSet method is real, then I would try sending your question to Doug Lea (along with your atomicSet implementation). I have a feeling he's the kind of guy that would answer. I'm guessing that for Java he will tell you that it's cheaper to always synchronize and to look to optimize somewhere else.

examples of garbage collection bottlenecks

I remembered someone telling me one good one. But i cannot remember it. I spent the last 20mins with google trying to learn more.
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
from an old sun tech tip -- sometimes it helps to explicitly nullify references in order to make them eligible for garbage collection earlier:
public class Stack {
private static final int MAXLEN = 10;
private Object stk[] = new Object[MAXLEN];
private int stkp = -1;
public void push(Object p) {stk[++stkp] = p;}
public Object pop() {return stk[stkp--];}
}
rewriting the pop method in this way helps ensure that garbage collection gets done in a timely fashion:
public Object pop() {
Object p = stk[stkp];
stk[stkp--] = null;
return p;
}
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
The following will be inefficient when using a generational garbage collector:
Mutating references in the heap because write barriers are significantly more expensive than pointer writes. Consider replacing heap allocation and references with an array of value types and an integer index into the array, respectively.
Creating long-lived temporaries. When they survive the nursery generation they must be marked, copied and all pointers to them updated. If it is possible to coalesce updates in order to reuse of an old version of a collection, do so.
Complicated heap topologies. Again, consider replacing many references with indices.
Deep thread stacks. Try to keep stacks shallow to make it easier for the GC to collate the global roots.
However, I would not call these "bad" because there is nothing objectively wrong with them. They are only inefficient when used with this kind of garbage collector. With manual memory management, none of the issues arise (although many are replaced with equivalent issues, e.g. performance of malloc vs pool allocators). With other kinds of GC some of these issues disappear, e.g. some GCs don't have a write barrier, mark-region GCs should handle long-lived temporaries better, not all VMs need thread stacks.
When you have some loop involving the creation of new object's instances: if the number of cycles is very high you procuce a lot of trash causing the Garbage Collector to run more frequently and so decreasing performance.
One example would be object references that are kept in member variables oder static variables. Here is an example:
class Something {
static HugeInstance instance = new HugeInstance();
}
The problem is the garbage collector has no way of knowing, when this instance is not needed anymore. So its usually better to keep things in local variables and have small functions.
String foo = new String("a" + "b" + "c");
I understand Java is better about this now, but in the early days that would involve the creation and destruction of 3 or 4 string objects.
I can give you an example that will work with the .Net CLR GC:
If you override a finalize method from a class and do not call the super class Finalize method such as
protected override void Finalize(){
Console.WriteLine("Im done");
//base.Finalize(); => you should call him!!!!!
}
When you resurrect an object by accident
protected override void Finalize(){
Application.ObjJolder = this;
}
class Application{
static public object ObjHolder;
}
When you use an object that uses Finalize it takes two GC collections to get rid of the data, and in any of the above codes you won't delete it.
frequent memory allocations
lack of memory reusing (when dealing with large memory chunks)
keeping objects longer than needed (keeping references on obsolete objects)
In most modern collectors, any use of finalization will slow the collector down. And not just for the objects that have finalizers.
Your custom service does not have a load limiter on it, so:
A lot requests come in for some reason at the same time (everyone logs on in the morning say)
The service takes longer to process each requests as it now has 100s of threads (1 per request)
Yet more part processed requests builds up due to the longer processing time.
Each part processed request has created lots of objects that live until the end of processing that request.
The garbage collector spends lots of time trying to free memory it, however it can’t due to the above.
Yet more part processed requests builds up due to the longer processing time…. (including time in GC)
I have encountered a nice example while doing some parallel cell based simulation in Python. Cells are initialized and sent to worker processes after pickling for running. If you have too many cells at any one time the master node runs out of ram. The trick is to make a limited number of cells pack them and send them off to cluster nodes before making some more, remember to set the objects already sent off to "None". This allows you to perform large simulations using the total RAM of the cluster in addition to the computing power.
The application here was cell based fire simulation, only the cells actively burning were kept as objects at any one time.

Resources