What is a data-dependency barrier in the Linux kernel?

As the title says, I am looking for an in-depth explanation of the data-dependency barrier in SMP, especially with respect to the Linux kernel. I have the definition and a brief description handy in this link here.
Linux Kernel Memory Barriers Documentation
However, I am trying to get a deeper understanding of this concept. Your thoughts and input are highly appreciated.

Actually, at least in terms of C++11, this is more closely related to consume semantics. You can read more about it e.g. here. In short, they provide weaker guarantees than acquire semantics, which makes them more efficient on certain platforms that support data dependency ordering.

In C++ terms, it is called memory_order_consume. See this presentation from Paul McKenney.

Old answer: I believe "acquire semantics" is the more commonly used term for what the document is calling a "data-dependency barrier". See for example this presentation or the C++11 memory_order_acquire.
Update: per the comments, the Linux description of data-dependency barriers sounds more like C++ "consume semantics".
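To make the consume/data-dependency idea concrete, here is a minimal C++11 sketch (the Payload type and the publish/consume function names are illustrative, not taken from the kernel documentation). The consumer loads a pointer with memory_order_consume, and only reads that are data-dependent on that pointer are ordered after the load, which matches what the kernel documentation calls a data-dependency barrier; in practice most compilers currently strengthen consume to acquire, but the code relies only on the weaker guarantee.

    #include <atomic>

    struct Payload { int value; };

    std::atomic<Payload*> g_ptr(nullptr);

    // Producer: initialize the data, then publish the pointer.
    void publish(Payload* p) {
        p->value = 42;                              // ordinary store
        g_ptr.store(p, std::memory_order_release);  // make it visible
    }

    // Consumer: a consume load orders only accesses that are
    // data-dependent on the loaded pointer (p->value below).
    // On most architectures this needs no explicit fence, whereas
    // acquire would be stronger (and possibly costlier) than needed.
    int consume() {
        Payload* p = g_ptr.load(std::memory_order_consume);
        return p ? p->value : -1;  // data-dependent read
    }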

Related

How do I implement the ABA solution?

I am trying to implement the Michael-Scott FIFO queue from here. I'm unable to implement their solution to the ABA problem, and I get this error:
error: incompatible type for argument 1 of '__sync_val_compare_and_swap'
For reference, I am using a Linux box to compile this on an Intel architecture. If you need more information on my setup, please ask.
It seems that __sync_val_compare_and_swap handles only up to 32-bit values. So when I remove the counter they use to eliminate the ABA problem, everything compiles and runs fine.
Does anyone know of the relevant 64-bit CAS instruction I should be using here?
As an additional question, are there better (faster) implementations of lock-free FIFO queues out there? I came across this paper by Nir Shavit et al., which seems interesting. I am wondering if others have seen similar efforts? Thanks.
Assuming GCC, try using the -march switch, something like this: -march=i686
There is also __sync_bool_compare_and_swap. I don't know if it's faster or not.
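To make the -march=i686 suggestion concrete, here is a minimal sketch (the node type and the pack/cas_pair helpers are mine, not from the Michael-Scott code): on 32-bit x86 the pointer and the ABA counter together fit in one 64-bit word, so __sync_bool_compare_and_swap can update both atomically via cmpxchg8b once -march=i686 or later is enabled. On x86-64 a pointer is already 64 bits, so a pointer-plus-counter update needs a double-word (128-bit) CAS, which at that time generally meant inline assembly.

    #include <stdint.h>

    struct node;  // hypothetical queue node type

    // Pack a 32-bit pointer and a 32-bit ABA counter into one 64-bit word.
    static inline uint64_t pack(struct node* ptr, uint32_t count)
    {
        return (uint64_t)(uintptr_t)ptr | ((uint64_t)count << 32);
    }

    static inline struct node* unpack_ptr(uint64_t word)
    {
        return (struct node*)(uintptr_t)(uint32_t)word;
    }

    static inline uint32_t unpack_count(uint64_t word)
    {
        return (uint32_t)(word >> 32);
    }

    // Try to swing *target from the previously observed value to a new
    // pointer/counter pair in one atomic step; returns non-zero on success.
    static inline int cas_pair(volatile uint64_t* target, uint64_t expected,
                               struct node* new_ptr, uint32_t new_count)
    {
        return __sync_bool_compare_and_swap(target, expected,
                                            pack(new_ptr, new_count));
    }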
GCC, last I looked in 2009, does not support contiguous double-word CAS. I had to implement it in inline assembly.
You can find my implementation of the M&S queue (including, in the abstraction layer, the assembly implementation of DCAS) and other lock-free data structures here:
http://www.liblfds.org
Briefly looking at the Nir Shavit et al. paper, the queue requires safe memory reclamation, which I suspect you'll need to implement - it won't be built into the queue. An SMR API will be available in the next release (in a couple of weeks).
Lock-free may not be what you want, since lock-free is not necessarily wait-free. If you need a fast thread-safe queue (not lock-free!), then consider using Threading Building Blocks concurrent_queue.
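A minimal usage sketch of the TBB suggestion, assuming TBB is installed and the program is linked with -ltbb:

    #include <tbb/concurrent_queue.h>

    int main()
    {
        tbb::concurrent_queue<int> q;

        q.push(42);                 // safe to call from multiple producer threads

        int value;
        if (q.try_pop(value)) {     // never blocks; returns false if the queue is empty
            // use value (42 here)
        }
        return 0;
    }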

Good resources for developing multithreaded programs with C++0x

I am looking for good books/resources that introduce how to use the thread library in C++0x. I have searched Amazon and SO without finding an informative post.
I asked a similar question myself recently: Where can I find good, solid documentation for the C++0x synchronization primitives?
And I got back a fantastic answer: C++ Concurrency in Action by Anthony Williams
The JustThread library at the end of that link also has good Doxygen documentation as well as implementations of a lot of the C++ threading stuff, though it's a commercial library :-/.
Lastly, you can get a pre-release PDF of this book. I've gotten it myself, and I can tell you that it's a pretty good book.
gcc/g++ implements more of this than they let on. While it's not yet complete, there is a decent implementation of the thread and future classes, and the atomic family of classes is also implemented, which allows for fairly fine-grained synchronization that you would normally only be able to achieve by getting memory-barrier instructions into your code by hand.
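For example, with a sufficiently recent g++ (invoked with -std=c++0x -pthread) a sketch like the following already compiles and runs; it only exercises the thread and atomic pieces mentioned above:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <iostream>

    std::atomic<int> counter(0);

    void work()
    {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // atomic increment
    }

    int main()
    {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.push_back(std::thread(work));
        for (unsigned i = 0; i < threads.size(); ++i)
            threads[i].join();
        std::cout << counter.load() << std::endl;  // prints 400000
        return 0;
    }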
Right now very few compilers (GCC, at least, does not) support the thread section of C++0x.
Therefore you have to use Boost, which closely follows the C++0x specification.
I find that the best resource for using boost libraries is their own online documentation, which can be found at http://www.boost.org/doc/libs/1_47_0/doc/html/thread.html.
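As a small illustration of how closely Boost.Thread mirrors the C++0x interface, here is a hedged sketch (link with -lboost_thread); swapping the boost:: prefix for std:: gives the equivalent C++0x code:

    #include <boost/thread.hpp>
    #include <iostream>

    boost::mutex io_mutex;

    void worker(int id)
    {
        // boost::lock_guard mirrors the C++0x std::lock_guard
        boost::lock_guard<boost::mutex> lock(io_mutex);
        std::cout << "hello from thread " << id << std::endl;
    }

    int main()
    {
        boost::thread t1(worker, 1);
        boost::thread t2(worker, 2);
        t1.join();
        t2.join();
        return 0;
    }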

How efficient is a try_lock on a mutex?

How efficient is a try_lock on a mutex? I.e. how many assembler instructions are likely involved, and how much time do they take, in both possible cases (i.e. the mutex was already locked, or it was free and could be locked)?
In case you have problems answering the question, here is how to approach it (in case it is really unclear):
If the answer depends a lot on the OS implementation and hardware: please answer it for common OSes (e.g. Linux, Windows, Mac OS X), recent versions of them (in case they differ a lot from earlier versions), and common hardware (x86, amd64, ppc, arm).
If that also depends on the library: Take pthread as an example.
Please also say whether they really differ at all, and if they do, please state the differences. I.e. what do they do differently? What common algorithms are around? Are there different algorithms, or have all common systems (common per the list above, if that is unclear) implemented mutexes in just the same way?
Per this Meta discussion, this really should be a separate question.
Also, I have asked this as a separate question from the performance of a lock because I am not sure whether try_lock may behave differently, perhaps also depending on the implementation. Then again, please answer it for common implementations. And this very similar/related question shows that this is an interesting question which can be answered.
A mutex is a logical construction that is independent of any implementation. Operations on mutexes therefore are neither efficient nor inefficient - they are simply defined.
Your question is therefore akin to asking "How efficient is a car?", without reference to what kind of car you might be talking about.
I could implement mutexes in the real world with smoke signals, carrier pigeons or a pencil and paper. I could also implement them on a computer. I could implement a mutex with certain operations on a Cray 1, on an Intel Core 2 Duo, or on the 486 in my basement. I could implement them in hardware. I could implement them in software in the operating system kernel, or in userspace, or using some combination of the two. I might simulate mutexes (but not implement them) using lock-free algorithms that are guaranteed conflict-free within a critical section.
EDIT: Your subsequent edits don't help the situation. "In a low level language (like C or whatever)" is mostly irrelevant, because then we're into measuring language implementation performance, and that's a slippery slope at best. "[F]rom pthread or whatever the native system library provides" is similarly unhelpful, because as I said, there are so many ways that one could implement mutexes in different environments that it's not even a useful comparison to make.
This is why your question is unanswerable.
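For what it's worth, here is what the non-blocking attempt the question asks about looks like with pthreads. The cost is indeed implementation-specific, but as a hedged data point, on a typical Linux/NPTL build an uncontended trylock on a default mutex is a single atomic compare-and-swap on the futex word in user space, and a contended one simply returns EBUSY without any system call:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void try_to_do_work(void)
    {
        // pthread_mutex_trylock never blocks: it returns 0 if the lock
        // was acquired and EBUSY if another thread already holds it.
        if (pthread_mutex_trylock(&m) == 0) {
            /* ... critical section ... */
            pthread_mutex_unlock(&m);
        } else {
            printf("lock was busy, doing something else\n");
        }
    }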

Question regarding Unix/Linux kernel programming

I would like to learn about Linux/Unix kernel programming for scalable multiprocessors (SMPs). I found this book, UNIX(R) Systems for Modern Architectures http://www.amazon.com/UNIX-Systems-Modern-Architectures-Multiprocessing/dp/0201633388/ref=pd_rhf_p_t_3 . Are there any other good resources, or a better book, since it was released in 1994? Thank you very much in advance.
Definitely buy this excellent book! You will get a thorough introduction to:
caches, their types, and how to deal with them in the kernel,
synchronization and what hardware primitives are behind it,
general kernel designs as related to concurrency (cli/sti, giant lock, cli+spinlock, etc.)
The book is general enough not to be out of date by now. The only thing I don't remember being mentioned there is NUMA, but I don't think there are any good published texts on this subject yet, except for maybe Gorman's Linux memory-management paper (somebody correct me if I'm wrong here).
I think the book was really worth the money.
Understanding the Linux Kernel is a great book about how the Linux kernel is built; it describes Linux 2.2, 2.4 and 2.6 (third edition).
If you want to write drivers, there's Linux Device Drivers, which is also a reference on how Linux is built.
For Linux, Rusty's Unreliable Guide to Kernel Locking is a must-read. After that, you can also read the file Documentation/spinlocks.txt located in the Linux kernel sources.

Lock-free standard collections and tutorials or articles

Does someone know of a good resource for implementations (meaning source code) of the usual lock-free data types? I'm thinking of lists, queues and so on.
Locking implementations are extremely easy to find, but I can't find examples of lock-free algorithms, how exactly CAS works, and how to use it to implement those structures.
Check out Julian M Bucknall's blog. He describes (in detail) lock-free implementations of queues, lists, stacks, etc.
http://www.boyet.com/Articles/LockfreeQueue.html
http://www.boyet.com/Articles/LockfreeStack.html
http://www.liblfds.org
Written in C.
If C++ is okay with you, take a look at boost::lockfree. It has lock-free Queue, Stack, and Ringbuffer implementations.
In the boost::lockfree::details section, you'll find a lock-free freelist and tagged pointer (ABA prevention) implementation. You will also see examples of explicit memory ordering via boost::atomic (an in-development version of C++0x std::atomic).
Neither boost::lockfree nor boost::atomic is part of Boost yet, but both have seen attention on the boost-development mailing list and are on the schedule for review.
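To illustrate how CAS is used to build such structures, here is a minimal Treiber-style lock-free stack sketch using std::atomic (boost::atomic offers the same interface). Node reclamation is deliberately omitted; handling it safely is the ABA/safe-memory-reclamation problem that the tagged-pointer machinery mentioned above exists to solve.

    #include <atomic>

    template <typename T>
    class lockfree_stack {
        struct node {
            T     value;
            node* next;
            explicit node(const T& v) : value(v), next(0) {}
        };

        std::atomic<node*> head_;

    public:
        lockfree_stack() : head_(0) {}

        void push(const T& v) {
            node* n = new node(v);
            n->next = head_.load(std::memory_order_relaxed);
            // Retry until head_ is swung from the value we observed to n;
            // on failure, compare_exchange_weak reloads n->next for us.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            }
        }

        bool pop(T& out) {
            node* n = head_.load(std::memory_order_acquire);
            // Retry until head_ is swung from n to n->next; on failure,
            // n is reloaded with the current head.
            while (n && !head_.compare_exchange_weak(n, n->next,
                                                     std::memory_order_acquire,
                                                     std::memory_order_acquire)) {
            }
            if (!n)
                return false;
            out = n->value;
            // NOTE: deleting n here would be unsafe without SMR/hazard pointers.
            return true;
        }
    };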
