Design of a high-performance sorted data structure read by many threads and written by few

Design of a high-performance sorted data structure read by many threads and written by few - multithreading

I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm answers about tackling this problem.
The requirements:
Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location
In a highly threaded application, many threads will look up values, to see if a specific pointer is within one of the (address, size) pairs - that is, if the pair defines a memory range, if the pointer is within any range in the list. Threads will much more rarely add or remove entries from this list.
Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second
Adding or removing values, ie mutating the list, happens much more rarely; performance is not as important
It is acceptable but not ideal for the list contents to be out of date, ie for a thread's lookup code to not find an entry that should exist, so long as at some point the entry will exist.
I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?
Tagged with Delphi since I am using that language for
this task. Language-agnostic answers are very welcome.
However, I probably cannot use any of the standard
libraries in any language without a lot of care. The reason is that memory access
(allocation, freeing, etc of objects and their internal memory, eg
tree nodes, etc) is strictly controlled and must go through my own
functions. My current code elsewhere in the same program uses
red/black trees and a bit trie, and I've written these myself. Object
and node allocation runs through custom memory allocation routines.
It's beyond the scope of the question, but is mentioned here to avoid
an answer like 'use STL structure foo.' I'm keen for an algorithmic or
structure answer that, so long as I have the right references or textbooks,
I can implement myself.

I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for the multi-read exclusive-write access. TMREWSync allows multiple readers simulatenous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.
If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.
EDIT: Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory adress) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1) though.

Exploring a bit the replication idea ...
From the correctness point of view, reader/writer locks will do the work. However,
in practice, while readers may be able to proceed concurrently and in parallel
with accessing the structure, they will create a huge contention on the lock, for the
obvious reason that locking even for read access involves writing to the lock itself.
This will kill the performance in a multi-core system and even more in a multi-socket
system.
The reason for the low performance is the cache line invalidation/transfer traffic
between cores/sockets. (As a side note, here's a very recent and very interesting study
on the subject Everything You Always Wanted to Know About
Synchronization but Were Afraid to Ask ).
Naturally, we can avoid inter core cache transfers, triggered by readers, by making
a copy of the structure on each core and restricting the reader threads to accessing only
the copy local to the core they are currently executing. This requires some mechanism for a thread to obtain its current core id. It also relies to on the operating system scheduler to not move gratuitously threads across cores, i.e. to maintain core affinity to some extent.
AFACT, most current operating systems do it.
As for the writers, their job would be to update all the existing replicas, by obtaining each lock for writing. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas. From the problem
description this seams acceptable. When a writer works, it will block readers on a single
core, but not all readers. The drawback is that a writer has the perform the same work
many times - as many time as there are cores or sockets in the system.
PS.
Maybe, just maybe, another alternative is some RCU-like approach, but I don't know
this well, so I'll just stop after mentioning it :)

With replication you could have:
- one copy of your data structure (list w/ binary search, the interval tree mentioned, ..) (say, the "original" one) that is used only for the lookup (read-access).
- A second copy, the "update" one, is created when the data is to be altered (write access). So the write is made to the update copy.
Once writing completes, change some "current"-pointer from the "original" to the "update" version. Involving an access-counter to the "original" copy, this one can be destroyed when the counter decremented back to zero readers.
In pseudo-code:
// read:
data = get4Read();
... do the lookup
release4Read(data);
// write
data = get4Write();
... alter the data
release4Write(data);
// implementation:
// current is the datat structure + a 'readers' counter, initially set to '0'
get4Read() {
lock(current_lock) { // exclusive access to current
current.readers++; // one more reader
return current;
}
}
release4Read(copy) {
lock(current_lock) { // exclusive access to current
if(0 == --copy.readers) { // last reader
if(copy != current) { // it was the old, "original" one
delete(copy); // destroy it
}
}
}
}
get4Write() {
aquire_writelock(update_lock); // blocks concurrent writers!
var copy_from = get4Read();
var copy_to = deep_copy(copy_from);
copy_to.readers = 0;
return copy_to;
}
release4Write(data) {
lock(current_lock) { // exclusive access to current
var copy_from = current;
current = data;
}
release4Read(copy_from);
release_writelock(update_lock); // next write can come
}
To complete the answer regarding the actual data structure to use:
Given the fixed size of the data-entries (two integer tuple), also being quite small, i would use an array for storage and binary search for the lookup. (An alternative would be a balanced tree mentioned in the comment).
Talking about performance: As i understand, the 'address' and 'size' define ranges. Thus, the lookup for a given address being within such a range would involve an addition operation of 'address' + 'size' (for comparison of the queried address with the ranges upper bound) over and over again. It may be more performant to store start- and end-adress explicitely, instead of start-adress and size - to avoid this repeated addition.

Read the LMDB design papers at http://symas.com/mdb/ . An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)

As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers
This is just to answer a comment by #fast, the comment space is not big enough ...
#chill: Where do you see the need to place any 'memory barriers'?
Everywhere, where you access shared storage from two different cores.
For example, a writer comes, make a copy of the data and then calls
release4Write. Inside release4write, the writer does the assignment
current = data, to update the shared pointer with the location of the new
data, decrements the counter of the old copy to zero and proceeds with deleting it.
Now a reader intervenes and calls get4Read. And inside get4Read it does copy = current. Since there's no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call or the new value of current may still reside in the writer's store queue or the reader may not have yet
seen and processed a corresponding cache invalidation request and whatnot ...
Now the reader happily proceeds to search in that copy of the data
that the writer is deleting or has just deleted. Oops!
But, wait, there's more! :D
With propper use if the > get..() and release..() functions, where do you see the problems to access deleted data or multiple deletion?
See the following interleaving of reader and write operations.
Reader Shared data Writer
====== =========== ======
current = A:0
data = get4Read()
var copy = A:0
copy.readers++;
current = A:1
return A:1
data = A:1
... do the lookup
release4Read(copy == A:1):
--copy.readers current = A:0
0 == copy.readers -> true
data = get4Write():
aquire_writelock(update_lock)
var copy_from = get4Read():
var copy = A:0
copy.readers++;
current = A:1
return A:1
copy_from == A:1
var copy_to = deep_copy(A:1);
copy_to == B:1
return B:1
data == B:1
... alter the data
release4Write(data = B:1)
var copy_from = current;
copy_form == A:1
current = B:1
current = B:1
A:1 != B:1 -> true
delete A:1
!!! release4Read(A:1) !!!
And the writer accesses deleted data and then tries to delete it again. Double oops!

Related

User defined atomic less than

I've been reading and it seems that std::atomic doesn't support a compare and swap of the less/greater than variant.
I'm using OpenMP and need to safely update a global minimum value.
I was thinking this would be as easy as using a built-in API.
But alas, so instead I'm trying to come up with my own implementation.
I'm primarily concerned with the fact that I don't want to use an omp critical section to do a less than comparison every single time because it may incur significant synchronization overhead for very little gain in most cases.
But in those cases where a new global minima is potentially found (less often), the synchronization overhead is acceptable. I'm thinking I can implement it using the following method. Hoping for someone to advise.
Use an std::atomic_uint as the global minima.
Atomically read the value into thread local stack.
Compare it against the current value and if it's less, attempt to enter a critical section.
Once synchronized, verify that the atomic value is still less than the new one and update accordingly (the body of the critical section should be cheap, just update a few values).
This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur or if it's bad, elaborate on why. Thanks.

What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().
This operation is not directly supported in hardware on x86, but machines with LL/SC could emit slightly more efficient asm for it than from emulating it with a CAS ( old, min(old,new) ) retry loop.
You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.
I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.
Only ever modify globmin while holding the lock (i.e. inside the critical section). Then you can update it + the associated data. It has to be atomic<> so readers that look at just globmin outside the critical section don't have data race UB. Readers that look at the associated extra data must take the lock that protects it and makes sure that updates of globmin + the extra data happen "atomically", from the perspective of readers that obey the lock.
static std::atomic<unsigned> globmin;
std::mutex globmin_lock;
static struct Extradata globmin_extra;
void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
{
// light-weight early out check to avoid the critical section
// No ordering requirement as long as globmin is monotonically decreasing with time
if (newmin < globmin.load(std::memory_order_relaxed))
{
// enter a critical section. Use OpenMP stuff if you want, this is plain ISO C++
std::lock_guard<std::mutex> lock(globmin_lock);
// Check globmin again, after we've excluded other threads from modifying it and globmin_extra
if (newmin < globmin.load(std::memory_order_relaxed)) {
globmin.store(newmin, std::memory_order_relaxed);
globmin_extra = newdata;
}
// else leave the critical section with no update:
// another thread raced with use *outside* the critical section
// release the lock / leave critical section (lock goes out of scope here: RAII)
}
// else do nothing
}
std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.
This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).
With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq-cst loads are just ordinary loads (and release loads are just ordinary stores, but seq_cst stores are more expensive). On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.

After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.

I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}

I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
DWORD usage;
bool delete;
// here add your data
_data() { usage=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when item is started to be used somwhere then
// start use of data[i]
data[i].cnt++;
after is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

Golang: Best way to read from a hashmap w/ mutex

This is a continuation from here: Golang: Shared communication in async http server
Assuming I have a hashmap w/ locking:
//create async hashmap for inter request communication
type state struct {
*sync.Mutex // inherits locking methods
AsyncResponses map[string]string // map ids to values
}
var State = &state{&sync.Mutex{}, map[string]string{}}
Functions that write to this will place a lock. My question is, what is the best / fastest way to have another function check for a value without blocking writes to the hashmap? I'd like to know the instant a value is present on it.
MyVal = State.AsyncResponses[MyId]

Reading a shared map without blocking writers is the very definition of a data race. Actually, semantically it is a data race even when the writers will be blocked during the read! Because as soon as you finish reading the value and unblock the writers - the value may not exists in the map anymore.
Anyway, it's not very likely that proper syncing would be a bottleneck in many programs. A non-blocking lock af a {RW,}Mutex is probably in the order of < 20 nsecs even on middle powered CPUS. I suggest to postpone optimization not only after making the program correct, but also after measuring where the major part of time is being spent.

How are "nonblocking" data structures possible?

I'm having trouble understanding how any data structure can be "nonblocking".
Say you're making a "nonblocking" hashtable. At some point or another, your hashtable gets too full, so you have to re-hash into a larger table.
This implies you need to allocate memory, which is a global resource. So it seems that you must obtain some sort of lock to prevent global corruption of the heap... irrespective of possible problems with your data structure itself!
But then that means every other thread must block while you allocate your memory...
What am I missing here?
(How) can you allocate memory without blocking another thread which is doing the same?

Two examples for non blocking designs are optimistic design and Transactional Memory.
The idea of this is - in most of the cases, the blocking is redundant - since two OPs can concurrently occur without interrupting each other. However, sometimes when 2 OPs occur concurrently and the data becomes corrupted because of it - you can roll back to your previous state, and retry.
There might still be locks in these designs, but the time the data is locked is significantly shorter, and is limited only to the critical time where the affect of the OP is taking place.

Just for some definitions, additional information and to distinguish between non-blocking, lock-free and wait-free terms, I recommend reading the following article (I won't copy the relevant passages here as it's too long):
Definitions of Non-blocking, Lock-free and Wait-free

Most strategies have one fundamental pattern in common. They use a compare and swap (CAS) operation in a loop until it succeeds.
For example, lets consider a stack implemented with a linked list. I chose a linked list implementation because it is easy to make concurrent with a CAS, but there are other ways to do it. I will use C-like pseudocode.
Push(T item)
{
Node node = new Node(); // allocate node memory
Node initial;
do
{
initial = head;
node.Value = item;
node.Next = initial;
}
while (CompareAndSwap(head, node, initial) != initial);
}
Pop()
{
Node node;
Node initial;
do
{
initial = head;
node = initial.Next;
}
while (CompareAndSwap(head, node, initial) != initial);
T value = initial.Value;
delete initial; // deallocate node memory
return value;
}
In the above code CompareAndSwap is a non-blocking atomic operation that replaces the value in a memory address with a new value and returns the old value. If the old value does not match the expected value then you spin through the loop and try it all again.

All that non-blocking means is that you never wait indefinitely, not that you never wait at all. As long as your heap is also implemented using a non-blocking algorithm, you can implement other non-blocking algorithms on top of it.

Cassandra - unique constraint on row key

I would like to know whenever it is possible in Cassandra to specify unique constrain on row key. Something similar to SQL Server's ADD CONSTRAINT myConstrain UNIQUE (ROW_PK)
In case of insert with already existing row key, the existing data will be not overwritten, but I receive kind of exception or response that update cannot be performed due to constrain violation.
Maybe there is a workaround for this problem - there are counters which updates seams to be atomic.

Lightweight transactions?
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
INSERT INTO customer_account (customerID, customer_email)
VALUES (‘LauraS’, ‘lauras#gmail.com’)
IF NOT EXISTS;

Unfortunately, no, because Cassandra does not perform any checks on writes. In order to implement something like that, Cassandra would have to do a read before every write, to check whether the write is allowed. This would greatly slow down writes. (The whole point is that writes are streamed out sequentially without needing to do any disk seeks -- reads interrupt this pattern and force seeks to occur.)
I can't think of a way that counters would help, either. Counters are not implemented using an atomic test-and-set. Instead, they essentially store lots of deltas, which are added together when you read the counter value.

Cassandra - a unique constraint can be implemented with the help of a primary key constrain. You need to put all the columns as primary key, those you want to be unique. Cassandra will tackle the rest on it own.
CREATE TABLE users (firstname text, lastname text, age int,
email text, city text, PRIMARY KEY (firstname, lastname));
It means Cassandra will not insert two different rows in this users table when firstname and lastname are the same.

I feel good today and I will not downvote all the other posters for saying that it is not even remotely possible to create a lock with just and only a Cassandra cluster. I just implemented the Lamport's bakery algorithm¹ and it works just fine. No need for any other strange thing like zoos, cages, memory tables, etc.
Instead you can implement a poor's man multi-process / multi-computer locking mechanism as long as you can obtain a read and write with at least QUORUM consistency. That's all you really need to be able to properly implement this algorithm. (the QUORUM level can change depending on the type of lock that you need: local, rack, full network.)
My implementation will appear in version 0.4.7 of libQtCassandra (in C++). I already tested and it locks perfectly. There are a few more things I want to test and let you define a set of parameters that right now are hard coded. But the mechanism works perfectly.
When I found this thread I thought that something was wrong. I searched some more and found a page on Apache which I mention below. The page is not very advanced but their MoinMoin does not offer a discussion page... Anyway, I think that it was worth mentioning. Hopefully people will start implementing that locking mechanism in all sorts of languages such as PHP, Ruby, Java, etc. so it gets used and known that it works.
Source: http://wiki.apache.org/cassandra/Locking
¹ http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
The following is more or less the way I implemented my version. This is just a simplified synopsis. I may need to update it some more because I did some enhancements while testing the resulting code (also the real code uses RAII and includes a timeout capability on top of the TTL.) The final version will be found in the libQtCassandra library.
// lock "object_name"
void lock(QString object_name)
{
QString locks = context->lockTableName();
QString hosts_key = context->lockHostsKey();
QString host_name = context->lockHostName();
int host = table[locks][hosts_key][host_name];
pid_t pid = getpid();
// get the next available ticket
table[locks]["entering::" + object_name][host + "/" + pid] = true;
int my_ticket(0);
QCassandraCells tickets(table[locks]["tickets::" + object_name]);
foreach(tickets as t)
{
// we assume that t.name is the column name
// and t.value is its value
if(t.value > my_ticket)
{
my_ticket = t.value;
}
}
++my_ticket; // add 1, since we want the next ticket
table[locks]["tickets::" + object_name][my_ticket + "/" + host + "/" + pid] = 1;
// not entering anymore, by deleting the cell we also release the row
// once all the processes are done with that object_name
table[locks]["entering::" + object_name].dropCell(host + "/" + pid);
// here we wait on all the other processes still entering at this
// point; if entering more or less at the same time we cannot
// guarantee that their ticket number will be larger, it may instead
// be equal; however, anyone entering later will always have a larger
// ticket number so we won't have to wait for them they will have to wait
// on us instead; note that we load the list of "entering" once;
// then we just check whether the column still exists; it is enough
QCassandraCells entering(table[locks]["entering::" + object_name]);
foreach(entering as e)
{
while(table[locks]["entering::" + object_name].exists(e))
{
sleep();
}
}
// now check whether any other process was there before us, if
// so sleep a bit and try again; in our case we only need to check
// for the processes registered for that one lock and not all the
// processes (which could be 1 million on a large system!);
// like with the entering vector we really only need to read the
// list of tickets once and then check when they get deleted
// (unfortunately we can only do a poll on this one too...);
// we exit the foreach() loop once our ticket is proved to be the
// smallest or no more tickets needs to be checked; when ticket
// numbers are equal, then we use our host numbers, the smaller
// is picked; when host numbers are equal (two processes on the
// same host fighting for the lock), then we use the processes
// pid since these are unique on a system, again the smallest wins.
tickets = table[locks]["tickets::" + object_name];
foreach(tickets as t)
{
// do we have a smaller ticket?
// note: the t.host and t.pid come from the column key
if(t.value > my_ticket
|| (t.value == my_ticket && t.host > host)
|| (t.value == my_ticket && t.host == host && t.pid >= pid))
{
// do not wait on larger tickets, just ignore them
continue;
}
// not smaller, wait for the ticket to go away
while(table[locks]["tickets::" + object_name].exists(t.name))
{
sleep();
}
// that ticket was released, we may have priority now
// check the next ticket
}
}
// unlock "object_name"
void unlock(QString object_name)
{
// release our ticket
QString locks = context->lockTableName();
QString hosts_key = context->lockHostsKey();
QString host_name = context->lockHostName();
int host = table[locks][hosts_key][host_name];
pid_t pid = getpid();
table[locks]["tickets::" + object_name].dropCell(host + "/" + pid);
}
// sample process using the lock/unlock
void SomeProcess(QString object_name)
{
while(true)
{
[...]
// non-critical section...
lock(object_name);
// The critical section code goes here...
unlock(object_name);
// non-critical section...
[...]
}
}
IMPORTANT NOTE (2019/05/05): Although it was a great exercise to get the Lamport's Bakery implemented using Cassandra, it is an anti-pattern for a Cassandra database. This means it is likely to perform poorly on a heavy load. Since then I created a new lock system, still using the Lamport's Algorithm, but keeping all of the data in memory (it's very small) and still allowing multiple computers to participate in the lock so if one goes down, the lock system continues to work as expected (many of the other lock systems do not have that capability. When the master goes down, you lose your lock capability until another computer decides to itself become the new master...)

Obviously you cannot
In cassandra all your writes are reflected in
Commit log
Memtable
to scale million writes & durability
If we consider your case. Before doing this cassandra need to
Check for existence in Memtable
Check for existence in all sstables [If your key is flushed from Memtable]
In the case 2 all though, cassandra has implemented bloom filters it is going to be a overhead. Every write is going to be a read & write
But your request can reduce merge overhead in cassandra because at anytime the key is going to be there in only one sstable. But cassandra's architecture will have to be changed for that.
Jus check this video http://blip.tv/datastax/counters-in-cassandra-5497678 or download this presentation http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf to see how counters have come in to cassandra's existence.

One possibility is to use Cages and ZooKeeper:
http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string