Why doesn't Cassandra use logical clocks?

From what I have read so far, Cassandra uses timestamps provided by the client or the coordinator to resolve conflicts. If Cassandra receives a write for a cell that already exists, it keeps the value with the higher timestamp.
With clock skew, even when there are no concurrent updates and the ALL consistency level is used, a client can update a value and receive an ACK from all servers while the stored value stays unchanged, because the supplied timestamp was older than the one already in that cell. Such behaviour violates causal consistency, which AFAIK R+W>N was supposed to provide?
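A minimal sketch of the scenario described above, as a toy model rather than Cassandra's actual code. The `Cell` and `Replica` names and the timestamp values are made up for illustration, and tie-breaking between equal timestamps is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # microseconds since epoch, as supplied by the writer


class Replica:
    """Toy replica that resolves conflicts by keeping the highest timestamp."""

    def __init__(self):
        self.cells = {}

    def write(self, key, value, timestamp):
        current = self.cells.get(key)
        # The write is acknowledged either way; it only takes effect
        # if its timestamp is newer than what is already stored.
        if current is None or timestamp > current.timestamp:
            self.cells[key] = Cell(value, timestamp)
        return "ACK"

    def read(self, key):
        return self.cells[key].value


replicas = [Replica() for _ in range(3)]

# Client A's clock runs a few seconds fast.
for replica in replicas:
    replica.write("k", "from A", timestamp=1_000_005_000_000)

# Client B writes later in real time, but with a lower timestamp,
# and still gets an ACK from every replica (consistency level ALL).
acks = [replica.write("k", "from B", timestamp=1_000_001_000_000)
        for replica in replicas]
values = [replica.read("k") for replica in replicas]

print(acks)    # every replica acknowledged B's write
print(values)  # yet every replica still holds A's value
```

B's write is acknowledged everywhere and then never observed, which is exactly the behaviour the question is about.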
It seems to me that using logical clocks (Lamport/vector clocks) to pick the newest value, and falling back to actual timestamps (or another strategy supplied by the client) only when a concurrent update is detected during read repair, would be a better solution. AFAIK this is more or less the approach that Dynamo uses, right?
As I am probably missing something, can you let me know why Cassandra doesn't use such an approach?

Cassandra is an eventually consistent system, and when it was designed (at Facebook) the engineers had to decide how to handle conflicts. They had several options: Last Update Wins, a code handler invoked on conflict, delegating conflict resolution to clients, etc.
I guess they went with Last Update Wins for its simplicity. It has several edge cases, but they were designing Cassandra for their own purpose and that approach was working for them.
The approach you are talking about is valid: the system returns all conflicting values to the client and the client decides what to do about them. It does add extra complexity to client code, which may not be desirable.
Edit based on comment: why wall clock and not logical clock
Logical clocks (vector clocks) help to detect concurrent updates, but they don't help to actually decide how to resolve the conflict. E.g. if there are two concurrent updates to the same key, vector clocks will detect them, but there is no way to decide which one to use.
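To make this concrete, here is a small sketch of vector clock comparison. The dict representation and the `compare` helper are assumptions for illustration, not any particular database's implementation.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts {node_id: counter}.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = {"n1": 2, "n2": 1}
v2 = {"n1": 2, "n2": 2}  # saw v1, then n2 wrote again
v3 = {"n1": 3, "n2": 1}  # saw v1, then n1 wrote again

print(compare(v1, v2))  # before
print(compare(v2, v3))  # concurrent
```

For the "concurrent" case the comparison gives no winner, so some extra rule (wall clock, merge function, or asking the client) is still needed.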
Since Cassandra does not return conflicting versions (by design) and does not merge them, it needs a way to decide which record to use. The designers chose the Last Update Wins strategy, and one way to implement that strategy is to compare wall-clock timestamps.
P.S. A Lamport timestamp would provide a total order, but it requires a completely different flow of data in the system.

As per the CAP theorem, in case of a network partition a strongly consistent system will have downtime. Logical clocks are strongly consistent, so during a partition they will have downtime too.
In a practical sense, when you implement a logical clock, you use one of the quorum-based algorithms, which becomes unavailable on the side of the partition that has fewer nodes. So during a partition, in your example, with a logical clock either A or B will take writes, and the other node will not have access to the logical clock and will be incapable of serving writes.
Cassandra developers had three choices:
1. Use a logical clock and give up availability during partitioning.
2. Use the wall clock and run into the problem you have called out.
3. Let the client choose 1 or 2: the client passes a large enough integer, which can be generated with 1 or 2 on the client side.
Cassandra went with 3, but also provides a server-side default of 2 to simplify clients that don't need a logical clock. How to generate logical time on the client side that is the same size integer as a clock time (in millis) is a separate (solved) problem.
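One common way to produce such a client-side integer is to take the wall clock when it moves forward and otherwise bump the previous value. The sketch below is only an illustration of that idea: the `MonotonicTimestamps` class is hypothetical, and it only guarantees ordering for writes issued through the same generator, not across clients with skewed clocks.

```python
import time


class MonotonicTimestamps:
    """Issue microsecond timestamps that never go backwards for this client."""

    def __init__(self, clock=lambda: time.time_ns() // 1_000):
        self._clock = clock
        self._last = 0

    def next_timestamp(self):
        # Use wall-clock time when it moves forward; otherwise bump the
        # previous value by one so later writes always get a larger number.
        self._last = max(self._clock(), self._last + 1)
        return self._last


# Simulated clock that jumps backwards after the first reading.
readings = iter([100, 90, 90, 200])
generator = MonotonicTimestamps(clock=lambda: next(readings))
stamps = [generator.next_timestamp() for _ in range(4)]
print(stamps)  # [100, 101, 102, 200]
```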

Related

Why do we need total order across view changes in consensus protocols?

In their famous article, Miguel Castro and Barbara Liskov justify the commit phase of the PBFT consensus protocol like this:
This ensures that replicas agree on a total order for requests in the
same view but it is not sufficient to ensure a total order for
requests across view changes. Replicas may collect prepared
certificates in different views with the same sequence number and
different requests. The commit phase solves this problem as follows.
Each replica i multicasts <COMMIT, v, n, i>_{α_i} saying it has the
prepared certificate and adds this message to its log. Then each
replica collects messages until it has a quorum certificate with 2 f +
1 COMMIT messages for the same sequence number n and view v from
different replicas (including itself). We call this certificate the
committed certificate and say that the request is committed by the
replica when it has both the prepared and committed certificates.
But why exactly do we need to guarantee total order across view changes?
If a leader/primary replica fails and triggers a view change, wouldn't it suffice to discard everything from the previous view? What situation does the commit phase prevent that this solution does not?
Apologies if this is too obvious. I'm new to distributed systems and I haven't found any source which directly answers this question.

There is a conceptual reason for this. The system appears to a client as a black box. The whole idea of this box is to provide reliable access to some service, so it should mask the failures of any particular replica. If you discard everything at each view change, clients will constantly lose their data. So basically, your solution contradicts the specification. The commit phase is needed exactly to prevent this kind of situation. If a request is "accepted" only when there are 2f + 1 COMMIT messages, then even if f replicas are faulty, the remaining nodes can recover all committed requests, which provides durable access to the system.
There is also a technical reason. In theory the system is asynchronous, which means you can't even guarantee that a view change occurs only as a result of a failure. Some replicas may merely suspect that the leader is faulty and change the view. With your solution, the system could discard everything it has accepted even if none of the replicas is faulty.
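The reason 2f + 1 COMMIT messages are enough can be checked with a little arithmetic. The sketch below assumes the standard PBFT setting of n = 3f + 1 replicas, which is not stated in the quoted passage; the helper name is made up.

```python
def min_quorum_overlap(f):
    """Smallest overlap of two quorums of size 2f + 1 among n = 3f + 1 replicas."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    return 2 * quorum - n  # inclusion-exclusion lower bound


for f in range(1, 5):
    overlap = min_quorum_overlap(f)
    # With at most f faulty replicas, at least (overlap - f) members
    # of the intersection are correct.
    print(f, overlap, overlap - f)
```

Any two quorums share at least f + 1 replicas, so at least one correct replica that saw the committed request also takes part in the quorum that starts the next view. That is how the request survives the view change.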
If you're new to distributed systems, I suggest you have a look at the classic protocols tolerating non-Byzantine failures (e.g., Paxos); they are simpler but solve the problems in a similar way.
Edit
When I say "clients constantly lose their data" it is a bit more than it sounds. I'm talking about the impact of a particular client request on the system. Let's take a key-value store. A client A associates some value with some key via our "black box". The "black box" now orders this request with respect to any other concurrent (or simply parallel) requests. It then replicates it across all replicas and finally notifies A. Without the commit phase there is no ordering, and in two different views our "black box" can choose two different orders of execution of client requests. That being said, the following is possible:
1. at time t, A associates value with key and the "box" approves this,
2. at time t+1, B associates value_2 with key and the "box" approves this,
3. at time t+2, C reads value_2 from key,
4. view change (invisible to clients),
5. at time t+3, D reads value from key.
Note that (5) is possible not because the "box" is not aware of value_2 (as you mentioned, the value itself can be resubmitted) but because it is not aware that it previously wrote value first and then overwrote it with value_2. In the new view, the system needs to somehow order those two requests, but no luck: the decision is not coherent with the past.
The eventual synchrony is a way to guarantee liveness of the protocols, however, it cannot prevent the situations described above. Eventual synchrony states that eventually your system will behave much like the synchronous one, but you don't know when, before that time any kind of weird things can happen. If during the asynchronous period a safety property is violated, then obviously the whole system is not safe.

The output of PBFT should not be one log per view, but rather an ever-growing global log to which every view tries to contribute new 'blocks'.
The equivalent notion in a blockchain is that each block proposer, or block miner, must append to the current blockchain, instead of starting its new blockchain from scratch. I.e. new blocks must respect previous transactions, the same way new views must respect previous views.
If the total ordering is not consistent across views, then we lose the property above.
In fact if we force a view change after every sequence number in PBFT, it looks a lot like blockchain, but with a much more complicated recovery/safety mechanism (in part since PBFT blocks don't commit to the previous block, so we need to agree on each of them individually)

What are alternatives to Locks and critical sections in ZeroMQ (guide)

I am reading ZeroMQ-the guide, currently on Chapter 4, for those of you who know.
http://zguide.zeromq.org/page:all
I am working in python with the binding pyzmq.
The author says that we should forget everything we know about concurrent programming, never use locks and critical sections etc.
Right now I am doing a pet project for fun with ZeroMQ. I have a piece of data which is shared between some threads (don't worry, my threads don't pass sockets). They are sharing a single database.
My question is:
Should I put a lock around that piece of data to avoid race conditions, as one normally would in order to serialize access, or is this something to avoid when using ZeroMQ because better alternatives exist?
I remember the author saying that one should always share data between threads using inproc:// or ipc:// ( for processes ), but I am not sure how that fits here.

A lot of thrilling FUN in doing this, indeed
Yes, Pieter Hintjens does not advise any step outside of his Zen-of-Zero. Share nothing, lock nothing ... as you have noticed many times in his great book already.
What is the cornerstone of the problem -- protecting the [TIME]-domain consistent reflections of the Database-served pieces of data.
In distributed-system, the [TIME]-domain problem spans a non-singular landscape, so more problems spring out of the box.
If we know, there is no life-saving technology connected to the system ( regulatory + safety validations happily do not apply here ), there could be a few tricks to solve the game with ZeroMQ and without any imperative locks.
Use .setsockopt( { ZMQ.CONFLATE | ZMQ.IMMEDIATE }, 1 ) infrastructure, with .setsockopt( ZMQ.AFFINITY, ... ) and .setsockopt( ZMQ.TOS, ... ) latency-tweaked hardware-infrastructure + O/S + kernel setups end-to-end, where possible.
A good point to notice is also that of the Python threading-model still goes under GIL-lock-stepping, so the principal collision-avoidance is already in-place.
An indeed very hard-pet-project :( ZeroMQ cannot and will not avoid problems )
Either
co-locate the decision-making process(es), so as to have an almost-zero latency on taking decisions on updated data,
or
permit a non-local decision taking, but make them equipped with some robust rules, based on latency of uncompensated ( principally impossible until a Managed Quantum Entanglement API gets published and works indeed at our service ) [TIME]-domain stamped event-notifications - thus having also a set of rules controlled chances to moderate DB data-reflection's consistency corner cases, where any "earlier"-served DB-READ has been delivered "near" or "after" a known DB-WRITE has changed its value - both visible to the remote-observer at an almost-the-same time.
Database's own data-consistency is maintained by the DBMS-engine per se. No need to care here.
Let's imagine the Database-accesses being mediated "through" ZeroMQ communication tools. The risk is not in the performance-scaling, where ZeroMQ enjoys an almost-linear scaling, the problem is in the said [TIME]-domain consistency of the served "answer" anywhere "behind" the DBMS-engine's perimeter, the more once we got into the distributed-system realms.
Why?
A DBMS-engine's "locally" data-consistent "answer" to a read-request, served by the DBMS-engine right at a moment of UTC:000,000,000.000 000 000 [s] will undergo a transport-class specific journey, along the intended distribution path, but -- due to principal reasons -- does not get delivered onto a "remote"-platform until UTC:000,000,000.??? ??? ??? [s] ( depending on the respective transport-class and intermediating platforms' workloads ).
Next, there may be and will be an additional principal inducted latency, caused from workloads of otherwise uncoordinated processes requests, principally concurrent in their respective appearance, that get later somehow aligned into a pure-serial queue, be it due to the ZeroMQ Context()'s job or someone else's one. Dynamics of queue-management and resources-(un-)availability during these phases add another layer of latency-uncertainties.
Put together, one may ( and ought ) fight as a herd of lions for any shaving-off of the latency costs alongside the { up | down }-the-paths, yet getting a minimum + predictable latency is more a wish, than any easily achievable low hanging fruit.
DB-READ-s may seem to be principally easy to serve, as they might appear as lock-free and un-orchestrated among themselves, yet the very first DB-WRITE-request may invalidate some of the already scheduled "answer"-s that were not yet sent out onto the wire ( and each such piece of data ought to have got updated / replaced by a DBMS-local [TIME]-domain "freshmost" piece of data --- as no one is willing to dog-fight, much less to receive further shots from a plane that was already known to have been shot down a few moments ago, is she ... ? )
These were lectures already learnt during collecting pieces of distributed-system smart designs' experience ( Real Time Multiplayer / Massive Ops SIMs being the best of the best ).
inproc:// transport-class is the best tool for co-located decision taking, yet will not help you in this in Python-based ecosystems ( ref. GIL-lock-stepping has enforced a pure-serial execution and latency of GIL-release goes ~3~5 orders of magnitude above the almost-"direct"-RAM-read/write-s ).
ipc:// transport-class sockets may span inter-process communications co-located on the same host, yet if one side goes in python, still the GIL-lock-stepping will "chop" your efforts of minimising the accumulated latencies as regular blocking will appear in GIL-interval steps and added latency-jitter is the worst thing a Real-Time distributed-system designer is dreaming about :o)
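For the "share nothing, lock nothing" idea itself, a rough sketch of the usual shape follows: one thread owns the data and every other thread talks to it only through messages. To stay runnable with the standard library, it uses queue.Queue as a stand-in for a pair of inproc:// sockets; the `owner` and `call` helpers are hypothetical names. With pyzmq the owner would typically hold a REP or ROUTER socket bound to an inproc:// endpoint instead.

```python
import queue
import threading


def owner(requests):
    """Only this thread ever touches `data`, so no lock is needed around it."""
    data = {}
    while True:
        op, key, value, reply = requests.get()
        if op == "stop":
            break
        if op == "set":
            data[key] = value
            reply.put(True)
        elif op == "get":
            reply.put(data.get(key))


def call(requests, op, key=None, value=None):
    """Send one request to the owner thread and wait for its reply."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, key, value, reply))
    return reply.get()


requests = queue.Queue()
owner_thread = threading.Thread(target=owner, args=(requests,))
owner_thread.start()


def worker(i):
    call(requests, "set", f"k{i}", i)


workers = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

results = [call(requests, "get", f"k{i}") for i in range(8)]
requests.put(("stop", None, None, None))
owner_thread.join()
print(results)
```

The queue synchronises internally, much as a ZeroMQ socket does, but the application code contains no locks or critical sections: access to the data is serialized by the order in which messages arrive.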

EventSourcing race condition

Here is a nice article that describes what ES is and how to deal with it.
Everything is fine there, but one image is bothering me.
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we ensure that we don't book more seats than available? This is especially a problem if there are many concurrent requests.
It may happen that n aggregates are populated with the same amount of reserved seats, and all of these aggregate instances allow reservations.

I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... how do we avoid booking more seats than we have, especially with many concurrent requests?
All events are private to the command running them until the book of record acknowledges a successful write. So we don't share the events at all, and we don't report back to the caller, without knowing that our version of "what happened next" was accepted by the book of record.
The write of events is analogous to a compare-and-swap of the tail pointer in the aggregate history. If another command has changed the tail pointer while we were running, our swap fails, and we have to mitigate/retry/fail.
In practice, this is usually implemented by having the write command to the book of record include an expected position for the write. (Example: ES-ExpectedVersion in GES).
The book of record is expected to reject the write if the expected position is in the wrong place. Think of the position as a unique key in a table in a RDBMS, and you have the right idea.
This means, effectively, that the writes to the event stream are actually consistent -- the book of record only permits the write if the position you write to is correct, which means that the position hasn't changed since the copy of the history you loaded was written.
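The expected-position check can be sketched with an in-memory stand-in for the book of record. The `EventStore` API below is hypothetical (it is not the GES client API), and a real store would perform the compare and the append atomically.

```python
class ConcurrencyError(Exception):
    pass


class EventStore:
    """In-memory stand-in for the book of record (hypothetical API)."""

    def __init__(self):
        self._streams = {}

    def load(self, stream_id):
        events = list(self._streams.get(stream_id, []))
        return events, len(events)  # the history plus its version

    def append(self, stream_id, new_events, expected_version):
        stream = self._streams.setdefault(stream_id, [])
        # Compare-and-swap on the tail: reject if someone appended since load.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}"
            )
        stream.extend(new_events)


store = EventStore()

# Two commands load the same history (version 0) at the same time ...
_, version_a = store.load("show-1")
_, version_b = store.load("show-1")

# ... and both decide to reserve a seat, but only the first append succeeds.
store.append("show-1", [{"type": "SeatReserved", "customer": "alice"}], version_a)
try:
    store.append("show-1", [{"type": "SeatReserved", "customer": "bob"}], version_b)
    outcome = "accepted"
except ConcurrencyError:
    outcome = "rejected"

print(outcome)  # rejected
```

The losing command then reloads the stream, re-runs its validation against the new history (which now contains the other reservation), and retries or fails.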
It's typical for commands to read event streams directly from the book of record, rather than the eventually consistent read models.
It may happen that n-AggregateRoots will be populated with the same amount of reserved seats, it means having validation in the reserve method won't help, though. Then n-AggregateRoots will emit the event of successful reservation.
Every bit of state needs to be supervised by a single aggregate root. You can have n different copies of that root running, all competing to write to the same history, but the compare and swap operation will only permit one winner, which ensures that "the" aggregate has a single internally consistent history.

There are going to be a couple of ways to deal with such a scenario.
First off, an event stream would have as its current version the version of the last event added. This means that you would not, or should not, be able to persist the event stream if it is no longer at the version it had when you loaded it. Since the very first write would cause the version of the event stream to increase, the second write would not be permitted. Since events are not emitted, per se, but are rather a result of the event sourcing, we would not have the type of race condition in your example.
Well, if your commands are processed behind a queue any failures should be retried. Should it not be possible to process the request you would enter the normal "I'm sorry, Dave. I'm afraid I can't do that" scenario by letting the user know that they should try something else.
Another option is to start the processing by issuing an update against some table row to serialize any calls to the aggregate. Probably not the most elegant but it does cause a system-wide block on the processing.
I guess, to a large extent, one cannot really trust the read store when it comes to transactional processing.
Hope that helps :)

C# Algorithmic Stock Trading

We are working on algorithmic trading software in C#. We monitor the market price and then, based on certain conditions, we want to buy the stock.
User input is taken from the GUI (WPF) and sent to the back end for monitoring.
The back end receives data continuously from the stock exchange and checks whether the user-entered price is met within certain limits and conditions. If all are satisfied, we buy / sell the stock (in Futures FUT).
Now, I want to design my Back end service.
I need Task Parallel Library or Custom Thread Pool where I want to create my tasks / threads / pool when application starts (may be incremental or fixed say 5000).
All will be in waiting state.
Once the user creates an algorithm, we will activate one thread from the pool, which monitors the price for each incoming string. If it matches, it buys / sells and then goes into the waiting state again. (I don't want to create and destroy the threads / tasks, as that is time-consuming.)
So can you please help me in this regard? Is the above approach good, or is there a better one?
I am stuck on this idea and not able to think outside the box.

The above approach is definitely not "good"
Given the idea above, the architecture is wrong in many cardinal aspects. If your Project aspires to survive in 2017+ markets, try to learn from mistakes already taken in 2007-2016 years.
The percentages demonstrate the NBBO flutter for all U.S. Stocks from 2007-01 ~ 2012-01. ( Lower values means better NBBO stability. Higher values: Instability ) ( courtesy NANEX )
Financial Markets operate on nanosecond scales
Yes, a few inches of glass-fibre signal propagation transport delay decide on PROFIT or LOSS.
If you plan to trade in stock markets, your system will observe the HFT crowd doing dirty practices such as Quote Stuffing and Vacuum-Cleaning right in front of your nose, at such scales that your single-machine multi-threaded execution will just move through thin air or fall into a gap already created many microseconds before your decision took place on your localhost CPU.
The rise of HFT from 2007-01 ~ 2012-01 ( courtesy NANEX ).
May read more about an illusion of liquidity here.
See the expansion of Quotes against the level of Trades:
( courtesy NANEX )
Even if one decides to trade in a single instrument, on FX, the times are prohibitively short ( more than 20% of the ToB Bids change in less than 2 ms and do not reach your localhost in time for your trading algorithm to react ).
If your TAMARA-measurements are similar to this, at your localhost, simply forget to trade in any HF/MF/LF-HFT instruments -- you simply do not see the real market ( the tip of the iceberg ) -- as the +20% price-events happen in the very first column ( 1 .. 2 ms ), where you do not see any single event at all!

5000 threads is bad; don't ever do that. You'll lose much more performance to context switching than you gain from parallel execution. Traditionally, the number of threads for your application should by default equal the number of cores in your system. There are other possible variants, but they probably aren't the best option for you.
So you can use a ThreadPool with a work item method running an infinite loop, which is very low level, but gives you control over what is going on in your system. A callback function could update the UI so the user is notified about the trading results.
However, if you can use the TPL, I suggest considering these two options for your case:
1. Use a collection of tasks running forever, checking for new trading requests. You should still tune the number of simultaneously running tasks, because you probably don't want them to fight each other for CPU time. As LongRunning tasks are created with a dedicated background thread, many of them will degrade your application's performance as well. In this approach you may want to introduce a strategy pattern implementation for the algorithm being run inside the task.
2. Set up a TPL Dataflow process within your application. For this approach you should encapsulate the info about the algorithm inside a DTO object and introduce a pipeline:
   - BufferBlock for storing all the incoming requests. Maybe you can use a BroadcastBlock here, if you want to check the sell or buy options in parallel. You can link the blocks with a boolean predicate so that different blocks process different types of requests.
   - ActionBlock (maybe one block for each user algorithm) for running the algorithmic check for the pattern on which you base the decision.
   - ActionBlock for storing all the buy / sell requests for data that successfully passed the algorithm.
   - BufferBlock for the UI reaction with Reactive Extensions (Introductory book for Rx, if you aren't familiar with it).
This solution still has to be tuned via the block creation options, and it gives you more insight into how exactly your data flows through the trading algorithm, the speed of decision making, and overall performance. You should examine the defaults for TPL Dataflow blocks carefully; you can find them in the official documentation. Another good place to start is Stephen Cleary's introductory blog posts (Part 1, Part 2, Part 3) and chapter 4 about this library in his book.

With C# 5.0, the natural approach is to use async methods running on top of the default thread pool.
This way, you are creating Tasks quite often, but the most prominent cost of that is in GC. And unless you have very high performance requirements, that cost should be acceptable.

I think you would be better with an event loop, and if you need to scale, you can always shard by stock.
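A rough sketch of what an event loop sharded by stock could look like, written in Python for brevity (the same shape carries over to C#). The `PriceMonitor` class, the `shard_for` helper, the symbols and the prices are all made up for illustration, and order placement is reduced to appending to a list.

```python
from collections import defaultdict


class PriceMonitor:
    """Single-threaded dispatcher: rules are indexed by symbol, not by thread."""

    def __init__(self):
        self._rules = defaultdict(list)
        self.orders = []

    def add_rule(self, symbol, predicate, side):
        self._rules[symbol].append((predicate, side))

    def on_tick(self, symbol, price):
        remaining = []
        for predicate, side in self._rules[symbol]:
            if predicate(price):
                # Placeholder for sending a real order; the rule fires once.
                self.orders.append((side, symbol, price))
            else:
                remaining.append((predicate, side))
        self._rules[symbol] = remaining


def shard_for(symbol, shard_count):
    # Stable assignment of a symbol to one of `shard_count` monitors.
    return sum(symbol.encode()) % shard_count


monitor = PriceMonitor()
monitor.add_rule("ABC", lambda price: price <= 100.0, "BUY")

ticks = [("ABC", 101.5), ("XYZ", 99.0), ("ABC", 99.8), ("ABC", 99.0)]
for symbol, price in ticks:
    monitor.on_tick(symbol, price)

print(monitor.orders)
```

Each tick only touches the rules registered for its own symbol, so thousands of user algorithms do not need thousands of threads. If one loop becomes a bottleneck, ticks can be routed to one monitor per core using something like `shard_for`.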

Concurrent read and writers through cloned data structures?

I read this question but it didn't really help.
First and most important thing: time performances are the focus in the application that I'm developing
We have a client/server model (even distributed or cloud if we wish) and a data structure D hosted on the server. Each client request consists in:
1. Read something from D
2. Possibly write something on D
3. Possibly delete something on D
We can say that in this application the relation between the number of received operations can be described as delete<<write<<read. In addition:
Read ops cannot absolutely wait: they must be processed immediately
Write and delete can wait some time, but sooner is better.
From the description above, any lock-mechanism is not acceptable: this would imply that read operations could wait, which is not acceptable (sorry if I stress it so much, but it's really a crucial point).
Consistency is not necessary: if a write/delete operation has been performed and then a read operation doesn't see the write/delete effect it's not a big deal. It would be better, but it's not required.
The solution should be data-structure-independent, so it shouldn't matter if we write on a vector, list, map or Donald Trump's face.
The data structure could occupy a big amount of memory.
My solution so far:
We use two servers: the first server (called f) has Df, the second server (called s) has Ds updated.
f answers clients requests using Df and sends write/delete operations to s. Then s applies write/delete operations Ds sequentially.
At a certain point, all future client requests are redirected to s. At the same time, f copies s updated Ds into its Df.
Now, f and s roles are swapped: s will answer clients request using Ds and f will keep an updated version of Ds. The swapping process is periodically repeated.
Notice that I omitted on purpose A LOT of details for simplicity (for example, once the swap has been done, f has to finish all the pending client requests before applying the write/delete operations received from s in the meantime).
Why do we need two servers? Because the data structure is potentially too big to fit into one memory.
Now, my question is: is there a similar approach in the literature? I came up with this protocol in 10 minutes, and I find it strange that no (better) solution similar to this one has already been proposed!
PS: I may have forgotten some application specs; don't hesitate to ask for any clarification!

The scheme that you have works. I don't see any particular problem with it. This is basically how many databases run their HA solutions: they apply a log of writes to replicas. This model affords a great deal of flexibility in how the replicas are formed, accessed and maintained. Failovers are easy, too.
An alternative technique is to use persistent datastructures. Each write returns you a new and independent version of the data. All versions can be read in a stable and lock-free way. Versions can be kept or discarded at will. Versions share as much of the underlying state as possible.
Usually, trees underlie such persistent datastructures because it is easy to update a small part of the tree and reuse most of the old tree.
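As a small illustration of this, here is a persistent binary search tree using path copying. It is a minimal, unbalanced sketch and the names are made up; production-grade persistent structures use balanced or wide trees.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return a new tree containing `key`; the old tree is left untouched."""
    if root is None:
        return Node(key)
    if key < root.key:
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root  # key already present, reuse the whole subtree


def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


v1 = None
for k in (50, 30, 70, 20, 40):
    v1 = insert(v1, k)
v2 = insert(v1, 80)  # new version; v1 is still readable

print(contains(v1, 80), contains(v2, 80))  # False True
print(v2.left is v1.left)                  # True: the left subtree is shared
```

Only the nodes on the path from the root to the change are copied, so readers holding `v1` are never disturbed and need no lock.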
A reason you might not have found a more sophisticated approach is that your problem is extremely general: You want this to work with any data structure at all and the data can be big.
SQL Server Hekaton uses a quite sophisticated data structure to achieve lock-free, readable, point in time snapshots of any database contents. Maybe it's worth a look how they are doing it (they released a paper describing every detail of the system). They also allow for ACID transactions, serializability and concurrent writes. All lock-free.
At the same time, f copies s updated Ds into its Df.
This copy will take a long time because the data is big. It will block readers. A better approach is to apply the log of writes to the writable copy before accepting new writes there. That way reads can be accepted continuously.
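The log-based variant can be sketched as follows. This is a single-process, single-threaded model of the two copies with a hypothetical `SwappingStore` class; in the setup from the question the two copies would live on servers f and s.

```python
_DELETED = object()  # sentinel marking a delete in the write log


class SwappingStore:
    """Readers use `active`; writers update `standby`; roles swap periodically."""

    def __init__(self, initial):
        self.active = dict(initial)
        self.standby = dict(initial)
        self._log = []  # writes the current active copy has not seen yet

    def read(self, key):
        return self.active.get(key)  # never waits for writers

    def write(self, key, value):
        self.standby[key] = value
        self._log.append((key, value))

    def delete(self, key):
        self.standby.pop(key, None)
        self._log.append((key, _DELETED))

    def swap(self):
        # Redirect readers to the up-to-date copy ...
        self.active, self.standby = self.standby, self.active
        # ... then bring the old copy up to date by replaying the log, which
        # costs O(number of writes) rather than O(size of the data).
        for key, value in self._log:
            if value is _DELETED:
                self.standby.pop(key, None)
            else:
                self.standby[key] = value
        self._log.clear()


store = SwappingStore({"a": 1})
store.write("b", 2)
store.delete("a")
before = (store.read("a"), store.read("b"))  # readers still see the old state
store.swap()
after = (store.read("a"), store.read("b"))   # readers now see the writes
print(before, after)
```

After the swap both copies hold the same contents, and the former active copy is ready to accept new writes without a full copy of the data ever having been made.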
The switchover also is a short period where reads might have a slightly higher latency than normal.
