Why do we need total order across view changes in consensus protocols?


In their famous article, Miguel Castro and Barbara Liskov justify the commit phase of the PBFT consensus protocol like this:
This ensures that replicas agree on a total order for requests in the
same view but it is not sufficient to ensure a total order for
requests across view changes. Replicas may collect prepared
certificates in different views with the same sequence number and
different requests. The commit phase solves this problem as follows.
Each replica i multicasts <COMMIT, v, n, i>_{α_i} saying it has the
prepared certificate and adds this message to its log. Then each
replica collects messages until it has a quorum certificate with 2 f +
1 COMMIT messages for the same sequence number n and view v from
different replicas (including itself). We call this certificate the
committed certificate and say that the request is committed by the
replica when it has both the prepared and committed certificates.
But why exactly do we need to guarantee total order across view changes?
If a leader/primary replica fails and triggers a view change, wouldn't it suffice to discard everything from the previous view? What situation does the commit phase prevent that this solution does not?
Apologies if this is too obvious. I'm new to distributed systems and I haven't found any source which directly answers this question.

There is a conceptual reason for this. The system appears to a client as a black box. The whole idea of this box is to provide reliable access to some service, so it should mask the failure of any particular replica. If you discard everything at each view change, clients will constantly lose their data, so your solution simply contradicts the specification. The commit phase is needed exactly to prevent this kind of situation. If a request is "accepted" only when there are 2f + 1 COMMIT messages, then even if f replicas are faulty, the remaining nodes can recover all committed requests. This provides durable access to the system.
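The arithmetic behind the 2f + 1 threshold can be checked directly. This is a minimal sketch, assuming the usual PBFT setting of n = 3f + 1 replicas; the function names are made up for illustration and do not come from the paper.

```python
def min_quorum_overlap(f: int) -> int:
    """Smallest possible intersection of two quorums of size 2f + 1
    drawn from n = 3f + 1 replicas (assumed system size)."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    # Two sets of size q inside a universe of size n share at least 2q - n members.
    return 2 * quorum - n


def min_honest_in_overlap(f: int) -> int:
    """At most f of the replicas in the overlap can be faulty."""
    return min_quorum_overlap(f) - f


for f in range(1, 5):
    print(f, min_quorum_overlap(f), min_honest_in_overlap(f))
```

Any two quorums share at least f + 1 replicas, so at least one non-faulty replica is in both. That shared replica is what lets a later view learn about a request committed in an earlier one.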
There is also a technical reason. In theory the system is asynchronous, which means you can't even guarantee that a view change will occur only as a result of a failure. Some replicas may merely suspect that the leader is faulty and change the view. With your solution it is possible that the system discards everything it has accepted even if none of the replicas is faulty.
If you're new to distributed systems, I suggest you have a look at the classic protocols tolerating non-Byzantine failures (e.g., Paxos). They are simpler but solve the problems in a similar way.
Edit
When I say "clients constantly lose their data" I mean a bit more than it sounds. I'm talking about the impact of a particular client request on the system. Let's take a key-value store. A client A associates some value with some key via our "black box". The "black box" now orders this request with respect to any other concurrent (or simply parallel) requests. It then replicates it across all replicas and finally notifies A. Without the commit phase there is no ordering across views, and in two different views our "black box" can choose two different orders of execution of client requests. That being said, the following is possible:
1. at time t, A associates value with key and the "box" approves this,
2. at time t+1, B associates value_2 with key and the "box" approves this,
3. at time t+2, C reads value_2 from key,
4. view change (invisible to clients),
5. at time t+3, D reads value from key.
Note that (5) is possible not because the "box" is not aware of value_2 (as you mentioned, the value itself can be resubmitted) but because it is not aware that it first wrote value and then overwrote it with value_2. In the new view, the system needs to somehow order those two requests, but no luck: the decision is not coherent with the past.
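The scenario above can be reproduced with a toy key-value store. This is only an illustration: the `replay` helper and the two orderings are hypothetical and model nothing but the order in which the same two writes are executed.

```python
def replay(log):
    """Replay a list of (key, value) writes in order and return the final store."""
    store = {}
    for key, value in log:
        store[key] = value
    return store


write_a = ("key", "value")      # request from client A
write_b = ("key", "value_2")    # request from client B

order_in_old_view = [write_a, write_b]
order_in_new_view = [write_b, write_a]   # same requests, different order

print(replay(order_in_old_view)["key"])  # what C read before the view change
print(replay(order_in_new_view)["key"])  # what D reads after it
```

Both views know about both requests, yet they disagree on the result because they disagree on the order.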
Eventual synchrony is a way to guarantee liveness of the protocols; however, it cannot prevent the situations described above. Eventual synchrony states that eventually your system will behave much like a synchronous one, but you don't know when, and before that time any kind of weird thing can happen. If a safety property is violated during the asynchronous period, then obviously the whole system is not safe.

The output of PBFT should not be one log per view, but rather an ever-growing global log to which every view tries to contribute new 'blocks'.
The equivalent notion in a blockchain is that each block proposer, or block miner, must append to the current blockchain, instead of starting its new blockchain from scratch. I.e. new blocks must respect previous transactions, the same way new views must respect previous views.
If the total ordering is not consistent across views, then we lose the property above.
In fact if we force a view change after every sequence number in PBFT, it looks a lot like blockchain, but with a much more complicated recovery/safety mechanism (in part since PBFT blocks don't commit to the previous block, so we need to agree on each of them individually)

Related

Why cassandra doesn't use logical clocks?

From what I have read so far, Cassandra uses timestamps provided by the client or coordinator to resolve conflicts. If Cassandra receives a write for a cell which already exists, it picks the one with the higher timestamp.
In case of clock skew, even when there are no concurrent updates and even when using the ALL consistency level, it might still be the case that a client has updated a value and received an ACK from all servers. The actual value, however, was not updated, since the provided timestamp was older than that of the existing value in this cell (due to clock skew). Such behaviour violates causal consistency, which AFAIK R+W>N was supposed to provide?
It seems to me that a better solution would be to use logical clocks (Lamport/vector clocks) to pick the newest value, falling back to actual timestamps (or another strategy that can be provided by the client) only when a concurrent update is detected during read repair. AFAIK this is more or less the approach that Dynamo uses, right?
As I am probably missing something, can you let me know why Cassandra doesn't use such an approach?

Cassandra is an eventually consistent system, and when it was designed (at Facebook) the engineers had to decide how to handle conflicts. They had several options: Last Update Wins, a code handler to be invoked on conflict, delegating conflict resolution to clients, etc.
I guess they went with Last Update Wins due to its simplicity. It has several edge cases, but they were designing Cassandra for their own purposes and that approach was working for them.
The approach you are talking about is valid - the system returns all conflicting values to the client and the client decides what to do about them. It does add extra complexity to client code, which may not be a desired property.
Edit based on comment: why wall clock and not logical clock
Logical clocks (vector clocks) help to detect concurrent updates, but they won't help to actually decide how to resolve the conflict. E.g. if there are two concurrent updates to the same key, vector clocks will detect them, but there is no way to decide which one to use.
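To make the "detect but not resolve" point concrete, here is a small sketch of a vector clock comparison. The dict representation and the `compare` function are assumptions made for illustration, not code from Cassandra or Dynamo.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node_id: counter} dicts."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither clock dominates the other: the updates were concurrent.
    return "concurrent"


print(compare({"x": 1}, {"x": 2}))
print(compare({"x": 2}, {"x": 1, "y": 1}))
```

When the answer is "concurrent", the clocks give no hint about which value should win, so some other rule (or the client) still has to decide.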
Since Cassandra does not return conflicting versions (by design) and does not merge them, it needs a way to decide which record to use. The designers chose the Last Update Wins strategy. One option for this strategy is to use the wall clock to decide.
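A simplified model of wall-clock Last Update Wins shows the edge case raised in the question. The `LwwCell` class is hypothetical and only mimics the comparison rule; it is not how Cassandra is implemented.

```python
class LwwCell:
    """A single cell resolved by last-write-wins on a caller-supplied timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = -1

    def write(self, value, timestamp):
        # The write is always acknowledged, but only kept if it is newer.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
        return "ACK"


cell = LwwCell()
cell.write("first", timestamp=1_000)         # coordinator whose clock runs ahead
ack = cell.write("second", timestamp=990)    # later in real time, slower clock
print(ack, cell.value)
```

The second write is acknowledged but has no effect, because its timestamp is lower than the one already stored.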
p.s. A Lamport timestamp would provide a total order, but it requires a completely different flow of data in the system.

As per the CAP theorem, in case of a network partition, a strongly consistent system will have downtime. We know that logical clocks are strongly consistent, so in case of partitioning they will have downtime.
In a practical sense, when you implement a logical clock, you implement it using one of the quorum-based algorithms, which becomes unavailable on the side of the network partition that has fewer nodes. So during partitioning, in your example, with a logical clock, either A or B will take writes and the other node will not have access to the logical clock, becoming incapable of serving writes.
Cassandra developers had three choices:
1. Use a logical clock and drop availability during partitioning.
2. Use a wall clock and run into the problem you have called out.
3. Let the client choose 1 or 2. Let the client pass a large enough integer, which can be generated with 1 or 2 on the client side.
Cassandra went with 3, but also provides a server-side default of 2 to simplify clients that don't need a logical clock. How you can generate logical time on the client side that is the same size integer as a clock time (in millis) is a separate (solved) problem.

Do I need FIFO SQS for jira like board view app

Currently I am running a Jira-like board-stage-card management app on AWS ECS with 8 tasks. When a card is moved from one column/stage to another, I look up the current stage object for that card, remove the card from that stage, and add the card to the destination stage object. This is working so far because I always look up the card's actual stage in the Postgres database, rather than relying on which stage the frontend thinks the card belongs to.
Question:
Is it safe to say that even when multiple users move the same card to different stages, but query would still happen one after the other and data will not corrupt? (such as duplicates)
If there is still a chance data can be corrupted. Is it a good option to use SQS FIFO to send message to a lambda and handle each card movement in sequence ?
Any other reason I should use SQS in this case ? or is SQS not applicable at all here?

The most important question here is: what do you want to happen?
Looking at the state of a card in the database, and acting on that is only "wrong" if it doesn't implement the behavior you want. It's true that if the UI can get out of sync with the database, then users might not always get the result they were expecting - but that's all.
Consider likelihood and consequences:
How likely is it that two or more people will update the same card, at the same time, to different stages?
And what is the consequence if they do?
If the board is being used by a 20 person project team, then I'd say the chances were 'low/medium', and if they are paying attention to the board they'll see the unexpected change and have a discussion - because clearly they disagree (or someone moved it to the wrong stage by accident).
So in that situation, I don't think you have a massive problem - as long as the system behavior is what you want (see my further responses below). On the other hand, if your board solution is being used to help operate a nuclear missile launch control system then I don't think your system is safe enough :)
Is it safe to say that even when multiple users move the same card to
different stages, but query would still happen one after the other and
data will not corrupt? (such as duplicates)
Yes the query will still happen - on the assumption:
That the database query looks up the card based on some stable identifier (e.g. CardID), and
that having successfully retrieved the card, your logic moves it to whatever destination stage is specified - implying there are no rules or state machine that might prohibit certain specific state transitions (e.g. moving from stage 1 to 2 is ok, but moving from stage 2 to 1 is not).
Regarding your second question:
If there is still a chance data can be corrupted.
It depends on what you mean by 'corruption'. Data corruption is when unintended changes occur in data, and which usually make it unusable (un-processable, un-readable, etc) or useless (processable but incorrect). In your case it's more likely that your system would work properly, and that the data would not be corrupted (it remains processable, and the resulting state of the data is exactly what the system intended it to be), but simply that the results the users see might not be what they were expecting.
Is it a good option
to use SQS FIFO to send message to a lambda and handle each card
movement in sequence ?
A FIFO queue would only ensure that requests were processed in the order in which they were received by the queue. Whether or not this is "good" depends on the most important question (first sentence of this answer).
Assuming the assumptions I provided above are correct: there is no state machine logic being enforced, and the card is found and processed via its ID, then all that will happen is that the last request will be the final state. E.g.:
Card State: Card.CardID = 001; Stage = 1.
3 requests then get lodged into the FIFO queue in this order:
User A - Move CardID 001 to Stage 2.
User B - Move CardID 001 to Stage 4.
User C - Move CardID 001 to Stage 3.
Resulting Card State: Card.CardID = 001; Stage = 3.
That's "good" if you want the most recent request to be the result.
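The example above can be simulated with an in-memory queue. This is a sketch only: `collections.deque` stands in for the FIFO queue, and the `cards` dict stands in for the database table.

```python
from collections import deque

cards = {"001": 1}  # CardID -> current stage

queue = deque([
    ("User A", "001", 2),
    ("User B", "001", 4),
    ("User C", "001", 3),
])

while queue:
    user, card_id, stage = queue.popleft()   # strictly first in, first out
    cards[card_id] = stage                   # no state machine, just overwrite

print(cards["001"])
```

Whatever was enqueued last determines the final stage, regardless of what each user saw on screen when making the request.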
Any other reason I should use SQS in this case ? or is SQS not
applicable at all here?
The only thing I can think of is that you would be able to store a "history", that way users could see all the recent changes to a card. This would do two things:
Prove that the system processed the requests correctly (according to what it was told to do, and its logic).
Allow users to see who did what, and discuss.
To implement that, you just need to record all relevant changes to the card, in the right order. The thing is, the database can probably do that on its own, so use of SQS is still debatable; all the queue will do is maybe help avoid deadlocks.
Update - RE Duplicate Cards
You'd have to check the documentation for SQS to see if it can evaluate queue items and remove duplicates.
Assuming it doesn't, you'll have to build something to handle that separately. All I can think of right now is to check for duplicates before adding them to the queue - because once they are there it's probably too late.
One idea:
Establish a component in your code which acts as the proxy/façade for the queue.
Make it smart in that it knows about recent card actions ("recent" is whatever you think it needs to be).
When a new card action comes in, it does a quick check to see if it has any other "recent" duplicate card actions, and if yes, decides what to do.
One approach would be a very simple in-memory collection, and cycle out old items as fast as you dare to. "Recent", in terms of the lifetime of items in this collection, doesn't have to be the same as how long it takes for items to get through the queue - it just needs to be long enough to satisfy yourself there's no obvious duplicate.
I can see such a set-up working, but potentially being quite problematic - so if you do it, keep it as simple as possible. ("Simple" meaning: functionally as narrowly-focused as possible).
Sizing will be a consideration - how many items are you processing a minute?
Operational considerations - if it's in-memory it'll be easy to lose (service restarts or whatever), so design the overall system in such a way that if that part goes down, or the list is flushed, items still get added to the queue and things keep working regardless.
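The façade idea could look roughly like this. It is a sketch under assumptions: the class name, the `(card_id, stage)` duplicate key, and the injectable clock are all invented for illustration, and a "duplicate" is taken to mean the same card being moved to the same stage within the window.

```python
import time


class RecentActionFilter:
    """Remembers recent (card_id, stage) actions for `window` seconds."""

    def __init__(self, window: float, clock=time.monotonic):
        self.window = window
        self.clock = clock      # injectable so the filter can be tested
        self.seen = {}          # (card_id, stage) -> time first seen

    def should_enqueue(self, card_id: str, stage: int) -> bool:
        now = self.clock()
        # Cycle out entries older than the window.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.window}
        key = (card_id, stage)
        if key in self.seen:
            return False        # obvious recent duplicate
        self.seen[key] = now
        return True


recent = RecentActionFilter(window=5.0)
print(recent.should_enqueue("001", 2))
print(recent.should_enqueue("001", 2))
```

Because the state is only in memory, a restart simply empties the filter and every action is enqueued again, which matches the "keep working regardless" requirement above.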

While you are right that a FIFO queue would be best here, I think your design isn't ideal, or even workable, in some situations.
Let's say user 1 has an application state where the card is in stage 1 and he moves it to stage 2. An SQS message will indicate "move the card from stage 1 to stage 2". User 2 has the same initial state where card 1 is in stage 1. User 2 wants to move the card to stage 3, so an SQS message will contain the instruction "move the card from stage 1 to stage 3". But this won't work since you can't find the card in stage 1 anymore!
In this use case, I think a classic API design is best where an API call is made to request the move. In the above case, your API should error out indicating that the card is no longer in the state the user expected it to be in. The application can then reload the current state for that card and allow the user to try again.
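A minimal sketch of such an API call, assuming the client sends the stage it believes the card is in. `move_card` and `StaleCardState` are hypothetical names, and the dict stands in for the database; in a real system the check and the update would need to happen atomically (for example in one conditional UPDATE statement).

```python
class StaleCardState(Exception):
    """Raised when the card is no longer in the stage the caller expected."""


def move_card(cards: dict, card_id: str, expected_stage: int, new_stage: int) -> None:
    current = cards[card_id]
    if current != expected_stage:
        raise StaleCardState(f"card {card_id} is in stage {current}, not {expected_stage}")
    cards[card_id] = new_stage


cards = {"1": 1}
move_card(cards, "1", expected_stage=1, new_stage=2)      # user 1 succeeds
try:
    move_card(cards, "1", expected_stage=1, new_stage=3)  # user 2 is stale
except StaleCardState as err:
    print("reload and retry:", err)
```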

How are the missing events replayed?

I am trying to learn more about CQRS and Event Sourcing (Event Store).
My understanding is that a message queue/bus is not normally used in this scenario - a message bus can be used to facilitate communication between Microservices, however it is not typically used specifically for CQRS. However, the way I see it at the moment - a message bus would be very useful guaranteeing that the read model is eventually in sync hence eventual consistency e.g. when the server hosting the read model database is brought back online.
I understand that eventual consistency is often acceptable with CQRS. My question is: how does the read side know it is out of sync with the write side? For example, let's say there are 2,000,000 events created in Event Store on a typical day and 1,999,050 are also written to the read store. The remaining 950 events are not written because of a software bug somewhere or because the server hosting the read model is offline for a few seconds, etc. How does eventual consistency work here? How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
I have read questions on here over the last week or so which talk about messages being replayed from the event store, e.g. this one: CQRS - Event replay for read side; however, none talk about how this is done. Do I need to set up a scheduled task that runs once per day and replays all events that were created since the date the scheduled task last succeeded? Is there a more elegant approach?

I've used two approaches in my projects, depending on the requirements:
Synchronous, in-process Readmodels. After the events are persisted, in the same request lifetime, in the same process, the Readmodels are fed with those events. In case of a Readmodel's failure (bug or catchable error/exception) the error is logged and that Readmodel is just skipped and the next Readmodel is fed with the events and so on. Then follow the Sagas, that may generate commands that generate more events and the cycle is repeated.
I use this approach when the impact of a Readmodel's failure is acceptable by the business, when the readiness of a Readmodel's data is more important than the risk of failure. For example, they wanted the data immediately available in the UI.
The error log should be easily accessible on some admin panel so someone would look at it in case a client reports inconsistency between write/commands and read/query.
This also works if you have your Readmodels coupled to each other, i.e. one Readmodel needs data from another canonical Readmodel. Although this seems bad, it's not, it always depends. There are cases when you trade updater code/logic duplication with resilience.
Asynchronous, in-another-process Readmodel updater. This is used when I want total separation of the Readmodel from the other Readmodels, so that a Readmodel's failure does not bring the whole read side down; or when a Readmodel needs another language, different from the monolith. Basically this is a microservice. When something bad happens inside a Readmodel it is necessary that some authoritative higher-level component is notified, i.e. an Admin is notified by email or SMS or whatever.
The Readmodel should also have a status panel, with all kinds of metrics about the events that it has processed, whether there are gaps, and whether there are errors or warnings; it should also have a command panel where an Admin could rebuild it at any time, preferably without system downtime.
In any approach, the Readmodels should be easily rebuildable.
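The first approach above (synchronous, in-process) can be sketched as follows. `feed_readmodels` and the handler-per-Readmodel shape are assumptions for illustration; the point is only the "log and skip" behaviour on failure.

```python
import logging

log = logging.getLogger("readmodels")


def feed_readmodels(readmodels, events):
    """Feed freshly persisted events to every Readmodel in the same process.

    A failing Readmodel is logged and skipped so the others still update.
    Returns the names of the Readmodels that failed.
    """
    failed = []
    for name, handler in readmodels.items():
        try:
            for event in events:
                handler(event)
        except Exception:
            log.exception("readmodel %s failed", name)
            failed.append(name)
    return failed
```

The returned list (or the log) is what an admin panel would surface when a client reports an inconsistency between the write and read sides.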
How would you choose between a pull approach and a push approach? Would you use a message queue with a push (events)
I prefer the pull based approach because:
it does not use another stateful component like a message queue, another thing that must be managed, that consumes resources and that can (so it will) fail
every Readmodel consumes the events at the rate it wants
every Readmodel can easily change at any moment what event types it consumes
every Readmodel can easily be rebuilt at any time by requesting all the events from the beginning
the order of events is exactly the same as in the source of truth, because you pull from the source of truth
There are cases when I would choose a message queue:
you need the events to be available even if the Event store is not
you need competing/parallel consumers
you don't want to track what messages you consume; as they are consumed they are removed automatically from the queue

This talk from Greg Young may help.
How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
So there are two different approaches here.
One is perhaps simpler than you expect - each time you need to rebuild a read model, just start from event 0 in the stream.
Yeah, the scale on that will eventually suck, so you won't want that to be your first strategy. But notice that it does work.
For updates with not-so-embarrassing scaling properties, the usual idea is that the read model tracks metadata about the stream position used to construct the previous model. Thus, the query from the read model becomes "What has happened since event #1,999,050?"
In the case of event store, the call might look something like
EventStore.ReadStreamEventsForwardAsync(stream, 1999050, 100, false)

Application doesn't know it hasn't processed some events due to a bug.
First of all, I don't understand why you assume that the number of events written on the write side must equal number of events processed by read side. Some projections may subscribe to the same event and some events may have no subscriptions on the read side.
In case of a bug in a projection or in the infrastructure that resulted in a certain projection being invalid, you might need to rebuild this projection. In most cases this would be a manual intervention that resets the checkpoint of the projection to 0 (the beginning of time), so the projection picks up all events from the event store from scratch and reprocesses all of them.

The event store should have a global sequence number across all events starting, say, at 1.
Each projection has a position tracking where it is along the sequence number. The projections are like logical queues.
You can clear a projection's data and reset the position back to 0 and it should be rebuilt.
In your case the projection fails for some reason, like the server going offline, at position 1,999,050 but when the server starts up again it will continue from this point.
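The position tracking described above can be sketched like this. The `Projection` class, the `seq` field, and the in-memory list used as the event log are all assumptions for illustration and are not tied to any particular event store client.

```python
class Projection:
    """A read model that remembers the last global sequence number it applied."""

    def __init__(self):
        self.position = 0      # 0 means "nothing processed yet"
        self.count = 0         # toy read-model state: number of events seen

    def catch_up(self, event_log, batch_size=100):
        """Pull events after `position` in batches until the log is exhausted."""
        while True:
            batch = [e for e in event_log if e["seq"] > self.position][:batch_size]
            if not batch:
                return
            for event in batch:
                self.count += 1
                self.position = event["seq"]

    def rebuild(self, event_log):
        """Clear the projection's data and replay from the start."""
        self.position, self.count = 0, 0
        self.catch_up(event_log)


event_log = [{"seq": i} for i in range(1, 251)]
projection = Projection()
projection.catch_up(event_log[:200])   # server goes offline at position 200
projection.catch_up(event_log)         # after restart it continues from there
print(projection.position, projection.count)
```

Nothing has to "know" that events were missed: the projection just asks for everything after its stored position whenever it runs.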

Is there some algorithm for R/W lock graphs?

Suppose we have resources A, B, C and their dependencies are not cyclic:
B->A
C->A
This means B strongly depends on A and C strongly depends on A. For example: B and C are resources precomputed from A. So if A is updated, B and C should be updated too. But if B is updated, nothing changes except B.
Now for the problem: considering that each node of the graph can be accessed for Read, Write, or Read with upgrade to Write in a multi-threaded manner, how is one supposed to manage locks in such a graph? Is there a generalization of this problem?
Update
Sorry for the unclear question. There is also one very important thing:
If, for example, A changes and forces B and C to be updated, then the write lock is freed the moment B and its dependencies have been updated.

Your question is a blend of transaction - locking - concurrency - conflict resolution. Therefore models used in relational databases might serve your purpose.
There are many methods defined for concurrency control.
In your case some might apply, depending on how optimistic or pessimistic your algorithm needs to be, how many reads or writes there are, and what the amount of data per transaction is.
I can think of the two methods that can help in your case:
1. Strict Two-Phase Locking (SS2PL or S2PL)
A transaction begins, and locks on A, B, C are obtained and kept until the end of the transaction. Because multiple locks are held until the end of the transaction, a deadlock condition might be encountered while acquiring them. Locks can change during the transaction time.
This approach is serializable, meaning that all events come in order and no other party can make any changes while the transaction holds.
This approach is pessimistic and locks might hold for a good amount of time, thus resources and time will be spent.
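A simplified sketch of the locking approach, with two loud assumptions: it uses plain mutexes (`threading.Lock`) rather than separate read/write locks, and it acquires every lock up front in a fixed sorted order, which is one common way to sidestep the deadlock mentioned above. `run_transaction` is a hypothetical helper.

```python
import threading
from contextlib import ExitStack

locks = {name: threading.Lock() for name in ("A", "B", "C")}


def run_transaction(resource_names, body):
    """Acquire every lock in a fixed (sorted) order, hold them all
    until `body` finishes, then release them together."""
    with ExitStack() as stack:
        for name in sorted(resource_names):
            stack.enter_context(locks[name])
        return body()


data = {"A": 1, "B": 0, "C": 0}


def update_a_and_dependents():
    data["A"] += 1
    data["B"] = data["A"] * 2   # B and C are precomputed from A
    data["C"] = data["A"] * 3
    return data["A"]


print(run_transaction(["C", "A", "B"], update_a_and_dependents))
```

Because every thread requests the locks in the same order, two transactions can never each hold a lock the other is waiting for.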
2. Multiversion
Instead of placing locks on A, B, C, maintain version numbers and create a snapshot of each. All changes will be done to snapshots. At the end, all snapshots will replace the previous versions. If any version of A, B and C has changed then an error condition occurs and changes are discarded.
This approach does not place read or write locks, meaning that it will be fast. But in case of conflicts, if any version has changed in the interim, the data will be discarded.
This is optimistic but might spend much more resources in favor of speed.
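The version check at the end of such a transaction can be sketched as below. `Versioned`, `Conflict`, and `commit` are illustrative names, and in multi-threaded code the `commit` step itself would have to run atomically (for example under one short-lived lock), which this sketch leaves out.

```python
class Versioned:
    def __init__(self, value):
        self.value = value
        self.version = 0


class Conflict(Exception):
    pass


def commit(resources, snapshot_versions, new_values):
    """Install new_values only if no resource changed since the snapshot."""
    for name, seen in snapshot_versions.items():
        if resources[name].version != seen:
            raise Conflict(name)          # someone else committed first
    for name, value in new_values.items():
        resources[name].value = value
        resources[name].version += 1


resources = {"A": Versioned(1), "B": Versioned(2)}
snapshot = {name: r.version for name, r in resources.items()}
commit(resources, snapshot, {"A": 10, "B": 20})     # succeeds
try:
    commit(resources, snapshot, {"A": 99})          # stale snapshot
except Conflict as err:
    print("discarded, conflict on", err)
```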
Transaction log
In database systems there is also the concept of a "transaction log". This means that any transaction, whether completed or pending, will be present in the "transaction log". So every operation done in any of the above methods is first done to the transaction log. Operations from the log will be materialized at the right moment in the main store. In case of failures the log is analyzed, completed transactions are materialized to the main store and the pending ones are just discarded.
This is used also in "log shipping" in order to ship the log to other servers for the purpose of replication.
Known Implementations
There are multiple in-memory databases that might prevent some hassle with implementing your own solution.
H2 provides also serializable isolation level that can match your use case.
go-memdb provides multiversion concurrency. This one uses an immutable radix tree algorithm, therefore you can look also into this one for details if you are searching to build your own solution.
Many more are defined here.

I am not aware of a specific pattern here; so my solution would go like this:
First of all, I would reverse the edges in your graph. You don't care that A is a dependency for B; meaning: the other direction is telling you what is required to lock on:
A->B
A->C
Because now you can say: if I want to do X on A, I need the X lock on A, and any object depending on A.
And now you can go; inspect A, and the objects depending on A; ... and so forth to determine the set of objects you need an X lock on.
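Determining that set is a plain graph traversal over the reversed edges. A minimal sketch, assuming the graph is stored as a dict from a resource to its direct dependents (`lock_set` is a hypothetical name):

```python
def lock_set(dependents, start):
    """Return `start` plus everything that transitively depends on it."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dependents.get(node, ()))
    return seen


dependents = {"A": ["B", "C"]}   # reversed edges: A->B, A->C
print(sorted(lock_set(dependents, "A")))
print(sorted(lock_set(dependents, "B")))
```

A change to A needs the lock on all three resources, while a change to B needs it only on B.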
Regarding your comment: Because X in this case is either Read or UpgradedWrite, and if A needs Write it doesn't clearly mean that B needs it too. ... for me that translates to: the whole "graph idea" doesn't help. You see, such a graph is only useful to express such direct relations, such as "if a then b". If there is an edge between A and B, then that means that you would want to treat them "the same way". If you are now saying that your objects might or might not need to be both write locked - what would be the point of this graph then? Because then you end up with a lot of actually independent objects, and sometimes a write to A needs a write lock on something else, and sometimes not.

EventSourcing race condition

Here is a nice article which describes what ES is and how to deal with it.
Everything is fine there, but one image is bothering me. Here it is
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we ensure that we don't book more seats than available? This is especially a problem if there are many concurrent requests.
It may happen that n aggregates are populated with the same amount of reserved seats, and all of these aggregate instances allow reservations.

I understand that in distributed event-based systems we are able to achieve eventual consistency only, anyway ... How do we not allow booking more seats than we have? Especially in terms of many concurrent requests?
All events are private to the command running them until the book of record acknowledges a successful write. So we don't share the events at all, and we don't report back to the caller, without knowing that our version of "what happened next" was accepted by the book of record.
The write of events is analogous to a compare-and-swap of the tail pointer in the aggregate history. If another command has changed the tail pointer while we were running, our swap fails, and we have to mitigate/retry/fail.
In practice, this is usually implemented by having the write command to the book of record include an expected position for the write. (Example: ES-ExpectedVersion in GES).
The book of record is expected to reject the write if the expected position is in the wrong place. Think of the position as a unique key in a table in a RDBMS, and you have the right idea.
This means, effectively, that the writes to the event stream are actually consistent -- the book of record only permits the write if the position you write to is correct, which means that the position hasn't changed since the copy of the history you loaded was written.
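The expected-position check can be sketched with an in-memory stream. `EventStream` and `WrongExpectedVersion` here are illustrative stand-ins, not the API of any real event store, and a real book of record would perform the check and the append as one atomic operation.

```python
class WrongExpectedVersion(Exception):
    pass


class EventStream:
    def __init__(self):
        self.events = []

    def append(self, new_events, expected_version):
        """Append only if the stream is still at expected_version."""
        if len(self.events) != expected_version:
            raise WrongExpectedVersion(
                f"expected {expected_version}, stream is at {len(self.events)}"
            )
        self.events.extend(new_events)


stream = EventStream()
stream.append(["SeatsAdded(1)"], expected_version=0)

# Two commands load the same history (version 1) and both try to take the last seat.
stream.append(["SeatReserved(alice)"], expected_version=1)
try:
    stream.append(["SeatReserved(bob)"], expected_version=1)
except WrongExpectedVersion as err:
    print("bob must reload and retry:", err)
```

Only one of the competing writes lands; the loser reloads the history, sees that no seats remain, and fails the command instead of overbooking.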
It's typical for commands to read event streams directly from the book of record, rather than the eventually consistent read models.
It may happen that n-AggregateRoots will be populated with the same amount of reserved seats, it means having validation in the reserve method won't help, though. Then n-AggregateRoots will emit the event of successful reservation.
Every bit of state needs to be supervised by a single aggregate root. You can have n different copies of that root running, all competing to write to the same history, but the compare and swap operation will only permit one winner, which ensures that "the" aggregate has a single internally consistent history.

There are going to be a couple of ways to deal with such a scenario.
First off, an event stream would have as its current version the version of the last event added. This means that you would not, or should not, be able to persist the event stream if it is no longer at the version it had when loaded. Since the very first write would cause the version of the event stream to be increased, the second write would not be permitted. Since events are not emitted, per se, but are rather a result of the event sourcing, we would not have the type of race condition in your example.
Well, if your commands are processed behind a queue any failures should be retried. Should it not be possible to process the request you would enter the normal "I'm sorry, Dave. I'm afraid I can't do that" scenario by letting the user know that they should try something else.
Another option is to start the processing by issuing an update against some table row to serialize any calls to the aggregate. Probably not the most elegant but it does cause a system-wide block on the processing.
I guess, to a large extent, one cannot really trust the read store when it comes to transactional processing.
Hope that helps :)
