While researching CouchDB's durability, I found that CouchDB uses a crash-only design to achieve durability. But I don't understand the relationship between crash-only and durability.
From the CouchDB wiki:
The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state. This is a "crash-only" design where the CouchDB server does not go through a shut down process, it's simply terminated.
Durability is given by the fact that the DB is always in a consistent state, and that in turn is given by the fact that the structure of the DB is append-only (CouchDB never overwrites committed data or associated structures). This makes error handling quite easy: the server can crash instantly if there is an error.
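To make the append-only point concrete, here is a minimal sketch (my own illustration in TypeScript/Node, not CouchDB's actual implementation) of an append-plus-fsync commit: new data only ever goes after the current tail, so a crash at any moment leaves the previously committed state intact.

    // Minimal append-only commit sketch (illustrative, not CouchDB code).
    import { openSync, writeSync, fsyncSync, closeSync } from "fs";

    function appendCommit(path: string, record: Buffer): void {
      const fd = openSync(path, "a");  // append mode: existing bytes are never overwritten
      try {
        writeSync(fd, record);         // the new revision lands after the current tail
        fsyncSync(fd);                 // durable once fsync returns; a crash before this
                                       // point simply loses the uncommitted tail
      } finally {
        closeSync(fd);
      }
    }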
I don't think it's the "crash-only" design that gives durability; I think durability is what permits the use of "crash-only".
Doing the opposite would mean trying to be clever and adding error-recovery code. That requires you to correctly identify the error and to be correct in your assumptions about the recovery algorithm. Every part of the recovery process may introduce bugs: you may think the error is of a certain type when it's really another, or new unexpected errors may happen while you're already doing recovery.
Error recovery also means not only trying to redo the failed transaction. You must also find the original error, which is probably from some unexpected program or hardware state, and fix that state. Otherwise the same error might happen again.
Crash-only lowers the probability of bugs: you don't need to find all the edge cases where something went wrong, and your system administrator can easily be notified about the error (which may be a hardware error!). With this in mind, crash-only can be a sound software design principle in some cases. At the very least it makes it easier to guarantee your data integrity.
From what I have read so far, Cassandra uses timestamps provided by the client or coordinator to resolve conflicts. If Cassandra receives a write for a cell that already exists, it keeps the one with the higher timestamp.
In the case of clock skew, even when there are no concurrent updates and the ALL consistency level is used, it might still happen that a client has updated a value and received an ACK from all servers, yet the actual value was not updated because the provided timestamp was older than the one already stored in the cell (due to the clock skew). Such behaviour violates causal consistency, which AFAIK R + W > N was supposed to provide?
It seems to me that using logical clocks (Lamport/vector clocks) to pick the newest value, and falling back to actual timestamps (or another strategy that can be provided by the client) only when a concurrent update is detected during read repair, would be a better solution. AFAIK this is more or less the approach that Dynamo uses, right?
As I am probably missing something, can you let me know why Cassandra doesn't use such an approach?
Cassandra is an eventually consistent system, and when it was designed (at Facebook) the engineers had to decide how to handle conflicts. They had several options: Last Update Wins, a code handler invoked on conflict, delegating conflict resolution to clients, etc.
I guess they went with Last Update Wins for simplicity. It has several edge cases, but they were designing Cassandra for their own purposes and that approach worked for them.
The approach you are talking about is valid - the system returns all conflicting values to the client and the client decides what to do about them. It does add extra complexity to client code, which may not be a desired property.
Edit based on comment: why wall clock and not logical clock
Logical clocks (vector clocks) help to detect concurrent updates, but they won't help to actually decide how to resolve the conflict. E.g. if there are two concurrent updates to the same key, the vector clock will detect them, but there is no way to decide which one to use.
Since Cassandra does not return conflicting versions (by design) and does not merge them, it needs a way to decide which record to use. They decided to use the Last Update Wins strategy, and one option for that strategy is to use the wall clock to decide.
p.s. A Lamport timestamp would provide a total order, but it requires a completely different flow of data in the system.
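To illustrate that point with a sketch of my own (TypeScript, not Cassandra code): last-write-wins needs nothing more than a timestamp comparison, whereas a vector-clock comparison can only report that two updates are concurrent and still leaves the actual choice open.

    // Last-write-wins vs. vector clocks, as an illustration only.
    type Cell = { value: string; timestamp: number };  // timestamp supplied by client or coordinator

    function lastWriteWins(a: Cell, b: Cell): Cell {
      return b.timestamp > a.timestamp ? b : a;        // clock skew can make the "older" write win
    }

    type VClock = Record<string, number>;              // nodeId -> counter

    function compareVClocks(a: VClock, b: VClock): "a<=b" | "b<=a" | "concurrent" {
      const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
      let aLeq = true, bLeq = true;
      for (const n of nodes) {
        if ((a[n] ?? 0) > (b[n] ?? 0)) aLeq = false;
        if ((b[n] ?? 0) > (a[n] ?? 0)) bLeq = false;
      }
      return aLeq ? "a<=b" : bLeq ? "b<=a" : "concurrent";  // "concurrent" still needs a tie-breaker
    }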
As per the CAP theorem, in the case of a network partition a strongly consistent system will have downtime. Logical clocks are strongly consistent, so in the case of a partition they will have downtime.
In a practical sense, when you implement a logical clock you implement it using a quorum-based algorithm, which becomes unavailable to the side of the network partition with fewer nodes. So during a partition, in your example, with a logical clock either A or B will take writes and the other node will not have access to the logical clock, becoming incapable of serving writes.
Cassandra developers had three choices:
1. Use a logical clock and give up availability during partitions.
2. Use a wall clock and run into the problem you have called out.
3. Let the client choose 1 or 2: let the client pass a large enough integer, which can be generated with 1 or 2 on the client side.
Cassandra went with 3, but also provides a server-side default of 2 to simplify clients that don't need a logical clock. How to generate logical time on the client side as an integer the same size as a clock time (in millis) is a separate (solved) problem.
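As a sketch of what option 3 can look like on the client side (my own illustration, not a Cassandra driver API): take the maximum of the wall clock and the last issued value plus one, which stays monotonic on that client across small clock skews while remaining the same size integer as a microsecond wall-clock timestamp.

    // Client-side monotonic write timestamps (illustrative sketch).
    let lastIssued = 0n;

    function nextWriteTimestamp(): bigint {
      const wall = BigInt(Date.now()) * 1000n;           // microseconds, Cassandra's usual resolution
      lastIssued = wall > lastIssued ? wall : lastIssued + 1n;
      return lastIssued;                                 // pass this along with the write
    }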
In their famous article, Miguel Castro and Barbara Liskov justify the commit phase of the PBFT consensus protocol like this:
This ensures that replicas agree on a total order for requests in the same view but it is not sufficient to ensure a total order for requests across view changes. Replicas may collect prepared certificates in different views with the same sequence number and different requests. The commit phase solves this problem as follows. Each replica i multicasts <COMMIT, v, n, i>_{α_i} saying it has the prepared certificate and adds this message to its log. Then each replica collects messages until it has a quorum certificate with 2f + 1 COMMIT messages for the same sequence number n and view v from different replicas (including itself). We call this certificate the committed certificate and say that the request is committed by the replica when it has both the prepared and committed certificates.
But why exactly do we need to guarantee total order across view changes?
If a leader/primary replica fails and triggers a view change, wouldn't it suffice to discard everything from the previous view? What situation does the commit phase prevent that this solution does not?
Apologies if this is too obvious. I'm new to distributed systems and I haven't found any source which directly answers this question.
There is a conceptual reason for this. The system appears to a client as a black box. The whole idea of this box is to provide reliable access to some service; thus it should mask the failures of particular replicas. Otherwise, if you discard everything at each view change, clients will constantly lose their data, so your solution simply contradicts the specification. The commit phase is needed exactly to prevent this kind of situation: if a request is "accepted" only when there are 2f + 1 COMMIT messages, then even if f replicas are faulty, the remaining nodes can recover all committed requests. This provides durable access to the system.
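As a rough sketch of that quorum rule (the names are mine, not code from the paper): a replica treats a request as committed once it holds the prepared certificate plus 2f + 1 COMMIT messages from distinct replicas for the same view and sequence number.

    // Commit-certificate check, illustrative only.
    type Commit = { view: number; seq: number; replicaId: number };

    function isCommitted(commits: Commit[], view: number, seq: number, f: number,
                         hasPreparedCertificate: boolean): boolean {
      const senders = new Set(
        commits.filter(c => c.view === view && c.seq === seq).map(c => c.replicaId)
      );
      return hasPreparedCertificate && senders.size >= 2 * f + 1;  // quorum of distinct replicas
    }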
There is also a technical reason. In theory the system is asynchronous, which means that you can't even guarantee that a view change will occur only as a result of a failure. Some replicas may merely suspect that the leader is faulty and change the view. With your solution it is possible that the system discards everything it has accepted even if none of the replicas is faulty.
If you're new to distributed systems, I suggest you have a look at the classic protocols tolerating non-Byzantine failures (e.g. Paxos); they are simpler but solve the problems in a similar way.
Edit
When I say "clients constantly lose their data" it is a bit more than it sounds. I'm talking about the impact of a particular client request on the system. Let's take a key-value store. A client A associates some value with some key via our "black box". The "black box" now orders this request with respect to any other concurrent (or simply parallel) requests. It then replicates it across all replicas and finally notifies A. Without the commit phase there is no such ordering, and in two different views our "black box" can choose two different orders of execution of client requests. That being said, the following is possible:
1. at time t, A associates value with key and the "box" approves this,
2. at time t+1, B associates value_2 with key and the "box" approves this,
3. at time t+2, C reads value_2 from key,
4. a view change happens (invisible to clients),
5. at time t+3, D reads value from key.
Note that (5) is possible not because the "box" is not aware of value_2 (as you mentioned, the value itself can be resubmitted) but because it is not aware that it previously first wrote value and then overwrote it with value_2. In the new view, the system somehow needs to order those two requests, but no luck: the decision is not coherent with the past.
Eventual synchrony is a way to guarantee liveness of the protocols; however, it cannot prevent the situations described above. Eventual synchrony states that eventually your system will behave much like a synchronous one, but you don't know when, and before that time any kind of weird thing can happen. If a safety property is violated during the asynchronous period, then obviously the whole system is not safe.
The output of PBFT should not be one log per view, but rather an ever-growing global log to which every view tries to contribute new 'blocks'.
The equivalent notion in a blockchain is that each block proposer, or block miner, must append to the current blockchain, instead of starting its new blockchain from scratch. I.e. new blocks must respect previous transactions, the same way new views must respect previous views.
If the total ordering is not consistent across views, then we lose the property above.
In fact, if we force a view change after every sequence number in PBFT, it looks a lot like a blockchain, but with a much more complicated recovery/safety mechanism (in part because PBFT blocks don't commit to the previous block, so we need to agree on each of them individually).
I was wondering if anyone has done any perf tests around the effect that calling EF Core's SaveChangesAsync() has on performance if there are no changes to be saved.
Essentially I am assuming it's basically nothing and therefore isn't a big deal to call it "just in case"?
(I am trying to track user activity in middleware in ASP.NET Core, and essentially on the way out I want to make sure SaveChanges was called so the activity is persisted to the database. There is a chance that it has already been called on the context, depending on the user's operation, and in that case I don't want to incur the cost of a second operation when the activity could have been persisted as part of the normal transaction/round trip.)
As you can see in the implementation, if there are no changes, nothing will be done. How much impact that has on performance, I don't know. But of course calling SaveChanges or SaveChangesAsync without any changes still has some cost compared to not calling them at all.
That's the same behavior as in EF6.
Here is a nice article which describes what ES (event sourcing) is and how to deal with it.
Everything is fine there, but one image is bothering me. Here it is:
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we ensure that we don't book more seats than available? This is especially a problem if there are many concurrent requests.
It may happen that n aggregates are populated with the same amount of reserved seats, and all of these aggregate instances allow reservations.
I understand that in distributed event-based systems we are able to achieve eventual consistency only, anyway ... how do we avoid booking more seats than we have, especially with many concurrent requests?
All events are private to the command running them until the book of record acknowledges a successful write. So we don't share the events at all, and we don't report back to the caller, without knowing that our version of "what happened next" was accepted by the book of record.
The write of events is analogous to a compare-and-swap of the tail pointer in the aggregate history. If another command has changed the tail pointer while we were running, our swap fails, and we have to mitigate/retry/fail.
In practice, this is usually implemented by having the write command to the book of record include an expected position for the write. (Example: ES-ExpectedVersion in GES).
The book of record is expected to reject the write if the expected position is in the wrong place. Think of the position as a unique key in a table in an RDBMS, and you have the right idea.
This means, effectively, that the writes to the event stream are actually consistent -- the book of record only permits the write if the position you write to is correct, which means that the position hasn't changed since the copy of the history you loaded was written.
It's typical for commands to read event streams directly from the book of record, rather than the eventually consistent read models.
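Here is a minimal in-memory sketch of that compare-and-swap style append (TypeScript; the names are illustrative rather than the GES client API): the write carries the expected stream position and is rejected if the tail has moved since the history was loaded.

    // Expected-version append: a toy book of record, for illustration only.
    type Event = { type: string; data: unknown };

    class EventStore {
      private streams = new Map<string, Event[]>();

      append(streamId: string, expectedVersion: number, events: Event[]): void {
        const stream = this.streams.get(streamId) ?? [];
        if (stream.length !== expectedVersion) {
          // someone else moved the tail since we loaded the history:
          // reject, and let the command reload, re-check seats, and retry
          throw new Error(`concurrency conflict: expected ${expectedVersion}, actual ${stream.length}`);
        }
        this.streams.set(streamId, [...stream, ...events]);
      }
    }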
It may happen that n AggregateRoots are populated with the same amount of reserved seats, which means having validation in the reserve method won't help. Then n AggregateRoots will emit the event of a successful reservation.
Every bit of state needs to be supervised by a single aggregate root. You can have n different copies of that root running, all competing to write to the same history, but the compare and swap operation will only permit one winner, which ensures that "the" aggregate has a single internally consistent history.
There are going to be a couple of ways to deal with such a scenario.
First off, an event stream has the version of the last event added as its current version. This means that you would not, or should not, be able to persist the event stream if it is no longer at the version it was at when loaded. Since the very first write causes the version of the event stream to be increased, the second write would not be permitted. Since events are not emitted, per se, but rather are a result of the event sourcing, we would not have the type of race condition in your example.
Well, if your commands are processed behind a queue, any failures should be retried. Should it not be possible to process the request, you would enter the normal "I'm sorry, Dave. I'm afraid I can't do that" scenario by letting the user know that they should try something else.
Another option is to start the processing by issuing an update against some table row to serialize any calls to the aggregate. Probably not the most elegant but it does cause a system-wide block on the processing.
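For illustration only (the table and column names are made up, using the node-postgres client rather than any specific framework): the UPDATE takes a row lock on the aggregate's row, so concurrent commands for the same aggregate queue up behind it until COMMIT.

    // Serializing commands per aggregate via a row lock (illustrative sketch).
    import { Client } from "pg";

    async function withAggregateLock(client: Client, aggregateId: string,
                                     work: () => Promise<void>): Promise<void> {
      await client.query("BEGIN");
      try {
        // any other transaction doing the same UPDATE for this id blocks here until we COMMIT
        await client.query(
          "UPDATE aggregate_locks SET touched_at = now() WHERE aggregate_id = $1",
          [aggregateId]
        );
        await work();  // load events, validate seat availability, append new events
        await client.query("COMMIT");
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      }
    }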
I guess, to a large extent, one cannot really trust the read store when it comes to transactional processing.
Hope that helps :)
I am load testing my node.js application. At some point I reach a state where requests are pending, and my best guess is that it's because of a locked transaction. This is the last log statement:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
And in pg_locks I've got 4 rows for the above query which are GRANTED = true, with mode ExclusiveLock.
Where should I start looking for a bug?
If the request that takes this lock performs a lot of insert and update operations, should the isolation level be REPEATABLE READ?
Is there any way to debug/diagnose that kind of situation?
Is there any mechanism to time out those locks so the app can be easily/automatically released and not block further requests?
Side question (since I'm not looking for a tool directly): are there any tools to monitor and spot that kind of situations? (I was hoping to use Munin.)
I am using node.js 4.2.1 with express 4.13.3 and sequelize 3.19.3 as the ORM for Postgres 9.4.1.
Welcome to PostgreSQL transaction locks hell :)
You can spend a lot of time trying to figure out where exactly the lock happens and why. But there is very little chance that it will help you resolve the situation.
The general recipe for solving this kind of situation is as follows:
Keep your transaction size to the bare minimum required by the business logic of your application. For example, avoid sequences of same-type inserts or updates, replacing them with multi-row equivalents, because query IO is expensive.
Do not use transactions while executing only a single query that modifies data, i.e. avoid unnecessary transactions.
Implement error handling that can detect a transaction lock and retry the transaction. Logging such retries will help you understand the weak spots of your system and how to redesign it better.
Even in a well-engineered system the last step often becomes a necessity; don't let it scare you ;)
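As a sketch of the third point (the error codes are Postgres-specific, and runTx stands for whatever your ORM gives you to run a transaction; this is not a drop-in implementation): retry a bounded number of times on lock/serialization conflicts and log each retry.

    // Retry a transaction on lock/serialization conflicts (illustrative sketch).
    async function withRetries<T>(runTx: () => Promise<T>, maxAttempts = 3): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try {
          return await runTx();
        } catch (err: any) {
          // 40001 = serialization_failure, 40P01 = deadlock_detected
          const retryable = err?.code === "40001" || err?.code === "40P01";
          if (!retryable || attempt >= maxAttempts) throw err;
          console.warn(`transaction conflict, retry ${attempt}/${maxAttempts}: ${err.message}`);
        }
      }
    }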
I encountered a similar situation where I started 5 parallel transactions requesting the same update lock, and the first one also continued with work that required more Postgres calls. The entire system deadlocked; the first transaction was listed as idle in transaction in pg_stat_activity and was granted all the locks it had requested according to pg_locks.
What I think is happening:
The first transaction got the lock granted and then finished its query. After this it drops its connection to Postgres.
The following 4 transactions each open a connection and block on the lock that is held by the first transaction.
Since they are blocked, when the first transaction gets to execute and tries to get a connection to Postgres to make another query, it deadlocks, because Sequelize has run out of connections.
When I changed my Sequelize initialisation and added more connections to the pool (the default being 5), the deadlock disappeared.
I am not sure what is using the 5th connection, or if the default happens to be 4 and not 5 for some reason, but this still seems to tick all the boxes.
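For reference, this is roughly the pool change I mean (a sketch against the Sequelize 3.x constructor; option names may differ in other versions): raising max lets a transaction that needs a second query still get a connection instead of deadlocking behind the transactions waiting on its lock.

    // Sequelize initialisation with a larger connection pool (sketch).
    import Sequelize = require("sequelize");

    const sequelize = new Sequelize("mydb", "user", "password", {
      dialect: "postgres",
      pool: {
        max: 10,    // the default of 5 was exhausted in the scenario above
        min: 0,
        idle: 10000
      }
    });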
Another solution is to use the NOWAIT option in Postgres, so a transaction aborts when it asks for a lock and doesn't get it, depending on your use case.
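A hedged sketch of that via a raw query (the table name is made up): FOR UPDATE NOWAIT makes Postgres fail immediately with error 55P03 (lock_not_available) instead of blocking, so the caller can retry or report "try again later".

    // Acquiring a row lock without waiting (illustrative sketch).
    import Sequelize = require("sequelize");
    const sequelize = new Sequelize("postgres://user:password@localhost/mydb");

    async function tryLockBooking(id: number) {
      return sequelize.query(
        "SELECT * FROM bookings WHERE id = :id FOR UPDATE NOWAIT",
        { replacements: { id }, type: Sequelize.QueryTypes.SELECT }
      );
    }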
Hope it helps if someone else encounters the same issue.