FoundationDb: What is meaning of FDBException: Transaction is too old to perform reads or be committed - foundationdb

I am trying to execute a getRange command in fdbCli but it fails with
FDBException: Transaction is too old to perform reads or be committed
What is the meaning of this particular exception?
Does it mean by query took more than 5 sec to complete?

Fdb keeps a list of the transaction started within 5 sec. Also, data nodes only keep versions of the last 5sec. So if the read version is smaller than the last version kept by dataNodes, the dataNodes have no way to answer the request. That's why fdb throws this exception. the trick to evade from such exceptions is to split one huge time taking transaction to many small transactions. I also noticed fdb performs really well if the transaction time < 300ms.

Firstly - yes, you are correct (your query took more than 5 seconds to complete).
If the read request’s timestamp is older than 5 seconds, the storage server may have already flushed the data from its in-memory multi-version data structure to its on-disk single-version data structure. This means the storage server does not have the data older than the 5 seconds. So the client will receive the error you've mentioned.
NB: You can avoid this problem via the use of a RecordCursor and by using passing a continuation to your query.
More on continuations here.

Related

Synchronicity of Azure stored procedures [duplicate]

Can documentDb stored procedures run in parallel and update the same object? Will documentDb process them sequentially?
Consider the following scenario.
I have an app and I have 10000 coins to give away to my users when they complete a task. And I have the following object
{
remainingPoints: 10000
}
I have a stored procedure that subtracts 10 points from this object and adds them to the users' points.
Now lets say 10 users complete the task at the same time and I call the stored procedure 10 times at the same time, will DocDb execute them sequentially? Or will I have to execute the stored procedures sequentially?
I had similar questions when I first started using DocumentDB and got good answers here and in email from the DocumentDB product managers. Quoting:
Stored procedures ... get an isolated snapshot of the database for transactional support. The snapshot reflects the current state of the world (no stale data) at the time the sproc begins execution (strongly consistent).
Caveat – since stored procedures are operating on a snapshot, you can still get a stale read in a sproc if a new write come in from the outside world during execution.
Also, stored procedures will ALWAYS read their owns writes.
Sprocs are DocumentDB’s mechanism for multi-document transactions. Sproc writes are committed when a sproc successfully complete execution. If an exception is thrown, all work done in a sproc gets rolled back.
So if two are sprocs are running concurrently, they won’t see eachother’s writes.
If both sprocs happen to write to the same document (replace) – then the 2nd one will fail due to an etag mismatch when it attempts to commit writes.
From that, I went forward with my design making sure to use ETags in my writes as #Julian suggests. I also automatically retry up to 3 times each sproc execution to handle the case where they fail due to parallel operations among other reasons. In practice, I've never exceed the 3 retries (except in cases where my sproc had a bug) and I rarely even get a single retry.
I assume from the behavior that I observe that it sends each new sproc execution to a different replica until it runs out of replicas and then it queues them for sequential execution, so it's a hybrid of parallel and serial execution.
One other tip that I learned through experimentation is that you are better off doing pure read operations (no writes and no significant aggregation) client-side rather than in a sproc when you are on a heavily loaded system. I assume the advantage is because DocumentDB can satisfy different reads from different replicas in parallel. I have modularized my sproc code using the expandScript functionality of documentdb-utils to make sure that I use the exact same code for write validation, intra-document consistency, and derived fields both client-side and server-side, which is possible using node.js. Even if you are mostly .NET, you may want to use expandScripts to build your sprocs in a modular DRY way. You'll still need to run node.js in your build process to pre-process your sprocs or use Edge.NET (node running inside of .NET) to do so on the fly.
It will depend on the consistency you have choose for your collection. But the idea is that DocumentDb handle concurrency using etag and executes stored procedure on a snapshot of a document version, and commit the result only if the execution succeed.
See: https://azure.microsoft.com/en-us/documentation/articles/documentdb-faq/#develop
This thread may help too: Atomically increment an integer in a document in Azure DocumentDB

DocumentDB: How to run a query without timing out

I am new to the documentDb. I wrote a stored procedure that checks all records and update them under certain circumstances.
Current scenario:
It would run 100 records at a time, updates them and after running few times( taking 100 records at a time and updating) it is timing out.
Expectation
Run the script on all the records without timing out.
The document has close to a million records. So, running the same script multiple times manually is not a the way I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr; Keep calling the sproc with the query continuation token being passed back and forth.
A few thoughts:
There is no capacity of RUs for collections that will allow you to do all million in one call to the sproc.
Sprocs run in isolation on a single replica. This means that they can be transactional but their use will have lower throughput than a regular query that can use all replicas to satisfy the request, so unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your queries will max out and you'll have to run the query again with a continuation token.
If you must use a sproc... As you are probably aware since you have done the 100 at a time thing, each query returns a continuation token. You can actually add that to the package that you send back from your sproc when it times out. Then you can pass that back into another call to the same sproc and write your sproc to pick up where you left off. The documentdb-utils library for node.js automatically re-calls the sproc until done as long as you follow this pattern for writing your sprocs. If you are using node.js, you could use that (but it has not yet been upgraded to support partitioned collections) or you could write the equivalent in whatever platform you are using.

Replication acknowledgement in PostgreSQL + BDR

I'm using libpq C Library for testing PG + BDR replica set. I'd like to get acknowledgement of the CRUD operations' replication. My purpose is to make my own log of the replication time in milliseconds or if possible in microseconds.
The program:
Starts 10-20 threads witch separate connections, each thread makes 1000-5000 cycles of basic CRUD operations on three tables.
Which would be the best way?
Parsing some high verbosity logs if they have proper data with time stamp or in my C api I should start N thread (N = {number of nodes} - {the master I'm connected to}) after every CRUD op. and query the nodes for the data.
You can't get replay confirmation of individual xacts easily. The system keeps track of the log sequence number replayed by peer nodes but not what transaction IDs those correspond to, since it doesn't care.
What you seem to want is near-synchronous or semi-synchronous replication. There's some work coming there for 9.6 that will hopefully benefit BDR in time, but that's well in the future.
In the mean time you can see the log sequence number as restart_lsn in pg_replication_slots. This is not the position the replica has replayed to, but it's the oldest point it might have to restart replay at after a crash.
You can see the other LSN fields like replay_location only when a replica is connected in pg_stat_replication. Unfortunately in 9.4 there's no easy way to see which slot in pg_replication_slots is associated with which active connection in pg_stat_replication (fixed in 9.5, but BDR is based on 9.4 still). So you have to use the application_name set by BDR if you want to pick out individual nodes, and it's ... "interesting" to parse. Also often truncated.
You can get the current LSN of the server you committed an xact on after committing it by calling SELECT pg_current_xlog_location(); which will return a value like 0/19E0F060 or whatever. You can then look that value up in the pg_stat_replication of peer nodes until you see that the replay_location for the node you committed on has reached or passed the LSN you captured immediately after commit.
It's not perfect. There could be other work done between when you commit and when you capture the server's current LSN. There's no way around that, but at worst you wait slightly too long. If you're using BDR you shouldn't be caring about micro or even milliseconds anyway, since it's an asynchronous replication solution.
The principles are pretty similar to measuring replication lag for normal physical standby servers, so I suggest reading some docs on that. Except that pg_last_xact_replay_timestamp() won't work for logical replication, so you can't get lag using that, you have to use the LSNs and do your own timing client-side.

Custom in-memory cache

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this..
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads, can't be catched
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be catched for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
Read from DynamoDB takes up to 10ms. If we read BatchB from remote location, it would leave us with no time for calculations or we would have already been timed out.
My current idea is to load all BatchB records into memory of each node (that's only ~2MB). On startup, the system would connect to SQL server and fetch all records and then it would update them every 60 or 90 seconds. The question is what's the best way to do this?
I could simply read them all into a variable (array) in node.js and then use SetTimeout to update the array after 60-90s. But is the the best solution?
Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while in the process of updating it from remote location. While the 2MB are being received you've got yourself a partial copy of the data. I would hold on to the old cache until the new data is fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and is error-prone. (For example, you should not forget to delete records from the cache if they are no longer found in the remote location.) This approach conserves memory, but I don't suppose that 2MB is a big deal.

Data in Cassandra not consistent even with Quorum configuration

I encountered a consistency problem using Hector and Cassandra when we have Quorum for both read and write.
I use MultigetSubSliceQuery to query rows from super column limit size 100, and then read it, then delete it. And start another around.
I found that the row which should be deleted by my prior query is still shown from next query.
And also from a normal Column Family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.
More detail:
It has not happened not every time (1/10,000)
The time between the two queries is around 500 ms (but we found one pair of queries in which 2 seconds had elapsed between them, still indicating a consistency problem)
We use ntp as our cluster time synchronization solution.
We have 6 nodes, and replication factor is 3
I understand that Cassandra is supposed to be "eventually consistent", and that read may not happen before write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have Quorum or other consistency level configurations?
So first of all, is it the correct behavior of Cassandra, and if not, what data we need to analyze for further investment?
After check the source code with the system log, I found the root cause of the inconsistency.
Three factors cause the problem:
Create and update same record from different nodes
Local system time is not synchronized accurately enough (although we use NTP)
Consistency level is QUORUM
Here is the problem, take following as the event sequence
seqID NodeA NodeB NodeC
1. New(.050) New(.050) New(.050)
2. Delete(.030) Delete(.030)
First Create request come from Node C with local time stamp 00:00:00.050, assume requests first record in Node A and Node B, then later synchronized with Node C.
Then Delete request come from Node A with local time stamp 00:00:00.030, and record in node A and Node B.
When read request come, Cassandra will do version conflict merge, but the merge only depend on time stamp, so although Delete happened after Create, but the merge final result is "New" which has latest time stamp due to local time synchronization issue.
I also faced similar a issue. The issue occured because cassandra driver uses server timestamp by default to check which query is latest. However in latest version of cassandra driver they have changes it and now by default they are using client timestamp.
I have described the details of issue here
The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts
If you are reading and writing individual columns both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read is always after the write). If you are not seeing this, then something, somewhere, is wrong.
To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproduce the problem with times cross-checked between clients somehow, and b) maybe try to reproduce the problem with CL_ALL.
Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
I'm running into this problem with one of my clients/node. The other 2 clients I'm testing with (and 2 other nodes) run smoothly. I have a test that uses QUORUM in all reads and all writes and it fails very quickly. Actually some processes do not see anything from the others and others may always see data even after I QUORUM remove it.
In my case I turned on the logs and intended to test the feat with the tail -F command:
tail -F /var/lib/cassandra/log/system.log
to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:
tail: inotify cannot be used, reverting to polling: Too many open files
and from another thread this means that some processes will fail opening files. In other words, the Cassandra node is likely not responding as expected because it cannot properly access data on disk.
I'm not too sure whether this is related to the problem that the user who posted the question, but tail -F is certainly a good way to determine whether the limit of files was reached.
(FYI, I have 5 relatively heavy servers running on the same machine so I'm not too surprise about the fact. I'll have to look into increasing the ulimit. I'll report here again if I get it fixed in this way.)
More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
--------- Update 1
Just in case, I first tested using Java 1.7.0-11 from Oracle (as mentioned below, I first used a limit of 3,000 without success!) The same error would popup at about the same time when running my Cassandra test (Plus even with the ulimit of 3,000 the tail -F error would still appear...)
--------- Update 2
Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow the default upper limit of 3000 was not enough even though the old limit was only 1024.

Resources