Replication acknowledgement in PostgreSQL + BDR - multithreading

I'm using the libpq C library to test a PG + BDR replica set. I'd like to get acknowledgement that each CRUD operation has been replicated. My goal is to keep my own log of the replication time in milliseconds or, if possible, in microseconds.
The program:
Starts 10-20 threads with separate connections; each thread runs 1000-5000 cycles of basic CRUD operations on three tables.
Which would be the best way?
Should I parse some high-verbosity logs, if they contain the right data with timestamps, or should my C program start N threads (N = {number of nodes} - {the master I'm connected to}) after every CRUD operation and query the other nodes for the data?

You can't get replay confirmation of individual xacts easily. The system keeps track of the log sequence number replayed by peer nodes but not what transaction IDs those correspond to, since it doesn't care.
What you seem to want is near-synchronous or semi-synchronous replication. There's some work coming there for 9.6 that will hopefully benefit BDR in time, but that's well in the future.
In the meantime you can see the log sequence number as restart_lsn in pg_replication_slots. This is not the position the replica has replayed to; it is the oldest point at which it might have to restart replay after a crash.
You can see the other LSN fields like replay_location only when a replica is connected in pg_stat_replication. Unfortunately in 9.4 there's no easy way to see which slot in pg_replication_slots is associated with which active connection in pg_stat_replication (fixed in 9.5, but BDR is based on 9.4 still). So you have to use the application_name set by BDR if you want to pick out individual nodes, and it's ... "interesting" to parse. Also often truncated.
You can get the current LSN of the server you committed an xact on after committing it by calling SELECT pg_current_xlog_location(); which will return a value like 0/19E0F060 or whatever. You can then look that value up in the pg_stat_replication of peer nodes until you see that the replay_location for the node you committed on has reached or passed the LSN you captured immediately after commit.
It's not perfect. There could be other work done between when you commit and when you capture the server's current LSN. There's no way around that, but at worst you wait slightly too long. If you're using BDR you shouldn't be caring about micro or even milliseconds anyway, since it's an asynchronous replication solution.
The principles are pretty similar to measuring replication lag for normal physical standby servers, so I suggest reading some docs on that. The exception is that pg_last_xact_replay_timestamp() won't work for logical replication, so you can't get lag that way; you have to use the LSNs and do your own timing client-side.
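The LSN comparison and client-side timing described above can be sketched as follows. This is a minimal illustration in Python rather than libpq C. The names `parse_lsn`, `wait_for_replay`, and the `fetch_replay_lsn` callable are hypothetical; in a real test, the callable would run a query against pg_stat_replication on the peer node and return the replay_location for the node you committed on.

```python
import time


def parse_lsn(text):
    """Convert an LSN such as '0/19E0F060' into a single 64-bit integer.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both in hexadecimal.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_replay(fetch_replay_lsn, commit_lsn, timeout=5.0, interval=0.01):
    """Poll until the peer's replay position reaches commit_lsn.

    fetch_replay_lsn is a hypothetical callable (an assumption of this
    sketch) that returns the peer's current replay_location as text, or
    None if the peer is not connected. Returns the elapsed time in
    seconds, or None if the timeout expires first.
    """
    target = parse_lsn(commit_lsn)
    start = time.monotonic()
    while True:
        current = fetch_replay_lsn()
        if current is not None and parse_lsn(current) >= target:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

The measured time is an upper bound: it includes the polling interval and any work committed between your COMMIT and the call that captured the LSN.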

Related

Dealing with eventual consistency in Cassandra

I have a 3 node cassandra cluster with RF=2. The read consistency level, call it CL, is set to 1.
I understand that whenever CL=1, a read repair would happen when a read is performed against Cassandra, if it returns inconsistent data. I like the idea of having CL=1 instead of setting it to 2, because then even if a node goes down, my system would run fine. Thinking in terms of the CAP theorem, I would like my system to be AP instead of CP.
The read requests are infrequent (more like 2-3 per second), but are very important to the business. They are performed against log-like data (which is immutable, and hence never updated). My temporary fix for this is to run the query more than once, say 3 times, instead of running it once. This way, I can be sure that even if I don't get my data in the first read request, the system would trigger read repairs, and I would eventually get my data during the 2nd or 3rd read request. Of course, these 3 queries happen one after the other, without any blocking.
Is there any way that I can direct Cassandra to perform read repairs in the background without having the need to actually perform a read request in order to trigger a repair?
Basically, I am looking for ways to tune my system in a way as to circumvent the 'eventual consistency' model, by which my reads would have a high probability of succeeding.
Help would be greatly appreciated.
"reads would have a high probability of succeeding"
Look at DowngradingConsistencyRetryPolicy. This policy retries a query at a lower CL than the initial one. With this policy your queries will have strong consistency when all nodes are available, and you will not lose availability if a node fails.

cassandra kafka connect source and eventual consistency

I am thinking about using Kafka Connect to stream updates from Cassandra to a Kafka topic. The existing connector from StreamReactor seems to use a timestamp or uuidtimestamp to extract new changes since the last poll. The value of the timestamp is inserted using now() in the insert statement. The connector then saves the maximum time it received last time.
Since Cassandra is eventually consistent, I am wondering what actually happens when doing repeated queries using a time range to get new changes. Is there not a risk of missing rows inserted into Cassandra because they "arrived late" at the node queried when using WHERE create >= maxTimeFoundSoFar?
Yes, rows can still arrive for a time range that your "cursor" has already passed if you are using consistency level ONE for reading and writing. Even with a higher consistency level you might run into "problems", depending on the setup that you have. Basically, there are a lot of things that can go wrong.
You can reduce the chance of this by using an old Cassandra formula, NUM_NODES_RESPONDING_TO_READ + NUM_NODES_RESPONDING_TO_WRITE > REPLICATION_FACTOR, but since you are using now() from Cassandra, the node clocks might have millisecond offsets between them, so you might still miss data if it arrives at high frequency. I know of some systems where people actually use Raspberry Pis with GPS modules to keep the clock skew really tight :)
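The formula above is simple arithmetic, and a small check makes the overlap condition concrete. This is only an illustrative sketch; the function names `quorum` and `read_sees_latest_write` are made up for this example and are not part of any Cassandra driver.

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority of the replica set."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set.

    If R + W > RF, at least one replica that answers the read also
    acknowledged the write, so the read can return the latest value.
    """
    return read_replicas + write_replicas > replication_factor
```

For example, with RF=3, reading and writing at quorum gives 2 + 2 > 3, which satisfies the condition, while reading and writing at ONE gives 1 + 1 > 3, which does not. Clock skew between nodes is a separate problem that this condition does not address.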
You would have to tell us more about your use case, but in reality, yes, you can skip some inserts if you are not "careful". Even then there is no 100% guarantee, other than processing the data with an offset large enough for new data to come in and settle.
Basically, you would keep a moving time window in the past and move it along, making sure that you don't take into account anything newer than, let's say, the last minute. That way you are making sure the data is "settling".
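The moving window described above can be sketched like this. It is a minimal illustration; `next_window` and the one-minute default for `settle` are assumptions made for the example, not something the connector provides.

```python
from datetime import datetime, timedelta


def next_window(last_upper, now, settle=timedelta(minutes=1)):
    """Return the (lower, upper) bounds of the next poll, or None.

    Rows are meant to be read with lower <= created < upper. The upper
    bound trails the current time by `settle`, so rows that arrive late
    at the queried node have a chance to land before they are read.
    """
    upper = now - settle
    if upper <= last_upper:
        return None  # nothing has settled since the previous poll
    return last_upper, upper
```

The caller stores the returned upper bound and passes it back as `last_upper` on the next poll, so consecutive windows never overlap and never leave a gap.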
I had some use cases where we processed sensor data that would come in with multiple days of delay. On some projects we simply ignored it; on others the data was for month-level reporting, so we always reprocessed the old data and added it to the reporting database, i.e. we kept a time window 3 days back in history.
It just depends on your use case.

Using Cassandra as a Queue

Using Cassandra as Queue:
Is it really that bad?
Setup: 5 node cluster, all operations execute at quorum
Using DateTieredCompaction should significantly reduce the cost of tombstones, and allow entire SSTables to be dropped at once.
We add all messages to the queue with the same TTL
We partition messages based on time (say 1 minute intervals), and keep track of the read-position.
Messages consumed will be explicitly deleted. (only 1 thread extracts messages)
Some messages may be explicitly deleted prior to being read (i.e. we may have tombstones after the read-position), so the TTL initially used is an upper limit. gc_grace would probably be set to 0, as quorum reads will do blocking repair (i.e. we can have repair turned off, as messages only reside in 1 cluster (DC), and all operations run at quorum).
Messages can be added/deleted only, no updates allowed.
In our use case, if a tombstone does not replicate it's not a big deal; it's OK for us to see the same message multiple times occasionally. (Also, we would likely not run repair on a regular basis, as all operations are executing at quorum.)
Thoughts?
Generally, it is an anti-pattern. This link discusses the impact of tombstones in detail: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
My opinion is: try to avoid it if possible, but if you really understand the performance impact and it is not an issue in your architecture, of course you could do that.
Another reason not to do it, if possible, is that the Cassandra data structure is not designed for queues. It will always look ugly, UGLY!
Strongly suggest to consider Redis or RabbitMQ before making your final decision.

how to rapidly increment counters in Cassandra w/o staleness

I have a Cassandra question. Do you know how Cassandra does updates/increments of counters?
I want to use a Storm bolt (CassandraCounterBatchingBolt from the storm-contrib repo on GitHub) which writes into Cassandra. However, I'm not sure how some of the implementation of the incrementCounterColumn() method works, and there are also the limitations of Cassandra counters (from: http://wiki.apache.org/cassandra/Counters), which make them useless for my scenario IMHO:
If a write fails unexpectedly (timeout or loss of connection to the coordinator node) the client will not know if the operation has been performed. A retry can result in an over count CASSANDRA-2495.
Counter removal is intrinsically limited. For instance, if you issue very quickly the sequence "increment, remove, increment" it is possible for the removal to be lost
Anyway, here is my scenario:
I update the same counter faster than the updates propagate to other Cassandra nodes.
Example:
Say I have 3 cassandra nodes. The counters on each of these nodes are 0.
Node1:0, node2:0, node3:0
An increment comes: 5 -> Node1:0, node2:0, node3:0
Increment starts at node 2 – still needs to propagate to node1 and node3
Node1:0, node2:5, node3:0
In the meantime, another increment arrives before the previous increment is propagated: 3 -> Node1:0, node2:5, node3:0
Assuming 3 starts at a different node than where 5 started we have:
Node1:3, node2:5, node3:0
Now if 3 gets propagated to the other nodes AS AN INCREMENT and not as a new value (and the same for 5) then eventually the nodes would all equal 8 and this is what I want.
If 3 overwrites 5 (because it has a later timestamp) this is problematic – not what I want.
Do you know how these updates/increments are handled by Cassandra?
Note that a read before a write is still susceptible to the same problem, depending on which replica node the read executes against (quorum can still fail if propagation is not far along).
I'm also thinking that maybe putting a cache b/w my storm bolt and Cassandra might solve this issue but that's a story for another time.
Counters in C* have a complex internal representation that avoids most (but not all) problems of counting things in a leaderless distributed system. I like to think of them as sharded counters. A counter consists of a number of sub-counters identified by host ID and a version number. The host that receives the counter operation increments only its own sub-counter, and also increments the version. It then replicates its whole counter state to the other replicas, which merge it with their states. When the counter is read the node handling the read operation determines the counter value by summing up the total of the counts from each host.
On each node a counter increment is just like everything else in Cassandra, just a write. The increment is written to the memtable, and the local value is determined at read time by merging all of the increments from the memtable and all SSTables.
I hope that explanation helps you believe me when I say that you don't have to worry about incrementing counters faster than Cassandra can handle. Since each node keeps its own counter, and never replicates increment operations, there is no possibility of counts getting lost by race conditions like a read-modify-write scenario would introduce. If Cassandra accepts the write, you're pretty much guaranteed that it will count.
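The sub-counter model described above can be illustrated with a small in-memory sketch. This is a simplified model of the idea, not Cassandra's actual implementation; the class `ShardedCounter` and its method names are invented for the example.

```python
class ShardedCounter:
    """One replica's view of a counter: host_id -> (version, count)."""

    def __init__(self, host_id):
        self.host_id = host_id
        self.shards = {}

    def increment(self, delta):
        # A host only ever changes its own sub-counter and bumps its version.
        version, count = self.shards.get(self.host_id, (0, 0))
        self.shards[self.host_id] = (version + 1, count + delta)

    def merge(self, other):
        # Replication ships whole sub-counter state; for each host the
        # state with the higher version wins.
        for host, (version, count) in other.shards.items():
            if version > self.shards.get(host, (0, 0))[0]:
                self.shards[host] = (version, count)

    def value(self):
        # A read sums the sub-counters of all hosts.
        return sum(count for _, count in self.shards.values())
```

Applied to the scenario in the question, an increment of 5 on node2 and an increment of 3 on node1 live in different sub-counters, so after the replicas exchange state each one reads 8. Merging the same state twice changes nothing, which is why replicating state instead of operations avoids double counting between replicas.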
What you're not guaranteed, though, is that the count will appear correct at all times. If an increment is written to one node but the counter value is read from another just after, there is no guarantee that the increment has been replicated, and you also have to consider what would happen during a network partition. This is more or less the same with any write in Cassandra; it's in its eventually consistent nature, and it depends on which consistency levels you used for the operations.
There is also the possibility of a lost acknowledgement. If you do an increment and lose the connection to Cassandra before you can get the response back, you can't know whether or not your write got through. And when you get the connection back you can't tell either, since you don't know what the count was before you incremented. This is an inherent problem with systems that choose availability over consistency, and the price you pay for many of the other benefits.
Finally, the issue of rapid remove, increment, remove sequences is real, and something you should avoid. The problem is that the increment operation will essentially resurrect the column, and if these operations come close enough to each other they might get the same timestamp. Cassandra is strictly last-write-wins and determines last based on the timestamp of the operation. If two operations have the same timestamp, the "greater" one wins, which means the one which sorts after in a strict byte order. It's real, but I wouldn't worry too much about it unless you're doing very rapid writes and deletes to the same value (which is probably a fault in your data model).
Here's a good guide to the internals of Cassandra's counters: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf
The current version of counters is just not a good fit for a use case that requires guarantees of no over-counting and immediate consistency.
There are increment and decrement operations, and those will not collide with each other, and, barring any lost mutations or replayed mutations, will give you a correct result.
The rewrite of Cassandra counters (https://issues.apache.org/jira/browse/CASSANDRA-6504) might be interesting to you, and it should address all of the current concerns with getting a correct count.
In the meantime, if I had to implement this on top of a current version of Cassandra, and an accurate count was essential, I would probably store each increment or decrement as a column, and do read-time aggregation of the results, while writing back a checkpoint so you don't have to read back to the beginning of time to calculate subsequent results.
That adds a lot of burden to the read side, though it is extremely efficient on the write path, so it may or may not work for your use case.
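The approach of storing each increment as its own column and aggregating at read time can be sketched in memory as follows. This is a rough illustration only; `LoggedCounter`, the sequence numbers, and the in-memory list stand in for whatever column layout and checkpoint storage you would actually use.

```python
class LoggedCounter:
    """Append-only log of deltas with a checkpoint written back on read."""

    def __init__(self):
        self.events = []          # (sequence, delta) pairs, never updated
        self.checkpoint = (0, 0)  # (last sequence folded in, running total)
        self.next_seq = 1

    def add(self, delta):
        # The write path only appends; it never reads the current value.
        self.events.append((self.next_seq, delta))
        self.next_seq += 1

    def read(self):
        last_seq, total = self.checkpoint
        # A real store would query only the events after the checkpoint.
        for seq, delta in self.events:
            if seq > last_seq:
                total += delta
                last_seq = seq
        self.checkpoint = (last_seq, total)  # avoid rescanning next time
        return total
```

A positive delta is an increment and a negative delta is a decrement. Each read folds the new events into the checkpoint, so later reads only need the events written since then.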
To understand updates/increments, i.e. write operations, I suggest you read about Gossip, the protocol Cassandra uses for communication. In Gossip every participant (node) maintains its state using the tuple σ(K) = (V, N), where σ(K) is the state of key K, with V as the value and N as the version number.
To maintain a single version of the truth for a data packet, Gossip uses a reconciliation mechanism, namely Precise or Scuttlebutt (current). According to Scuttlebutt reconciliation, before updating any tuple the nodes communicate with each other to check who is holding the highest version (newest value) of the key. Whoever is holding the highest version is responsible for the write operation.
For further information read this article.

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table, and a Python program running on each of the Windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients poll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
"Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?"
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (xp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually set up to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
Unlike Postgres and SQL Server (or object stores like CouchDB), MySQL does not emit database events. However, there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
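As a rough illustration of the trigger-plus-changes-table pattern, here is a self-contained sketch. It uses Python's built-in sqlite3 module and SQLite trigger syntax purely so the example can run on its own; MySQL's CREATE TRIGGER syntax differs, and the table, trigger, and event names are invented for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE changes (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    event TEXT NOT NULL
);
-- The trigger filters: only inserts of 'new' jobs are queued as events.
CREATE TRIGGER jobs_after_insert AFTER INSERT ON jobs
WHEN NEW.state = 'new'
BEGIN
    INSERT INTO changes (job_id, event) VALUES (NEW.id, 'job_created');
END;
"""


def drain_changes(conn):
    """Return queued events in order and remove them from the queue."""
    rows = conn.execute(
        "SELECT id, job_id, event FROM changes ORDER BY id"
    ).fetchall()
    if rows:
        # Delete only what was read, so events queued later are kept.
        conn.execute("DELETE FROM changes WHERE id <= ?", (rows[-1][0],))
    return [(job_id, event) for _, job_id, event in rows]
```

Because the events are persisted rows, a worker that was down simply finds them waiting the next time it drains the table.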
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both MEMORY and MyISAM have only full-table locks, so your trigger on an InnoDB table might hit a bottleneck when performing an insert into a MEMORY or MyISAM table. You may also require InnoDB for the changes table if you're using an ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of your changes table to see if there's something to process. This feature won't work for InnoDB tables.
These articles describe in more depth some alternative ways to implement queues in MySQL and even avoid polling!
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you
