cassandra kafka connect source and eventual consistency - cassandra

I am thinking about using Kafka connect to stream updates from Cassandra to a Kafka topic. The existing connector from StreamReactor seems to use a timestamp or uuidtimestamp to extract new changes since the last poll. The value of the timestamp is inserted using now() in the insert statement. The connector then saves the maximum time is received last time.
Since Cassandra is eventually consistent I am wondering what actually happens when doing repeated queries using a time range to get new changes. Is there not risk to miss rows inserted into Cassandra because it "arrived late" to the node queried when using WHERE create >= maxTimeFoundSoFar?

Yes it might happen that you have newer data in front of your "cursor" when you already went on with processing if you are using consistency level one for reading and writing, but even if you use higher consistency you might run into "problems" depending on the setup that you have. Basically there are a lot of things that can go wrong.
You can increase the chances of not doing this by using an old cassandra formula NUM_NODES_RESPONDING_TO_READ + NUM_NODES_RESPONDING_TO_WRITE > REPLICATION_FACTOR but since you are using now() from cassandra the node clocks might have millisecond offsets between them so you might even miss data if you have high frequency data. I know of some systems where people are actually using raspberry pi's with gps modules to keep the clock skew really tight :)
You would have to provide more about your use case but in reality yes you can totally skip some inserts if you are not "careful" but even then there is no 100% guarantee other then you process the data with some offset that would be enough for the new data to come in and settle.
Basically you would have to keep some moving time window in the past and then move it along plus making sure that you don't take into account anything newer than the let's say last minute. That way you are making sure the data is "settling".
I had some use cases where we processed sensory data that would came in with multiple days of delay. On some projects we simply ignored it on some the data was for reporting on the month level so we always processed the old data and added it to reporting database. i.e. we kept a time window 3 days back in history.
It just depends on your use case.

Related

Does Watermark in Update output mode clean the stored state in Spark Structured Streaming?

I am working on a spark streaming application and while understanding about the sinks and watermarking logic, I couldn't find a clear answer as to if I use a watermark with say 10 min threshold while outputting the aggregations with update output mode, will the intermittent state maintained by spark be cleared off after the 10 min threshold has expired?
Watermark allows late arriving data to be considered for inclusion against already computed results for a period of time using windows. Its premise is that it tracks back to a point in time (threshold) before which it is assumed no more late events are supposed to arrive, but if they do, they are discarded.
As a consequence one needs to maintain the state of window / aggregate already computed to handle these potential late updates based on event time. However, this costs resources, and if done infinitely, this would blow up a Structured Streaming App.
Will the intermittent state maintained by spark be cleared off after the 10 min threshold has expired? Yes, it will. There is by design as there is no point holding any longer a state that can no longer be updated due to the threshold having been expired.
You need to run through some simple examples as I note it is easy to forget the subtlety of output.
See
Why does streaming query with update output mode print out all rows?
which gives an excellent example of update mode output as well. Also this gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

How to store data in cassandra in a temporary location while sync is in progress?

We have a mysql server running which is serving application writes. To do some batch processing we have written a sync job to migrate data into cassandra cluster.
1. A daily sync job which transfers by updated timestamp for that day.
2. A complete sync job which transfers complete data, overriding existing ones.
Now there may be a possibility that the row was deleted from mysql, in that case using the above approach it will lie forever in cassandra.
To solve that problem we have given a TTL of 15 days for every row. So eventually it will get deleted, if it was not deleted then in next full sync the TTL will be over written again.
Its working fine as far as the use case is concerned but the issue is that in full sync complete data is over written and sstable is generated continuously with compactions happenning all the time, load averages shoot up with slowness and backup size increases (which could have been avoided).
Essentially we would want to replace the existing table data by new data but we dont want to truncate before starting the job but only after job completes.
Is there any way by which this can be solved other than creating a new table altogether and dropping past table when data is generated?
You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and possible rollback if things go wrong. The downside is the amount of work required in term of releases & codes

Replication acknowledgement in PostgreSQL + BDR

I'm using libpq C Library for testing PG + BDR replica set. I'd like to get acknowledgement of the CRUD operations' replication. My purpose is to make my own log of the replication time in milliseconds or if possible in microseconds.
The program:
Starts 10-20 threads witch separate connections, each thread makes 1000-5000 cycles of basic CRUD operations on three tables.
Which would be the best way?
Parsing some high verbosity logs if they have proper data with time stamp or in my C api I should start N thread (N = {number of nodes} - {the master I'm connected to}) after every CRUD op. and query the nodes for the data.
You can't get replay confirmation of individual xacts easily. The system keeps track of the log sequence number replayed by peer nodes but not what transaction IDs those correspond to, since it doesn't care.
What you seem to want is near-synchronous or semi-synchronous replication. There's some work coming there for 9.6 that will hopefully benefit BDR in time, but that's well in the future.
In the mean time you can see the log sequence number as restart_lsn in pg_replication_slots. This is not the position the replica has replayed to, but it's the oldest point it might have to restart replay at after a crash.
You can see the other LSN fields like replay_location only when a replica is connected in pg_stat_replication. Unfortunately in 9.4 there's no easy way to see which slot in pg_replication_slots is associated with which active connection in pg_stat_replication (fixed in 9.5, but BDR is based on 9.4 still). So you have to use the application_name set by BDR if you want to pick out individual nodes, and it's ... "interesting" to parse. Also often truncated.
You can get the current LSN of the server you committed an xact on after committing it by calling SELECT pg_current_xlog_location(); which will return a value like 0/19E0F060 or whatever. You can then look that value up in the pg_stat_replication of peer nodes until you see that the replay_location for the node you committed on has reached or passed the LSN you captured immediately after commit.
It's not perfect. There could be other work done between when you commit and when you capture the server's current LSN. There's no way around that, but at worst you wait slightly too long. If you're using BDR you shouldn't be caring about micro or even milliseconds anyway, since it's an asynchronous replication solution.
The principles are pretty similar to measuring replication lag for normal physical standby servers, so I suggest reading some docs on that. Except that pg_last_xact_replay_timestamp() won't work for logical replication, so you can't get lag using that, you have to use the LSNs and do your own timing client-side.

Cassandra - Distributing data and Multiple tables (Data modeling)

I am trying to learn cassandra. One thing I am not clear is how to ask Cassandra to distribute various tables. ie say I have time series data coming into table t1,t2,t3
T1 is heavily loaded ( ratio is 2000: 2:4 for num of rows).
I want the data of T1 for a given day to be not on the same machine as T2 or T3; so my queries are equally distributed ie not put too much load on one machine.
Also as the data gets older, its queried less, how can I take into account this factor.
regards
Cassandra is automatically distributed, you do not have a direct control on how the data gets distributed. In most cases, by default it makes use of an md5 on the row key and depending on that selects which nodes (computers) will be used to save the data.
What you are talking about would be more of a planning for a standard SQL database. However, if you generate extremely large amount of statistical data that is only to be used by some backend processes and users, you could have a separate cluster of 2 or 3 nodes. That way your other tables would not be affected by those statistics.
However, the true power of Cassandra is to be used with one large cluster. If it slows down, add nodes to it and do the necessary repair to spread the data properly. That's it... pretty much.
As for the way a table is used, you can use all the parameters defined on a table to tweak its setup. If you mainly do writes to a table, then you can tweak the parameters to get faster writes and slower reads. The other way around is also available: one write, many reads. And also many writes and many reads. To tweak those settings, in most cases you will need to have your software running and gather various stats and make changes as time passes.
Update:
There is actually a solution, thinking about it, just... I never use that mode so I did not think about it.
When you use a cluster which supports sorted rows, you can use a specific row name and the data will then go to a specific node. Again, you do not directly have control over what goes where, but if you really really want to do it that way, that's probably the solution you are looking for.
In this case, the row name would start with a number such as 0x0001 for T1 data, and 0x0100 and 0x0200 for T2 and T3. Since you do not know where the data really goes and how Cassandra decides to use it, it is rather complicated to obtain the right results here. And if you change your cluster (i.e. add nodes) then all your assumptions of where the data goes may very well go to the toilet! (and that's not speaking of upgrading to a new version of Cassandra...)

Resources