Cassandra count use case - cassandra

I'm trying to figure out an appropriate use case for Casandra's counter functionality. I thought of a situation and I was wondering if this would be feasible. I'm not quite sure because I'm still experimenting with Cassandra so any advice would be appreciated.
Lets say you had a small video service, you record the log of views in Cassandra while recording what video was played, which user played it, country, referer etc. You obviously want to show a count of how many times that video was played would incrementing a counter every time you insert a play event be a good solution to this? Or would there be a better alternative. Counting all the events on read every time would take a pretty big performance hit and even if you cached the results the cache would be invalidated pretty quickly if you had a busy site.
Any advice would be appreciated!

Counters can be used for whatever you need to count within an application -- both "frontend" data and "backend" one. I personally use them to store user's behaviour information (for backend analysis) and frontend ratings (each operation a user do in my platform give to the user some points). There is no real limitation on use case -- the limitation is given by few technical limitations, the bigger coming to my mind:
a counter cf can be made only by counters columns (except PK, obviously)
counters can't be reset: to set 0 value to a counter you need to read and calculate before writing (with no guarantee about the fact that someone else updated before you)
no ttl and no indexing/deletion
As far as your video service it all depends on how you choose to model data -- if you find a valid model to hit few partitions each time you write/read and you have a good key distribution I don't see any real problem in its implementation.
btw: you tagged Cassandra 2.0 but if you have to use counters you should think about 2.1 for the reasons described here

Related

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, being that associated to each individual user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at time then you could rely on plain ints to perform the job.
The problem however is that you will need to perform a read-before-write anti-pattern. You could solve this as well, eg skipping the read part by caching your ints and performing in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you should still need to read these counters at some point, because if the number of messages a free user can send is bound to some value then you need to perform a check when they login/try to send a new message/look at the dashboard/etc and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to directly count them with a SELECT COUNT... type query, even if this could be become pretty inefficient very soon in the Cassandra world.

cassandra kafka connect source and eventual consistency

I am thinking about using Kafka connect to stream updates from Cassandra to a Kafka topic. The existing connector from StreamReactor seems to use a timestamp or uuidtimestamp to extract new changes since the last poll. The value of the timestamp is inserted using now() in the insert statement. The connector then saves the maximum time is received last time.
Since Cassandra is eventually consistent I am wondering what actually happens when doing repeated queries using a time range to get new changes. Is there not risk to miss rows inserted into Cassandra because it "arrived late" to the node queried when using WHERE create >= maxTimeFoundSoFar?
Yes it might happen that you have newer data in front of your "cursor" when you already went on with processing if you are using consistency level one for reading and writing, but even if you use higher consistency you might run into "problems" depending on the setup that you have. Basically there are a lot of things that can go wrong.
You can increase the chances of not doing this by using an old cassandra formula NUM_NODES_RESPONDING_TO_READ + NUM_NODES_RESPONDING_TO_WRITE > REPLICATION_FACTOR but since you are using now() from cassandra the node clocks might have millisecond offsets between them so you might even miss data if you have high frequency data. I know of some systems where people are actually using raspberry pi's with gps modules to keep the clock skew really tight :)
You would have to provide more about your use case but in reality yes you can totally skip some inserts if you are not "careful" but even then there is no 100% guarantee other then you process the data with some offset that would be enough for the new data to come in and settle.
Basically you would have to keep some moving time window in the past and then move it along plus making sure that you don't take into account anything newer than the let's say last minute. That way you are making sure the data is "settling".
I had some use cases where we processed sensory data that would came in with multiple days of delay. On some projects we simply ignored it on some the data was for reporting on the month level so we always processed the old data and added it to reporting database. i.e. we kept a time window 3 days back in history.
It just depends on your use case.

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think to implement this is a different counter tables for each each permutation of source, user, and predefined time. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for read and don't worry about redundant storage. And since within every table data is stored sequentially according to the index, then you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given in "Cassandra Day" events. You may find it on "Planet Cassandra" or John Haddad's blog.

Users last-access time with CouchDB

I am new to CouchDB, but that is not related to the problem. The question is simple, yet not clear to me.
For example: Boris was on the site 5 seconds ago and viewing his profile Ivan sees it.
How to correctly implement this feature (users last-access time)?
The problem is that, if we update users profile document in CouchDB, for ex. property last_access_time, each time a page is refreshed, than we will have the most relevant information (with MySQL we did it this way), but on the other hand, we will have _rev of the document somewhere about 100000++ by the end of the day.
So, how do you do that or do you have any ideas?
This is not a full answer but a possible optimization. It will work in addition to any other answers here.
Instead of storing the latest timestamp, update the timestamp only if it has changed by e.g. 5 seconds, or 60 seconds.
Assume a user refreshes every second for a day. That is 86,400 updates. But if you only update the timestamp at 5-second intervals, that is 17,280; for 60-seconds it is 1,440.
You can do this on the client side. When you want to update the timestamp, fetch the current document and check the old timestamp. If it is less than 5 seconds old, don't do anything. Otherwise, update it normally.
You can also do it on the server side. Write an _update function in CouchDB, which you can query like e.g. POST /db/_design/my_app/_update/last-access/the_doc_id?time=2011-01-31T05:05:31.872Z. The update function will do the same thing: check the old timestamp, and either do nothing, or update it, depending on the elapsed time.
If there was (a large) part of a document that is relatively static, and (a small) part that is highly dynamic, I would consider splitting it into two different documents.
Another option might be to use something more suited to the high write throughput of small pieces of data of that nature such as Redis or possibly MongoDB, and (if necessary) have a background task to occasionally write the info to CouchDB.
CouchDB has no problem with rapid document updates. Just do it, like MySQL. High _rev is no problem.
The only thing is, you have to be responsible about your couch from day 1. All CouchDB users must do this anyway, however you may have to do it sooner. (Applications with few updates have lower risk of a full disk, so developers can postpone this work.)
Poll your database and run compaction if it needs it (based on size, document count, seq_id number)
Poll your views and run compaction too
Always have enough disk capacity and i/o bandwidth to support compaction. Mathematical worst-case: you need 2x the database size, and 2x the write speed; however, most applications require less. Since you are updating documents, not adding them, you will need way less.

Resources