Cassandra distinct counting

I need to count a bunch of "things" in Cassandra.
I need to increment ~100-200 counters every few seconds or so.
However, I need to count distinct "things".
To avoid counting something twice, I set a key in a CF, which the program reads before incrementing the counter, e.g. something like:
result = get cf[key][x];          // one read per item, to see if x was already counted
if (result == NULL) {
    set cf[key][x] = 1;           // remember that x has been seen
    incr counter_cf[key][x];      // only then increment the counter
}
However this read operation slows down the cluster a lot.
I tried to reduce the number of reads by packing several keys into one row, e.g. something like:
result = get cf[key];              // one read now covers several keys
if (result[key1] == NULL) {
    set cf[key1][x] = 1;
    incr counter_cf[key1][x];
}
if (result[key2] == NULL) {
    set cf[key2][x] = 1;
    incr counter_cf[key2][x];
}
//etc....
That reduced the reads from 200+ to about 5-6, but it still slows down the cluster.
I do not need exact counting, but I cannot use bit masks or Bloom filters,
because there will be 1M+ counters and some could go above 4,000,000,000.
I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+) either.
At the moment I am thinking of using Tokyo Cabinet as an external key/value store,
but this solution, if it works, will not be as scalable as Cassandra.

Using Cassandra for distinct counting is not ideal when the number of distinct values is large. Any time you need to do a read before a write you should ask yourself whether Cassandra is the right choice.
If the number of distinct items is smaller you can just store them as column keys and do a count. A count is not free, Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be ok. I assume you've already considered this option and that it's not feasible for you, I just thought I'd mention it.
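As an illustration of that option, here is a minimal sketch using the DataStax Python driver; the keyspace, table name, and schema are assumptions for illustration, not anything from the question (think of something like CREATE TABLE distinct_items (counter_key text, item text, PRIMARY KEY (counter_key, item))):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace

    def mark_seen(counter_key, item):
        # inserting the same (counter_key, item) twice is a no-op, so no
        # read-before-write is needed to avoid double counting
        session.execute(
            "INSERT INTO distinct_items (counter_key, item) VALUES (%s, %s)",
            (counter_key, item))

    def distinct_count(counter_key):
        # Cassandra still has to assemble the whole row to count it, so this
        # only stays cheap while the number of distinct items is modest
        return session.execute(
            "SELECT COUNT(*) FROM distinct_items WHERE counter_key = %s",
            (counter_key,)).one()[0]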
The way people typically do it is to keep the HLLs or Bloom filters in memory and then flush them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.
Even if you switched to something else, for example to something where you can do bit operations on values, you still need to guard against race conditions. I suggest that you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do. When a node gets an increment operation for a key it does not have in memory it loads that state from Cassandra (or creates a new state if there's nothing in the database). If a node crashes the operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations you can be sure that a counter state is only ever touched by one node.
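A very rough sketch of that architecture, with plenty of assumptions: the storage wrapper, the flush interval, and the ack callback are all hypothetical, and the distinct structure is a plain Python set for clarity (in practice you would keep a serialized HLL per key so memory stays bounded):

    import time

    class CounterShard:
        # Owns a subset of counter keys and keeps their state in memory.
        def __init__(self, storage, flush_interval=10.0):
            self.storage = storage            # thin wrapper around Cassandra reads/writes
            self.flush_interval = flush_interval
            self.state = {}                   # key -> {"total": int, "seen": set}
            self.pending_acks = []
            self.last_flush = time.time()

        def handle(self, op):
            # op = {"key": ..., "item": ..., "ack": callable} delivered by the queue
            key = op["key"]
            if key not in self.state:
                # lazily load previously flushed state, or start fresh
                self.state[key] = self.storage.load(key) or {"total": 0, "seen": set()}
            entry = self.state[key]
            entry["total"] += op.get("delta", 1)
            entry["seen"].add(op["item"])     # distinct count = len(entry["seen"])
            self.pending_acks.append(op["ack"])
            if time.time() - self.last_flush >= self.flush_interval:
                self.flush()

        def flush(self):
            for key, entry in self.state.items():
                # persist the running total and the serialized distinct structure
                # (an HLL blob in practice; here simply the set itself)
                self.storage.save(key, entry["total"], entry["seen"])
            for ack in self.pending_acks:
                ack()                         # ack only after the state is persisted
            self.pending_acks.clear()
            self.last_flush = time.time()

Since the increments are sharded by key, only one node ever touches a given counter's state, so there are no read-modify-write races; crash recovery comes from the unacked messages being redelivered.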

Related

Spark count dataframe to estimate output partitions, then write, efficiently without caching?

As my Spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation, namely 200. I've learned how to control this, but ideally I would set the number of output partitions based on the amount of data I'm writing. Herein lies the conundrum: I need to first call count() on the dataframe and then write it. That means I may re-read it from S3 twice. I could cache and then count, but I've seen Spark crash when I cache this data; caching seems to use the most resources, whereas if I just write it, it can do something more optimal.
So my questions are: do you think this is a decent approach, doing a count first (the count is a proxy for the size on disk), or should you just hard-code some numbers and change them when you need to? And if I am going to count first, is there some clever way to optimize things so that the count and write share work, other than caching the whole dataframe?
Yes, the count approach is actually the correct way to go. Ideally you want your RDD partitions to be of some considerable size, like 50 MB, before writing. Otherwise you will end up with the "small files problem".
Now if you have a lot of data, caching it in memory could be hard. You could try MEMORY_AND_DISK, but then the data will spill to disk and cause a slowdown.
I have faced this predicament multiple times, and every time I have chosen a "magic number" for the number of partitions. The number is parameterized, so when I need to change it I don't need to change the code; I just pass a different parameter.
If you know your data size is generally in a particular range, you could hard-code the partition number. It is not ideal, but it gets the job done.
You could also publish metrics, like the size of the data in S3, and raise an alarm if it breaches some threshold, so that someone can change the partition number manually.
In general, if you keep the partition number moderately high, e.g. 5000 for approximately 500 GB of data, that works for a large range, roughly from 300 GB to 1.2 TB of data. This means you probably don't need to change the partition number too often if you have a moderate inflow of data.
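A rough PySpark sketch of both options, the count-based sizing and the parameterized "magic number"; the paths, the ~200 bytes-per-row estimate, and the config key are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("s3://my-bucket/input/")      # hypothetical input path
    agg = df.groupBy("some_key").count()                  # hypothetical aggregation

    # Option 1: count first and size the output partitions for roughly 50 MB each
    rows = agg.count()                                    # extra pass over the data
    est_bytes = rows * 200                                # assumed average output row size
    num_partitions = max(1, est_bytes // (50 * 1024 * 1024))

    # Option 2: skip the count and use a parameterized magic number instead
    # num_partitions = int(spark.conf.get("spark.myjob.outputPartitions", "5000"))

    agg.repartition(int(num_partitions)).write.parquet("s3://my-bucket/output/")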

Cassandra, Counters, and Write Conflicts

We are exploring using Cassandra as a way to store time-series-type data, so this may be somewhat of a noob question. One of the use cases is to read data from a Kafka stream, look for matches, and increment a counter (e.g. 5 customers have clicked through link alpha on page beta, so increment (beta, alpha) by 5). However, we expect a very wide degree of parallelism to keep up with the load, so there may be more than one consumer reading from Kafka at the same time.
My question is: How would Cassandra resolve multiple simultaneous writes to a given counter from multiple sources?
It's my understanding that multiple writes to the counter with different timestamps will be added to the counter in the timestamp order received. However, if there were to be simultaneous writes with the exact same timestamp, would the LWW model of Cassandra throw out one of those counter increments?
If we were to have a large cluster (100+ nodes), ALL or QUORUM writes may not be sufficiently performant to keep up with the message traffic. Writes with THREE would seem likely to result in a situation where process #1 writes to nodes A, B, and C, but process #2 might write to X, Y, and Z. Would LWTs work here, or do they not play well with counter activity?
I would try out a proof of concept and benchmark it; it will most likely work just fine. Counters are not super performant in Cassandra though, especially if there will be a lot of contention.
Counters are not like normal writes with a simple LWW; they use paxos with some pessimistic locking and specialized caches. The partition lock contention will slow it down some, and paxos is an expensive multi-network-hop process with reads before writes.
Use QUORUM; don't try to do something funky with consistency levels with counters, especially before benchmarking to know whether you need it. A 100-node cluster should be able to handle a lot, as long as you're not trying to update all the same partitions constantly.
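For reference, counter increments are expressed as relative deltas in the UPDATE statement, so concurrent writers add to the counter rather than overwrite it. A minimal sketch with the DataStax Python driver, assuming a hypothetical click_counts counter table keyed by (page, link):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace

    # counter update at QUORUM, as suggested above
    stmt = SimpleStatement(
        "UPDATE click_counts SET clicks = clicks + %s WHERE page = %s AND link = %s",
        consistency_level=ConsistencyLevel.QUORUM)

    # "5 customers clicked through link alpha on page beta"
    session.execute(stmt, (5, 'beta', 'alpha'))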

how to rapidly increment counters in Cassandra w/o staleness

I have a Cassandra question. Do you know how Cassandra does updates/increments of counters?
I want to use a Storm bolt (CassandraCounterBatchingBolt from the storm-contrib repo on GitHub) which writes into Cassandra. However, I'm not sure how some of the implementation of the incrementCounterColumn() method works, and there are also the limitations of Cassandra counters (from http://wiki.apache.org/cassandra/Counters) which make them useless for my scenario IMHO:
If a write fails unexpectedly (timeout or loss of connection to the coordinator node) the client will not know if the operation has been performed. A retry can result in an over count (CASSANDRA-2495).
Counter removal is intrinsically limited. For instance, if you issue the sequence "increment, remove, increment" very quickly, it is possible for the removal to be lost.
Anyway, here is my scenario:
I update the same counter faster than the updates propagate to other Cassandra nodes.
Example:
Say I have 3 Cassandra nodes. The counters on each of these nodes are 0:
node1: 0, node2: 0, node3: 0
An increment of 5 comes in: node1: 0, node2: 0, node3: 0
The increment starts at node2 and still needs to propagate to node1 and node3:
node1: 0, node2: 5, node3: 0
In the meantime, another increment arrives before the previous increment
has propagated: 3 -> node1: 0, node2: 5, node3: 0
Assuming 3 starts at a different node than where 5 started, we have:
node1: 3, node2: 5, node3: 0
Now if 3 gets propagated to the other nodes AS AN INCREMENT and not as a new value
(and the same for 5), then eventually the nodes will all equal 8, which is what I want.
If 3 overwrites 5 (because it has a later timestamp), that is problematic and not what I want.
Do you know how these updates/increments are handled by Cassandra?
Note that a read before a write is still susceptible to the same problem, depending on which replica node the read executes against (a QUORUM read can still miss the increment if propagation is not far along).
I'm also thinking that maybe putting a cache between my Storm bolt and Cassandra might solve this issue, but that's a story for another time.
Counters in C* have a complex internal representation that avoids most (but not all) problems of counting things in a leaderless distributed system. I like to think of them as sharded counters. A counter consists of a number of sub-counters identified by host ID and a version number. The host that receives the counter operation increments only its own sub-counter, and also increments the version. It then replicates its whole counter state to the other replicas, which merge it with their states. When the counter is read the node handling the read operation determines the counter value by summing up the total of the counts from each host.
On each node a counter increment is just like everything else in Cassandra, just a write. The increment is written to the memtable, and the local value is determined at read time by merging all of the increments from the memtable and all SSTables.
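A toy model (not Cassandra code) of the "sharded counter" representation described above: each replica keeps a map of host ID to (version, count), merges incoming states by keeping the highest version per host, and the read value is the sum of all sub-counters:

    # host -> (version, count); each host only ever bumps its own entry
    def merge(local, remote):
        merged = dict(local)
        for host, (version, count) in remote.items():
            if host not in merged or merged[host][0] < version:
                merged[host] = (version, count)
        return merged

    def read_value(state):
        return sum(count for _, count in state.values())

    # node A applied two local increments (+5, then +3); node B applied one (+4)
    a = {"A": (2, 8)}
    b = {"B": (1, 4)}
    print(read_value(merge(a, b)))   # 12 -- no increment is lost in the merge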
I hope that explanation helps you believe me when I say that you don't have to worry about incrementing counters faster than Cassandra can handle. Since each node keeps its own sub-counter, and never replicates increment operations, there is no possibility of counts getting lost through race conditions like a read-modify-write scenario would introduce. If Cassandra accepts the write, you're pretty much guaranteed that it will count.
What you're not guaranteed, though, is that the count will appear correct at all times. If an increment is written to one node but the counter value is read from another just after, there is no guarantee that the increment has been replicated, and you also have to consider what would happen during a network partition. This is more or less the same as with any write in Cassandra; it's in its eventually consistent nature, and it depends on which consistency levels you use for the operations.
There is also the possibility of a lost acknowledgement. If you do an increment and lose the connection to Cassandra before you can get the response back, you can't know whether or not your write got through. And when you get the connection back you can't tell either, since you don't know what the count was before you incremented. This is an inherent problem with systems that choose availability over consistency, and the price you pay for many of the other benefits.
Finally, the issue of rapid increment/remove/increment sequences is real, and something you should avoid. The problem is that the increment operation will essentially resurrect the column, and if these operations come close enough to each other they might get the same timestamp. Cassandra is strictly last-write-wins and determines "last" based on the timestamp of the operation. If two operations have the same timestamp, the "greater" one wins, which means the one that sorts after in a strict byte order. It's real, but I wouldn't worry too much about it unless you're doing very rapid writes and deletes to the same value (which is probably a fault in your data model).
Here's a good guide to the internals of Cassandra's counters: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf
The current version of counters are just not a good fit for a use case that requires guarantees of no over-counting and immediate consistency.
There are increment and decrement operations, and those will not collide with each other, and, barring any lost mutations or replayed mutations, will give you a correct result.
The rewrite of Cassandra counters (https://issues.apache.org/jira/browse/CASSANDRA-6504) might be interesting to you, and it should address all of the current concerns with getting a correct count.
In the meantime, if I had to implement this on top of a current version of Cassandra, and an accurate count was essential, I would probably store each increment or decrement as a column, and do read-time aggregation of the results, while writing back a checkpoint so you don't have to read back to the beginning of time to calculate subsequent results.
That adds a lot of burden to the read side, though it is extremely efficient on the write path, so it may or may not work for your use case.
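A hedged sketch of that read-time-aggregation approach with the Python driver, ignoring edge cases such as clock skew between writers; the tables and schema are assumptions for illustration, something like counter_events(name text, ts timeuuid, delta bigint, PRIMARY KEY (name, ts)) and counter_checkpoints(name text PRIMARY KEY, ts timeuuid, total bigint):

    from uuid import uuid1
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace

    def increment(name, delta):
        # every increment (or decrement) is its own row; a retry can reuse the
        # same timeuuid to stay idempotent
        session.execute(
            "INSERT INTO counter_events (name, ts, delta) VALUES (%s, %s, %s)",
            (name, uuid1(), delta))

    def read_total(name):
        cp = session.execute(
            "SELECT ts, total FROM counter_checkpoints WHERE name = %s",
            (name,)).one()
        base_ts, total = (cp.ts, cp.total) if cp else (None, 0)
        if base_ts is None:
            rows = session.execute(
                "SELECT ts, delta FROM counter_events WHERE name = %s", (name,))
        else:
            rows = session.execute(
                "SELECT ts, delta FROM counter_events WHERE name = %s AND ts > %s",
                (name, base_ts))
        last_ts = base_ts
        for row in rows:               # read-time aggregation of the deltas
            total += row.delta
            last_ts = row.ts
        if last_ts is not None and last_ts != base_ts:
            # write back a checkpoint so later reads don't start from the beginning
            session.execute(
                "INSERT INTO counter_checkpoints (name, ts, total) VALUES (%s, %s, %s)",
                (name, last_ts, total))
        return total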
To understand updates/increments, i.e. write operations, I suggest you read up on Gossip, the protocol used by Cassandra for inter-node communication. In Gossip every participant (node) maintains its state using the tuple σ(K) = (V * N), where σ(K) is the state of key K with value V and version number N.
To maintain a single version of truth for a data packet, Gossip uses a reconciliation mechanism, namely Precise or Scuttlebutt (the current one). According to Scuttlebutt reconciliation, before updating any tuple the nodes communicate with each other to check who is holding the highest version (newest value) of the key. Whoever holds the highest version is responsible for the write operation.
For further information read this article.

Getting different row count for column family each time

I have created a 4-node cluster and pointed my client at it. After some time I stopped pointing anything at the cluster, but the row count keeps varying: it is decreasing and increasing for all column families.
What could be the reason?
Counting number of rows in Cassandra is notoriously difficult (see my blog post on the issue).
It looks like your issue is consistency. The usual rules of consistency apply: if you require consistent reads, you need to make sure R + W > N (R = number of replicas required for a read, W = number required for a write, N = replication factor). A common way to do this is to read and write at CL.QUORUM; for example, with N = 3, QUORUM gives R = 2 and W = 2, and 2 + 2 > 3.
Note that counting rows is extremely expensive, since it reads through all your data. If this is a common operation you should find a different way of doing it, depending on your use case.

Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue)

We have a table set up in Cassandra like this:
Primary key columns:
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example of how this table is used:
shard | last_used        | value
------+------------------+------------------------
  457 | 5/16/2012 4:56pm | NBJO3poisdjdsa4djmka8k   <-- removed from the front...
  600 | 6/17/2013 5:58pm | dndiapas09eidjs9dkakah
  ... (1 million more rows)
  457 | NOW              | NBJO3poisdjdsa4djmka8k   <-- ...and put at the back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow, on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it, or the way we have it configured, that we need to change or adjust? How might we troubleshoot this?
More Info
We are using the Murmur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is that Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone, a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live; the dead ones are thrown away (but then there is also the GC grace period, which means that to avoid spurious resurrections of columns Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are made for constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I would still say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
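A small sketch of that query tweak with the Python driver; the keyspace and table name are assumptions based on the schema in the question:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace

    def pop_oldest(shard, not_before):
        # not_before is a rough lower bound on last_used (tracked per shard);
        # it lets the read start past the bulk of the tombstones
        return session.execute(
            "SELECT last_used, value FROM queue_table "
            "WHERE shard = %s AND last_used > %s "
            "ORDER BY last_used ASC LIMIT 1",
            (shard, not_before)).one()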
The system appears to be under stress (2 GB of RAM may not be enough).
Please run nodetool tpstats and report back with its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.
