Getting different row count for column family each time - cassandra

I have created a 4-node cluster and pointed my client at it. After some time, I stopped pointing the client at the cluster. But the row count keeps varying; it decreases and increases for all column families.
What could be the reason?

Counting the number of rows in Cassandra is notoriously difficult (see my blog post on the issue).
It looks like your issue is consistency. The usual rules of consistency apply: if you require consistent reads, you need to make sure R + W > N (R = number of replicas that must respond to a read, W = number that must acknowledge a write, N = the replication factor). A common way to do this is to read and write at CL.QUORUM.
Note that counting rows is extremely expensive, since it reads through all your data. If this is a common operation you should find a different way of doing it, depending on your use case.
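For instance, with the DataStax Python driver you can set the consistency level per statement; the contact point, keyspace, table, and column names below are just placeholders:
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Contact point, keyspace and table are placeholders.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# Write at QUORUM...
insert = SimpleStatement(
    "INSERT INTO my_table (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, (42, 'hello'))

# ...and read at QUORUM, so R + W > N (N = replication factor).
select = SimpleStatement(
    "SELECT COUNT(*) FROM my_table",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(select).one())
With a replication factor of 3, QUORUM means 2 replicas must respond for both reads and writes, so R + W = 4 > 3 and reads are guaranteed to see the latest acknowledged write.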

Related

Align Dataset partitioning to table partitioning scheme

I am writing to a table partitioned by month. I know that my data is ≈100 MB per partition with no skew, so it will fit within a single HDFS block, and I want to ensure that every partition gets written as a single file. I also know the exact number of months in my dataset (somewhere between 1 and 10), therefore:
ds.repartition(nMonths, $"month").write.<options>.insertInto(<...>)
This works. However, thinking it through: since Spark uses the key's hash to determine the partition, I have no guarantee that every partition will receive only a single month's data. The more partitions I have, the less likely such a collision actually is - right?
Does it make sense, then, to increase the number of partitions above the number of distinct keys?
ds.repartition(nMonths * 3, $"month").write.<options>.insertInto(<...>)
Lots of partitions will be empty, but this shouldn't be that much of a pain (should it?), and we're reducing the probability that some unlucky partitions get 3x/4x the data and increase overall execution time. Does this make sense? Is there any rule of thumb regarding the factor? Or any other approach to achieve the same?
If you want to be super-safe you can use range partitioning, something like:
ds.repartitionByRange(nMonths, $"month").write...
This way you also won't have empty partitions, which in turn means you won't produce zero-size files in HDFS either.

Ordered Partitioning vs Random Partitioning

According to most articles on the internet, Random Partitioning (RP) is better than Ordered Partitioning (OP) because of data distribution.
In fact, I think that because of data replication, even if we are using OP the data will be well distributed! So is the first assumption still true?
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
Thanks a lot.
I can't really answer confidently for HBase (which only supports ordered partitioning to my knowledge), but for Cassandra I would strongly discourage the use of OrderPreservingPartitioner and ByteOrderedPartitioner unless you have a very specific use case that requires it (like needing to do range scans across keys). It is not very common for an ordered partitioner to be used.
In fact, I think that because of data replication, even if we are using OP the data will be well distributed! So is the first assumption still true?
Not particularly; it is much more likely for hotspots to be encountered with an ordered partitioner than with a random partitioner. As described on the Partitioners page of the Cassandra wiki:
globally ordering all your partitions generates hot spots: some partitions close together will get more activity than others, and the node hosting those will be overloaded relative to others. You can try to mitigate with active load balancing but this works poorly in practice; by the time you can adjust token assignments so that less hot partitions are on the overloaded node, your workload often changes enough that the hot spot is now elsewhere. Remember that preserving global order means you can't just pick and choose hot partitions to relocate, you have to relocate contiguous ranges.
There are other problems with Ordered Partitioning that are described well here:
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges based on their estimates of the partition key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
With regards to:
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
With an ordered partitioner you will definitely achieve better performance for range scans (i.e. fetching all data between this key and that key).
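To make the trade-off concrete: with RandomPartitioner/Murmur3Partitioner the only cross-key range predicate CQL allows is on token(key), which follows hash order rather than key order. A rough sketch with the Python driver (keyspace, table, and column names are made up):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# With an ordered partitioner, "all keys between A and B" is a contiguous
# scan. With a random partitioner you can only range over token(key),
# i.e. hash order, which is mostly useful for paging over ALL the data:
rows = session.execute(
    "SELECT key, value FROM my_table "
    "WHERE token(key) >= token(%s) AND token(key) <= token(%s)",
    ('key_a', 'key_b'))
for row in rows:
    print(row.key, row.value)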
So it really comes down to the kind of queries you are making. Are range scan queries between keys vital to you? In that case HBase may be a more appropriate solution for you. If it is not as important, there are reasons to consider C* instead. I won't add much more to that as I don't want my answer to devolve into comparing the two solutions :).

Cassandra - Distributing data and Multiple tables (Data modeling)

I am trying to learn Cassandra. One thing I am not clear on is how to ask Cassandra to distribute various tables. Say I have time-series data coming into tables T1, T2, T3.
T1 is heavily loaded (the ratio of row counts is 2000:2:4).
I want T1's data for a given day not to be on the same machine as T2's or T3's, so my queries are distributed evenly and do not put too much load on one machine.
Also, as the data gets older it is queried less; how can I take this factor into account?
regards
Cassandra distributes data automatically; you do not have direct control over how the data gets distributed. By default it takes an MD5 hash of the row key and, based on the resulting token, selects which nodes (machines) will store the data.
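Roughly speaking, that hashing looks like this (a simplified sketch, not the exact token calculation Cassandra performs internally):
import hashlib

def md5_token(row_key: bytes) -> int:
    # RandomPartitioner derives a token from the MD5 hash of the row key;
    # this is only an approximation for illustration.
    return int(hashlib.md5(row_key).hexdigest(), 16) % (2 ** 127)

# Keys that look "close together" get wildly different tokens,
# so they usually land on different nodes.
print(md5_token(b"t1:2013-05-01"))
print(md5_token(b"t1:2013-05-02"))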
What you are talking about would be more like capacity planning for a standard SQL database. However, if you generate an extremely large amount of statistical data that is only used by some backend processes and users, you could put it in a separate cluster of 2 or 3 nodes. That way your other tables would not be affected by those statistics.
However, the true power of Cassandra is to be used with one large cluster. If it slows down, add nodes to it and do the necessary repair to spread the data properly. That's it... pretty much.
As for the way a table is used, you can use the parameters defined on a table to tweak its setup. If you mainly write to a table, you can tune the parameters for faster writes at the cost of slower reads. The other way around is also available: write once, read many. And so is many writes with many reads. To tune those settings, in most cases you will need to run your software, gather various stats, and make changes as time passes.
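As one concrete example of such a tweak (keyspace and table names are made up): leveled compaction generally favours read-heavy tables, while the default size-tiered compaction is cheaper for write-heavy ones. With the Python driver you could apply it like this:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Leveled compaction trades extra compaction work at write time for
# more predictable reads; size-tiered (the default) favours writes.
session.execute("""
    ALTER TABLE my_keyspace.t1
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")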
Update:
Thinking about it, there actually is a way to influence placement; I just never use that mode, so it did not come to mind at first.
When you use a cluster configured with an order-preserving partitioner (sorted row keys), you can choose specific row keys and the data will then go to a specific node. Again, you do not directly control what goes where, but if you really, really want to do it that way, that is probably the approach you are looking for.
In this case, the row key would start with a prefix such as 0x0001 for T1 data, and 0x0100 and 0x0200 for T2 and T3. Since you do not know exactly where the data goes and how Cassandra decides to place it, it is rather complicated to get the right results. And if you change your cluster (i.e. add nodes), all your assumptions about where the data lives may very well go down the toilet! (And that is not even speaking of upgrading to a new version of Cassandra...)

Cassandra distinct counting

I need to count a bunch of "things" in Cassandra.
I need to increase ~100-200 counters every few seconds or so.
However, I need to count distinct "things".
In order not to count something twice, I set a key in a column family, which the program reads before increasing the counter, e.g. something like:
result = get cf[key];
if (result == NULL) {
    set cf[key][x] = 1;
    incr counter_cf[key][x];
}
However this read operation slows down the cluster a lot.
I tried to decrease the number of reads by batching several columns into one read, e.g. something like:
result = get cf[key];
if (result[key1] == NULL) {
    set cf[key1][x] = 1;
    incr counter_cf[key1][x];
}
if (result[key2] == NULL) {
    set cf[key2][x] = 1;
    incr counter_cf[key2][x];
}
// etc....
Then I reduced the reads from 200+ to about 5-6, but it still slows down the cluster.
I do not need exact counts, but I cannot use bit masks or Bloom filters,
because there will be 1M+ counters and some could exceed 4,000,000,000.
I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+) either.
At the moment I am thinking of using Tokyo Cabinet as an external key/value store,
but that solution, if it works, will not be as scalable as Cassandra.
Using Cassandra for distinct counting is not ideal when the number of distinct values is big. Any time you need to do a read before a write you should ask yourself if Cassandra is the right choice.
If the number of distinct items is smaller you can just store them as column keys and do a count. A count is not free; Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be OK. I assume you've already considered this option and that it's not feasible for you; I just thought I'd mention it.
The way people typically do it is to keep the HLLs or Bloom filters in memory and then flush them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.
Even if you switched to something else, for example to something where you can do bit operations on values, you still need to guard against race conditions. I suggest that you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do. When a node gets an increment operation for a key it does not have in memory it loads that state from Cassandra (or creates a new state if there's nothing in the database). If a node crashes the operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations you can be sure that a counter state is only ever touched by one node.
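A minimal sketch of that pattern, with a plain Python set standing in for an HLL and a placeholder flush function (the real thing would write to a counter column family and ack the queue messages only after the write succeeds):
import time
from collections import defaultdict

# Per-key state held in memory on the node that owns this shard of keys.
# A plain set stands in for an HLL here; swap in a real HyperLogLog
# implementation to bound memory for high-cardinality keys.
seen = defaultdict(set)        # key -> distinct items observed
counts = defaultdict(int)      # key -> distinct count since last flush
FLUSH_INTERVAL = 10            # seconds (made-up value)
last_flush = time.time()

def flush_to_cassandra(pending):
    # Placeholder: increment the counter column family here and ack the
    # consumed increment messages only after the write succeeds.
    pass

def handle_increment(key, item):
    global last_flush
    if item not in seen[key]:
        seen[key].add(item)
        counts[key] += 1
    if time.time() - last_flush >= FLUSH_INTERVAL:
        flush_to_cassandra(dict(counts))
        counts.clear()
        last_flush = time.time()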

Cassandra multiget performance

I've got a Cassandra cluster with a fairly small number of rows (2 million or so, which I would hope is "small" for Cassandra). Each row is keyed on a unique UUID, and each row has about 200 columns (give or take a few). All in all these are pretty small rows with no binary data or large amounts of text - just short strings.
I've just finished the initial import into the Cassandra cluster from our old database. I've tuned the hell out of Cassandra on each machine. There were hundreds of millions of writes, but no reads. Now that it's time to USE this thing, I'm finding that read speeds are absolutely dismal. I'm doing a multiget using pycassa on anywhere from 500 to 10000 rows at a time. Even at 500 rows, the performance is awful, sometimes taking 30+ seconds.
What would cause this type of behavior? What sort of things would you recommend after a large import like this? Thanks.
Sounds like you are I/O-bottlenecked. Cassandra does about 4000 reads/s per core, IF your data fits in RAM. Otherwise you will be seek-bound just like anything else.
I note that normally "tuning the hell" out of a system is reserved for AFTER you start putting load on it. :)
See:
http://spyced.blogspot.com/2010/01/linux-performance-basics.html
http://www.datastax.com/docs/0.7/operations/cache_tuning
Is it an option to split up the multi-get into smaller chunks? By doing this you would be able to spread your get across multiple nodes, and potentially increase your performance, both by spreading the load across nodes and having smaller packets to deserialize.
That brings me to the next question, what is your read consistency set to? In addition to an IO bottleneck as #jbellis mentioned, you could also have a network traffic issue if you are requiring a particularly high level of consistency.
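A small pycassa sketch of the chunking idea (pool, keyspace, and column family names are placeholders):
import pycassa

pool = pycassa.ConnectionPool('my_keyspace', ['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'my_cf')

def chunked_multiget(cf, keys, chunk_size=500):
    # Issue several small multigets instead of one huge one: each request
    # touches fewer rows, spreads load across coordinator nodes, and gives
    # the client smaller responses to deserialize.
    results = {}
    for i in range(0, len(keys), chunk_size):
        results.update(cf.multiget(keys[i:i + chunk_size]))
    return results
You could also issue the chunks from several threads or processes so that different coordinator nodes handle them in parallel.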
