Cassandra performance when querying fewer or more nodes

Consider a growing volume of data and two extreme choices:
1. Distribute all data evenly across all nodes in the cluster.
2. Pack the data onto as few nodes as possible.
I prefer option 1 because as the volume of data grows, we can scatter it across all nodes, so that each node carries the lowest possible load when it is queried.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter and gather? They even claim this hurts linear scalability as adding more nodes will further drag down the query.
(Maybe I am missing something about how Cassandra performs the query; any background reference is appreciated.)
On the other hand, some resources state that we should go with option 2 because it queries the fewest nodes.
Of course there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference between option 1 and option 2 is, and, regarding network querying, why option 1 would be slow.

"I prefer option 1 because as the volume of data grows, we can scatter it across all nodes, so that each node carries the lowest possible load when it is queried."
You definitely want to go with option #1. It is also preferable because new or replacement nodes will stream much faster than in a cluster made of fewer, denser nodes.
"However, some resources state that we shouldn't query all the nodes because that will slow down the query."
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
"Why would that slow the query?"
Because in a distributed database environment, query time becomes network time. Many people like to run multi-key or unbound queries against Cassandra. When that happens, and the query cannot be served by a single node holding the data, Cassandra designates one node as a "coordinator."
That node builds the result set with data from the other nodes, which means that in a 30-node cluster, that one node is now pulling data from the other 29. Even assuming these requests don't time out, the likelihood that the coordinator will crash from trying to manage too much data is very high.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together that is queried together, and Cassandra will perform just fine.
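To make the difference concrete, here is a minimal Python sketch of token-based placement. It is an illustration under simplifying assumptions, not Cassandra's actual implementation: one token per node (no vnodes), MD5 as a stand-in hash, and replicas chosen by walking the ring clockwise. The Ring class and its method names are hypothetical and do not belong to any Cassandra driver.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    """Hash a key to a ring position (MD5 is only a stand-in for the real partitioner)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical, simplified token ring: one token per node, no vnodes."""

    def __init__(self, node_names, rf=3):
        self.rf = rf
        self.entries = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Find the first node whose token is >= the key's token, then take rf nodes clockwise."""
        start = bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(self.rf)]

    def nodes_for_unbound_query(self):
        """Simplification: with no partition key, treat every node as contacted,
        as in the 30-node example above."""
        return [name for _, name in self.entries]


ring = Ring([f"node{i}" for i in range(30)], rf=3)
print(ring.replicas("customer:42"))          # the 3 nodes that own this partition
print(len(ring.nodes_for_unbound_query()))   # every node in the cluster
```

A query that names its partition key only ever needs one of the three replicas, no matter how many nodes the cluster has; the unbound query grows with the cluster.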


Select All Performance in Cassandra

I'm currently using DB2 and planning to move to Cassandra because, as far as I know, Cassandra has better read performance than an RDBMS.
Maybe this is a stupid question, but I ran an experiment comparing read performance between DB2 and Cassandra.
I tested with 5 million records and the same table schema.
With the query SELECT * FROM customer, DB2 takes 25-30s and Cassandra takes 40-50s.
But with a WHERE condition, SELECT * FROM customer WHERE cusId IN (100,200,300,400,500), DB2 takes 2-3s and Cassandra takes 3-5ms.
Why is Cassandra faster than DB2 with the WHERE condition? And does that mean I can't judge which database is better from SELECT * FROM customer alone?
Cassandra: RF=3 and CL=1, with 3 nodes running on 3 computers (Ubuntu VMs)
DB2: running on Windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed database. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (i.e., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (a.k.a. multi-key queries) incur extra network time because a coordinator node is required. One node acts as the coordinator, queries all other nodes, collates the data, and returns the result set. Specifying a WHERE clause (with at least a partition key) while using a "Token Aware" load balancing policy performs well for two reasons:
1. A separate coordinator node is not required.
2. The node primarily responsible for the range is queried directly, returning the result set in a single network hop.
An unbound query makes Cassandra incur a lot of extra processing and network time that it would not need had the query been specified with a WHERE clause.
Even for a troublesome query like an unconditioned range query, 40-50s is pretty extreme for C*. Is the coordinator hitting GC pauses during the coordination? Can you include the code used for your test?
When you run a select * against millions of records, the driver won't fetch them all at once; it grabs fetchSize rows at a time. If you're just iterating through the results, the iterator will block even if you used executeAsync initially. This means that after every page (the fetch size, 5000 rows by default in the DataStax Java driver) it issues a new query that you block on. The serialized nature of this takes time just from a network perspective. The linked page explains how to do it in a non-blocking way. You can use this to kick off the next page fetch while processing the current one, which would help.
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges one at a time (parallelism is possible here, but its heuristic is not perfect) until it has read enough. If it has to walk too many nodes to respond, it will be slow. This is why a select * on an empty table can be very slow: it may serially walk every replica set. With 256 vnodes this can be very bad.
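The effect of overlapping the next page fetch with processing can be sketched without a cluster. The Python below is only a simulation: fetch_page and process are hypothetical stand-ins that sleep to imitate a network round trip and client-side work, and the offset-based paging is a simplification (real drivers page with an opaque paging state).

```python
import time
from concurrent.futures import ThreadPoolExecutor

FETCH_SIZE = 5000    # rows per page (assumed value)
TOTAL_ROWS = 20000   # size of the pretend table
LATENCY = 0.01       # simulated seconds per round trip and per page of processing


def fetch_page(offset):
    """Stand-in for one paged request to the coordinator."""
    time.sleep(LATENCY)
    return list(range(offset, min(offset + FETCH_SIZE, TOTAL_ROWS)))


def process(rows):
    """Stand-in for client-side work on one page."""
    time.sleep(LATENCY)
    return len(rows)


def read_blocking():
    """Fetch a page, process it, then fetch the next: strictly serial."""
    total = 0
    for offset in range(0, TOTAL_ROWS, FETCH_SIZE):
        total += process(fetch_page(offset))
    return total


def read_with_prefetch():
    """Start fetching the next page before processing the current one."""
    offsets = list(range(0, TOTAL_ROWS, FETCH_SIZE))
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_page, offsets[0])
        for next_offset in offsets[1:] + [None]:
            rows = pending.result()
            if next_offset is not None:
                pending = pool.submit(fetch_page, next_offset)
            total += process(rows)  # overlaps with the fetch in flight
    return total


if __name__ == "__main__":
    for reader in (read_blocking, read_with_prefetch):
        started = time.perf_counter()
        rows = reader()
        print(reader.__name__, rows, f"{time.perf_counter() - started:.3f}s")
```

Both readers return the same row count; the prefetching one should spend roughly half as long waiting in this simulation because each fetch runs while the previous page is being processed.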

DCE Cassandra 3.9 slow secondary index creation during joining existing cluster

We have a Cassandra cluster with 32 nodes; the average node size is about 1 TB. Node configuration: 1x Intel Xeon E3-1271v3, 32 GB RAM, 2x 3 TB HDD.
We have one DB with some small tables and one big table that holds about 90-95% of the total cluster size.
I am trying to add nodes to this cluster, but found that a single node takes about 13-14 days to join. Building secondary indexes takes most of this time, and for the whole period the compactor threads use all available CPU.
I have changed the Cassandra config to raise the limits:
concurrent_compactors: 4
compaction_throughput_mb_per_sec: 0
Cassandra full config
About a year ago we also added nodes to this cluster, extending it from 16 to 32 nodes; the average node size was 1 TB before the extension. The Cassandra version was 2.1, and joining one node took 1-1.5 days.
So the question is: how can we speed up this process? Did we miss something?
This one is a bit longer so I can't put it into comment ... sorry.
I know that this sounds a bit strange, especially for a later stage of
your project, but with indexes the situation won't get
any better over time. I would strongly recommend that you start making your own
tables instead of just putting indexes on the following columns. Depending on
how often the data is accessed, you can use "inverted indexes".
CREATE INDEX links_by_author_url_idx ON keyspace.links_by_author (url);
CREATE INDEX docs_url_idx ON (url);
CREATE INDEX om_master_object_id_idx ON (master_object_id);
CREATE INDEX actions_pday_idx ON keyspace.actions (pday);
CREATE INDEX authors_yauid_idx ON keyspace.authors (yauid);
CREATE INDEX authors_login_lr_idx ON keyspace.authors (login_lr);
CREATE INDEX authors_login_idx ON keyspace.authors (login);
CREATE INDEX authors_email_idx ON keyspace.authors (email);
CREATE INDEX authors_name_idx ON keyspace.authors (name);
Basically, every index that you have here enables you to "search" over base
entities to find them by some condition. Most of the conditions are
actually pretty narrow, which is good news. But the indexes
will become massive (they already have), especially on docs and authors, and I guess
docs is the more problematic one.
You should consider making separate tables for this. Every index that
you create will be there on every node in the cluster, and in the end
you will hold far more data than you really need to, because under
the hood the data is multiplied per node. When you add the replication factor to this,
the system is using a lot of space without you even being aware of it.
The problem with joining nodes is that when they receive new data, all
the index data needs to be rebuilt ... for every single node
in the cluster, and this is costing you a lot of time. So basically you lose
all the benefits of "easy node joining" that Cassandra has.
Now you might think that space will become a problem when you write the data
into your new, denormalized schema ...
If space is the problem, you can use a technique called inverted indexes,
where you put only the id of the row into the search table and
then do a second read from the main table. I used this on a project
where space was an issue, but since you already have all the main columns indexed,
space will probably not be a problem because you are already using a lot
more than you think. (My bet is that you will also save significantly on space.)
Anyway, all the indexes should become tables ... if consistency is a problem,
use batches (don't use materialized views yet because you might lose data).
My honest tip is that you stay away from indexes. I know
it's hell to refactor this, plus it's hard to find the time to refactor :( But
I think it should be manageable.
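As a rough illustration of the inverted-index idea, here is a small in-memory Python model. The dictionaries stand in for two Cassandra tables, and the names (authors, authors_by_email, insert_author, find_by_email) are hypothetical; in a real cluster the two writes would go to two tables, possibly in one batch as suggested above.

```python
# Main table: author id (partition key) -> full row
authors = {}
# Search table: email (partition key) -> set of author ids only
authors_by_email = {}


def insert_author(author_id, name, email):
    """Write the full row once and only the id into the search table."""
    authors[author_id] = {"id": author_id, "name": name, "email": email}
    authors_by_email.setdefault(email, set()).add(author_id)


def find_by_email(email):
    """First read: ids from the search table. Second read: rows from the main table."""
    ids = authors_by_email.get(email, set())
    return [authors[author_id] for author_id in sorted(ids)]


insert_author(1, "Ann", "ann@example.com")
insert_author(2, "Bob", "bob@example.com")
print(find_by_email("ann@example.com"))
```

The search table stays small because it holds ids rather than copies of the rows, at the cost of a second read per lookup.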

Query in DataStax

I have 2 questions related to DataStax queries:
1. I have installed DataStax Enterprise 4.6 on 3 nodes with exactly the same configuration (CPU, RAM, storage, etc.). I then created a keyspace with RF=3, created a CF within the keyspace, and inserted about 10 million rows into it. When I log in to Node1 and execute a count query, it returns about 1.5 million in about 1 min 15 s. But when I log in to Node2 and execute the exact same query, it takes about 1 min 35 s. Similarly, on Node3 it takes about 1 min 20 s. Why is there a difference in the query execution times on the 3 nodes?
2. I shut down DSE (service dse stop) on Node2 and Node3 and ran the query on Node1. Since all required data is available on Node1, it ran successfully and took 1 min 15 s. I then brought DSE up on Node2 and ran the query again. With tracing on, I see that data is being fetched from Node2 as well, but the query takes longer than 1 min 15 s. Should it not take less, since 2 nodes are being used? Similarly, when Node3 is also brought up and the query is executed, it takes more time than when 2 nodes are up. My understanding is that Cassandra/DataStax is linearly scalable.
Any help or pointers would be much appreciated.
Sounds like normal behavior to me. There is always some overhead when multiple nodes are coordinating and interacting with each other, and things are not necessarily going to behave in a perfectly symmetric way.
Even if all the data is local, there's still some interaction with the other nodes going on, and some of that will be non-deterministic in time. You have network latencies that vary, different queueing orders, variable seek times on disks, etc.
When you take two of the nodes down, the remaining node knows that they are down and so it doesn't bother trying to do any reads or interactions with them. That's why that scenario is the fastest. As you bring the other nodes back online, the extra coordination with them will slow things down a little. That's the price you pay for redundancy.
The performance scales by not keeping a copy of the data on every node. You are using RF=3 and only have three nodes. If you added a fourth node, then not all the data would be on every node. Now you have added capacity since not every write goes to all nodes and different writes will hit a different set of machines.
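The arithmetic behind this is short. Assuming data is spread evenly over the ring, each node stores roughly RF/N of the full data set; the helper name replica_share below is made up for this sketch.

```python
def replica_share(num_nodes, rf):
    """Fraction of the full data set each node stores, assuming even distribution."""
    return min(rf, num_nodes) / num_nodes


for n in (3, 4, 6, 12):
    print(f"{n} nodes, RF=3: each node holds {replica_share(n, 3):.0%} of the data")
# 3 nodes -> 100%, 4 nodes -> 75%, 6 nodes -> 50%, 12 nodes -> 25%
```

With three nodes and RF=3 every node holds everything, so adding the second and third node adds redundancy, not capacity; capacity only starts to grow from the fourth node on.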
Your question is simple to answer. It is a matter of Consistency: You can tune your select queries with a Consistency of One, then C* does not need to check if your data (RF=3) across all the nodes matches up.
In most use cases a Consistency of One for reads should be sufficient.
As for the time differences: the machines are involved in many things besides serving queries, so it is normal behaviour to see different response times per node. There is a similar question/answer here: How do I set the consistency level of an individual CQL query in CQL3?
Basically go and play with consistency and see how response times change.

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seems to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and can help with a few pros and cons.
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one replica has to return the data or acknowledge the change. This means fast reads/writes but low consistency (another machine can deliver older information while consistency has not yet been achieved).
QUORUM - A majority of the replicas (floor(RF/2) + 1, i.e. 2 of 3 with RF=3) must return the data or acknowledge the change. This means reads and writes are not as fast, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of the replicas have your data after an insert/update/delete, then a read from more than half of the replicas includes at least one that has the most recent information, which is the one delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept and one node was down or busy, you wouldn't be able to query.
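The overlap argument can be checked with a few lines of Python. This is a sketch of the usual rule of thumb (a read is guaranteed to see the latest acknowledged write when read replicas + write replicas > RF); the function names are chosen for this example only.

```python
def quorum(rf):
    """Number of replicas that make up a majority for a given replication factor."""
    return rf // 2 + 1


def read_sees_latest_write(rf, write_replicas, read_replicas):
    """True when every read set must overlap every write set in at least one replica."""
    return write_replicas + read_replicas > rf


rf = 3
print(quorum(rf))                                           # 2
print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))   # True: QUORUM writes + QUORUM reads
print(read_sees_latest_write(rf, 1, 1))                     # False: ONE writes + ONE reads
```

With RF=3, QUORUM on both sides means 2 + 2 > 3, so at least one replica in every read has the newest value, while ONE on both sides gives 1 + 1, which leaves room for a stale read.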
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (machines can be down and the others will take over the work). It does most of its management work automatically; you don't need to manage the data distribution, replication, etc.
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized views can help with this by avoiding the need to apply each change to multiple tables, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.

What does it mean when we say cassandra is scalable?

I have created a two-node Cassandra cluster and am trying to run a load test. I find that one node versus two nodes makes little difference in throughput. I had assumed that if one node can give me 2000 tps for inserts, two nodes should double that. Does it work like that?
If not, what does scaling actually mean, and how does it relate to latency or throughput?
Cassandra is scalable. Your case is just a bit simplified, since two nodes is not really a high-scalability scenario. You should be aware of the token partitioning algorithm used by Cassandra; once you understand it, there should not be any questions. There are plenty of presentations about that, e.g. this one:
In case of replication factor 1 everything is simple:
Each key-value pair you save to or read from Cassandra is a query to one of the nodes in the cluster. Data is evenly distributed among nodes (see the details of the partitioning algorithm), so the total load is always evenly distributed among all nodes: the more nodes you have, the more load they can carry (and this is linear). The system must of course be configured correctly to avoid network bottlenecks of various kinds.
In case of replication factor more than 1 the situation is a bit more complicated, however the principle is the same.
There are a lot of factors that contribute to this result.
A) Check your replication factor. Although not desirable in general, for your test you can set it to 1.
B) Look at the partition (shard) key in your primary key. If your tests do not vary it, you are loading skewed data and the table is not scaling out to 2 nodes.
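Point B can be illustrated with a small Python simulation. It assumes a replication factor of 1 and uses hash-modulo placement as a simplified stand-in for Cassandra's token ranges; owner and load_per_node are hypothetical helper names.

```python
import hashlib
from collections import Counter


def owner(partition_key, num_nodes):
    """Simplified placement: hash the partition key and take it modulo the node count."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def load_per_node(keys, num_nodes):
    """Count how many writes land on each node."""
    return Counter(owner(key, num_nodes) for key in keys)


# A load test that varies the partition key spreads writes over both nodes.
varied = load_per_node((f"user-{i}" for i in range(10_000)), 2)
# A load test that reuses one partition key sends every write to the same node.
skewed = load_per_node(("user-1" for _ in range(10_000)), 2)

print(dict(varied))
print(dict(skewed))
```

If the test keeps writing to the same partition, the second node never receives any of that load, so throughput cannot double however the cluster is configured.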
"What does it mean when we say Cassandra is scalable?"
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing-out their available resources, you can simply add another node(s), adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know what exactly is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference by adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra
