How to speed up the node joining process in a Cassandra cluster

I have a cluster of 4 Cassandra nodes. I have recently added a new node, but data processing is taking too long. Is there a way to make this process faster? Output of nodetool:

Less data per node. Your screenshot shows 80TB per node, which is insanely high.
The recommendation is 1TB per node, 2TB at most. The logic behind this is that bootstrap times get too high (as you have noticed). A good Cassandra ring should be able to recover rapidly from node failure. What happens if other nodes fail while the first one is rebuilding?
Keep in mind that the typical model for Cassandra is lots of smaller nodes, in contrast to SQL where you would have a few really powerful servers. (Scale out vs scale up)
So, I would fix the problem by growing your cluster to 10x to 20x the current number of nodes.
https://groups.google.com/forum/m/#!topic/nosql-databases/FpcSJcN9Opw
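If growing the cluster is not immediately possible, the bootstrap itself can sometimes be sped up by loosening Cassandra's streaming and compaction throttles while the new node joins. A minimal sketch with nodetool (the numbers are illustrative, not recommendations; defaults depend on your version and cassandra.yaml):

    # watch what the joining node is currently streaming
    nodetool netstats

    # raise the streaming throttle on every node (megabits/s; 0 = unthrottled)
    nodetool setstreamthroughput 400

    # let compaction keep up with the incoming SSTables (MB/s; 0 = unthrottled)
    nodetool setcompactionthroughput 0

These settings only last until the next restart, when the values from cassandra.yaml take over again, so they are safe to experiment with during a single bootstrap.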

Related

How long will it take to remove one node from a Cassandra cluster?

I'm new to Cassandra. I would like to know how long it takes to remove a node from the cluster. What factors influence it, and is there a general formula?
Assume the number of nodes in the cluster makes no difference, for example, 4 nodes with 20 GB of workload.
If you remove one node, how long will Cassandra take to work through this process?
How long it takes to decommission a node from a Cassandra cluster depends on a lot of moving parts, which include (but are not limited to):
data density (size of data to stream to neighbouring nodes)
hardware capacity (CPU, RAM, disk type, etc)
how busy the cluster is (access patterns)
network IO bandwidth, capacity
data model
disk IO bandwidth, capacity
The quick answer is: it depends on your environment.
The only way to find out is to do your own tests on identical hardware, with identical infrastructure configuration, and with workloads as close to production as possible. Cheers!
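For what it's worth, the usual way to time this in a test cluster is simply to run the decommission and watch it with nodetool (a minimal sketch; nothing here is specific to any particular environment):

    # on the node being removed: stream its token ranges to the remaining nodes
    nodetool decommission

    # from any other node: watch the streaming progress while it runs
    nodetool netstats

    # afterwards: confirm the node has left the ring
    nodetool status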

Cassandra Read Timeouts on Specific Servers

We have a five-node Cassandra cluster with replication factor 3. We are experiencing a lot of read timeouts in our application. When we checked tpstats on each Cassandra node, we saw that three of the nodes have a lot of dropped read requests and high CPU utilisation, whereas on the other two nodes read request drops are zero and CPU utilisation is moderate. Note that the total number of read requests on all servers is almost the same.
After taking a thread dump, we found that the reason for the high CPU utilisation is that Parallel GC runs far more often on the three nodes than on the other two, which drives CPU utilisation up. What we are not able to understand is why GC should run more on three nodes and less on two, when the distribution of our partition keys and our queries is almost uniform.
Cassandra version is 2.2.3.
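For reference, the per-node checks described above (dropped reads and GC activity) can be reproduced with standard nodetool commands; comparing the output across all five nodes is a reasonable starting point (a quick sketch, not specific to this cluster):

    # thread pool statistics, including the "Dropped" section with READ counts
    nodetool tpstats

    # GC statistics accumulated since the previous invocation (available in 2.2.x)
    nodetool gcstats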

Does Spark incur the same amount of overhead as Hadoop for vnodes?

I just read https://stackoverflow.com/a/19974621/260805. Does Spark (specifically Datastax's Cassandra Spark connector) incur the same amount of overhead as Hadoop when reading from a Cassandra cluster? I know Spark uses threads more heavily than Hadoop does.
Performance with and without vnodes in the connector should be basically the same. With Hadoop, each vnode split generated its own task, which created a large amount of overhead.
With Spark, the token ranges from multiple vnodes are merged into a single task, so the overall task overhead is lower. There is a slight locality issue where it becomes difficult to get a balanced number of tasks across all the nodes in the C* cluster with smaller data sizes. This issue is being worked on in SPARKC-43.
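To make the merging concrete: in the connector the grouping of vnode token ranges into Spark partitions is driven by a split-size setting, so a job submitted roughly like the sketch below ends up with far fewer tasks than vnodes. The coordinates, host and split-size property shown are illustrative and vary by connector release:

    # hypothetical spark-shell invocation; host, version and split size are
    # placeholders for your own environment
    spark-shell \
      --packages com.datastax.spark:spark-cassandra-connector_2.11:1.6.0 \
      --conf spark.cassandra.connection.host=10.0.0.1 \
      --conf spark.cassandra.input.split.size_in_mb=64

A larger split size means more vnode ranges per Spark task and fewer tasks overall; a smaller one gives better balance at the cost of more per-task overhead.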
I'll give three separate answers. I apologize for the rather unstructured answer, but it's been building up over time:
A previous answer:
Here's one potential answer: Why not enable virtual node in an Hadoop node?. I quote:
Does this also apply to Spark?
No, if you're using the official DataStax spark-cassandra-connector. It can process multiple token ranges in a single Spark task. There is still some minor performance hit, but not as huge as with Hadoop.
A production benchmark
We ran a Spark job against a vnode-enabled Cassandra (DataStax Enterprise) datacenter with 3 nodes. The job took 9.7 hours. Running the same job on slightly less data, using 5 non-vnode nodes, a couple of weeks earlier took 8.8 hours.
A controlled benchmark
To further test the overhead we ran a controlled benchmark on a DataStax Enterprise node in a single-node cluster. For both vnodes enabled and disabled, the node was 1) reset, 2) X rows were written and then 3) SELECT COUNT(*) FROM emp in Shark was executed a couple of times to get cold vs. hot cache times. The values of X tested were 10^0 through 10^8.
Assuming that Shark does not deal with vnodes in any way, the average (quite stable) overhead for vnodes was ~28 seconds for cold Shark query executions and 17 seconds for hot executions. The latency difference generally did not vary with data size.
All the numbers for the benchmark can be found here. All scripts used to run the benchmark (see output.txt for usage) can be found here.
My only guess why there was a difference between "Cold diff" and "Hot diff" (see spreadsheet) is that it took Shark some time to create metadata, but this is simply speculation.
Conclusion
Our conclusion is that the overhead of vnodes is a constant time between 13 and 30 seconds, independent of data size.

Cassandra cluster slow reads

I'm doing some prototyping/benchmarking on Titan, a NoSQL graph database. Titan uses Cassandra as its back-end.
I've got one Titan-Cassandra VM running and two Cassandra VMs.
Each of them owns roughly 33% of the data (replication factor 1):
All the machines have 4GB of RAM and 4 i7 cores (shared).
I'm interested in all adjacent vertices, so I call Rexster (a REST API) with: http://192.168.33.10:8182/graphs/graph/vertices/35082496/both
These are the results (in seconds):
Note that for the two-node test, the setup was exactly the same as described above, except there was one Cassandra node fewer. The two nodes (Titan-Cassandra and Cassandra) each owned 50% of the data.
Titan is fastest with 1 node, and performance tends to degrade as more nodes are added. This is the opposite of what distribution should accomplish, so obviously I'm doing something wrong, right?
This is my Cassandra config:
Cassandra YAML: http://pastebin.com/ZrsRdtuD
Node 2 and node 3 have the exact same YAML file. The only difference is the listen_address (which is set to each node's IP).
How to improve this performance?
If you need any additional information, don't hesitate to reply.
Thanks
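As a side note for anyone reproducing this setup: the ownership percentages quoted above come from nodetool, and re-checking them per keyspace is a quick sanity check that the data really is spread the way you expect (the keyspace name is a placeholder):

    # ring membership and effective ownership for a given keyspace
    nodetool status my_keyspace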

Cassandra vnodes performance overhead and changing the number of vnodes

We have a test cluster of 4 nodes, and we've turned on vnodes. It seems that reads are somewhat slower than with the old method (initial_token). Is there some performance overhead from using vnodes? Do we have to increase/decrease the default num_tokens (256) if we only have 4 physical nodes?
Another scenario we would like to test is changing the num_tokens of the cluster on the fly. Is it possible, or do we have to recreate the whole cluster? If possible, how can we accomplish that?
We're using Cassandra 2.0.4.
It really depends on your application, but if you are running Spark queries on top of Cassandra, then a high number of vnodes can significantly slow down your queries, by 2x to 5x or more. This is because Spark cannot subdivide queries across vnodes, so each vnode results in one Spark partition, and a high number of partitions slows down low-latency queries.
The recommended number of vnodes is more like 16. With 16 tokens per node, a two-node cluster can in theory be split out to 32 nodes, which is more than enough of an expansion ratio for most folks.
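For reference, num_tokens lives in cassandra.yaml and is only honoured when a node first bootstraps, so a lower value has to be set before a new node joins rather than changed on a node that already has data. A minimal sketch, assuming a package-style install path (adjust to your layout):

    # check the current setting on an existing node
    grep "^num_tokens" /etc/cassandra/cassandra.yaml

    # on a brand-new node, set the value before its first start, e.g.
    # num_tokens: 16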
