How long will it take to remove one node from a Cassandra cluster? - cassandra

I'm new to Cassandra. I would like to know how long it takes to remove a node from the cluster. What factors influence it, and is there a general formula?
It doesn't matter how many nodes are in the cluster; for example, 4 nodes with 20 GB of workload.
If you remove one node, how long will Cassandra take to complete the process?

How long it takes to decommission a node from a Cassandra cluster depends on a lot of moving parts, which include (but are not limited to):
data density (size of data to stream to neighbouring nodes)
hardware capacity (CPU, RAM, disk type, etc)
how busy the cluster is (access patterns)
network IO bandwidth, capacity
data model
disk IO bandwidth, capacity
The quick answer is: it depends on your environment.
The only way to find out is to do your own tests on identical hardware, identical infrastructure configuration, and as close to production workloads as possible. Cheers!
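If you just want an order-of-magnitude figure, you can treat decommissioning as streaming the node's data to its neighbours and divide by whatever throughput your network and disks actually sustain. A minimal back-of-envelope sketch in Python, using made-up numbers that you would replace with your own measurements:

# Back-of-envelope estimate of decommission time: the node's data has to be
# streamed to the remaining nodes, so time is roughly size / throughput.
# All numbers below are hypothetical; measure your own with nodetool status
# and a network/disk benchmark.

def estimate_decommission_hours(data_gb, network_mb_s, disk_mb_s):
    """Streaming is bounded by the slower of network and disk throughput;
    this ignores compaction, cluster load and concurrent traffic."""
    effective_mb_s = min(network_mb_s, disk_mb_s)
    seconds = (data_gb * 1024) / effective_mb_s
    return seconds / 3600

# Example: ~20 GB on the node, 1 Gbps network (~110 MB/s), HDD at ~80 MB/s
print(f"~{estimate_decommission_hours(20, 110, 80):.2f} hours")  # ~0.07 h, i.e. a few minutes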

Related

In GCP Dataproc, what is the maximum number of worker nodes we can use in a cluster?

I am about to train on 5 million rows of data containing 7 categorical (string) variables, but will soon be training on 31 million rows.
I am wondering what the maximum number of worker nodes is that we can use in a cluster, because even if I type something like 2,000,000, it doesn't show any indication of an error.
Another question: what would be the best way to determine how many worker nodes are needed?
Thank you in advance!
Max cluster size
Dataproc does not limit the number of nodes in a cluster, but other software can have limitations. For example, there are known YARN deployments with 10k nodes, so going above that may not work for Spark on YARN, which Dataproc runs.
Also, you need to take into account GCE limitations like quotas (CPU, RAM, disk, external IPs, etc.) and QPS limits, and make sure you have enough of these for such a large cluster.
I think that 1k nodes is a reasonable size to start from for a large Dataproc cluster if you need one, and you can upscale it further to add more nodes as necessary after cluster creation.
Cluster size estimation
You should determine how many nodes you need based on your workload and the VM size that you want to use. For your use case, it seems that you need to find a guide on how to estimate cluster size for ML training.
Alternatively, you can just do a binary search until you are satisfied with the training time. For example, you can start from a 500-node cluster of 8-core machines; if training time is too long, increase the cluster size to 600-750 nodes and see if training time decreases as you expect. You can repeat this until you are satisfied with the training time, or until it no longer scales or improves.
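To make the "stops improving" criterion concrete, here is a toy Amdahl's-law-style model of how training time might respond to adding workers. The single-node time and parallel fraction are invented numbers purely for illustration; in practice you would plug in measured run times from your own jobs:

# Toy model: only the parallel fraction of the job speeds up as workers are
# added, so past some cluster size extra nodes barely reduce wall-clock time.
# single_node_hours and parallel_fraction are made-up illustration values.

def training_hours(nodes, single_node_hours=40.0, parallel_fraction=0.9):
    serial = single_node_hours * (1 - parallel_fraction)
    parallel = single_node_hours * parallel_fraction / nodes
    return serial + parallel

for nodes in (500, 625, 750, 1000):
    print(f"{nodes:>5} nodes -> ~{training_hours(nodes):.2f} h")
# Once these numbers stop moving, you have hit the "does not scale/improve
# anymore" point described above, and adding nodes only adds cost.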

How to speedup node joining process in cassandra cluster

I have a cluster of 4 Cassandra nodes. I have recently added a new node, but data processing is taking too long. Is there a way to make this process faster? Output of nodetool:
Less data per node. Your screenshot shows 80TB per node, which is insanely high.
The recommendation is 1 TB per node, 2 TB at most. The logic behind this is that bootstrap times get too high (as you have noticed). A good Cassandra ring should be able to recover rapidly from node failure. What happens if other nodes fail while the first one is rebuilding?
Keep in mind that the typical model for Cassandra is lots of smaller nodes, in contrast to SQL where you would have a few really powerful servers. (Scale out vs scale up)
So, I would fix the problem by growing your cluster to have 10X - 20X the number of nodes.
https://groups.google.com/forum/m/#!topic/nosql-databases/FpcSJcN9Opw
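As a rough illustration of why the 80 TB figure hurts so much: bootstrap time scales with the amount of data the joining node has to stream in. A small sketch using the density from the question; the 100 MB/s streaming rate is a made-up placeholder (watch nodetool netstats to see what your cluster actually sustains):

# Rough comparison of bootstrap times at different per-node densities.
# 80 TB comes from the question; the streaming rate is a hypothetical
# placeholder, not a benchmark result.

def bootstrap_hours(per_node_tb, stream_mb_s=100.0):
    """Time to stream one node's data at a sustained rate, ignoring
    compaction, validation and cluster load."""
    return per_node_tb * 1024 * 1024 / stream_mb_s / 3600

for tb in (80, 8, 2, 1):
    print(f"{tb:>3} TB per node -> ~{bootstrap_hours(tb):,.0f} h to bootstrap")
# 80 TB is on the order of ten days of continuous streaming at 100 MB/s,
# versus a few hours once the same data is spread across many smaller nodes.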

Hadoop/Spark : How replication factor and performance are related?

Setting aside all the other performance factors, such as disk space and NameNode objects: how can the replication factor improve the performance of MR, Tez and Spark?
If we have, for example, 5 datanodes, is it better for the execution engine to set the replication to 5? What are the best and worst values?
How can this be good for aggregations, joins, and map-only jobs?
One of the major tenets of Hadoop is moving the computation to the data.
If you set the replication factor approximately equal to the number of datanodes, you're guaranteed that every machine will be able to process that data.
However, as you mention, NameNode overhead is very important, and more files or replicas cause slow requests. More replicas can also saturate your network in an unhealthy cluster. I've never seen anything higher than 5, and that's only for the most critical data of the company. Anything else, they left at 2 replicas.
The execution engine doesn't matter too much, other than Tez/Spark outperforming MR in most cases; what matters more is the size of your files and the format they are stored in - that will be a major driver of execution performance.
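If you do want one specific hot dataset (say, a small table used in many joins) replicated more widely, one option is to raise the HDFS client-side replication for the job that writes it. A minimal PySpark sketch, assuming an HDFS-backed cluster; the paths and the factor of 5 are hypothetical, and as noted above 2-3 replicas is the more common choice:

from pyspark.sql import SparkSession

# spark.hadoop.* properties are forwarded into the Hadoop configuration,
# so this sets the default replication for files written by this job.
spark = (SparkSession.builder
         .appName("replication-demo")
         .config("spark.hadoop.dfs.replication", "5")   # hypothetical choice
         .getOrCreate())

# Hypothetical paths: rewrite a small, frequently-joined table with more
# replicas so more datanodes can serve it with data-local tasks.
dim = spark.read.parquet("hdfs:///data/dim_table")
dim.write.mode("overwrite").parquet("hdfs:///data/dim_table_hot")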

How dataproc works with google cloud storage?

I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc. Data is read from and written to GCS, but I am unable to figure out the best machine types for my use case. Questions:
1) Does Spark on Dataproc copy data to local disk? E.g., if I am processing 2 TB of data, is it OK to use 4 nodes with 200 GB HDD each, or should I at least provide disks that can hold the input data?
2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?
3) If local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I mean, is it good to use SSD?
Thanks
Manish
Spark will read data directly into memory and/or onto disk depending on whether you use RDD or DataFrame. You should have at least enough disk to hold all the data. If you are performing joins, then the amount of disk necessary grows to handle shuffle spill.
This equation changes if you discard a significant amount of data through filtering.
Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and if your application is CPU or IO bound.
Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.
Same advice goes for network IO: more CPUs = more bandwidth.
Finally, default Dataproc settings are a reasonable place to start experimenting and tweaking your settings.
Source: https://cloud.google.com/compute/docs/disks/performance
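For context, the typical Dataproc pattern looks roughly like the sketch below: input and output go straight to GCS via the connector, while shuffle spill from joins and aggregations is what actually lands on the workers' local disks. The bucket paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-job").getOrCreate()

# Input is read directly from GCS; there is no separate "copy to HDFS" step.
df = spark.read.parquet("gs://my-bucket/input/")        # hypothetical bucket

# A join/aggregation like this shuffles data; the shuffle spill lands on the
# workers' local persistent disks or SSDs, which is why the answer above asks
# for enough local disk to hold the data plus any spill.
result = (df.join(df.selectExpr("id", "value AS v2"), "id")
            .groupBy("id")
            .count())

result.write.mode("overwrite").parquet("gs://my-bucket/output/")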

Cassandra cluster slow reads

I'm doing some prototyping/benchmarking on Titan, a NoSQL graph database. Titan uses Cassandra as its back-end.
I've got one Titan-Cassandra VM running and two Cassandra VMs.
Each of them owns roughly 33% of the data (replication factor 1):
All the machines have 4GB of RAM and 4 i7 cores (shared).
I'm interested in all adjacent nodes, so I call Rexster (a REST API) with: http://192.168.33.10:8182/graphs/graph/vertices/35082496/both
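(For reference, this is roughly how I time the call from Python; the host, port and vertex id are the ones in the URL above.)

import time
import requests

url = "http://192.168.33.10:8182/graphs/graph/vertices/35082496/both"

start = time.time()
resp = requests.get(url)
resp.raise_for_status()
neighbours = resp.json()          # Rexster returns the adjacent vertices as JSON
print(f"fetched in {time.time() - start:.2f} s")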
These are the results (in seconds):
Note that for the two-node test, the setup was exactly the same as described above, except with one Cassandra node less. The two nodes (titan-cassandra and Cassandra) each owned 50% of the data.
Titan is fastest with 1 node, and performance tends to degrade as more nodes are added. This is the opposite of what distribution should accomplish, so obviously I'm doing something wrong, right?
This is my Cassandra config:
Cassandra YAML: http://pastebin.com/ZrsRdtuD
Node 2 and node 3 have the exact same YAML file. The only difference is the listen_address (this is equal to the node's IP)
How to improve this performance?
If you need any additional information, don't hesitate to reply.
Thanks
