What should I think about when increasing disk size on Cassandra nodes?

I run a 10-node Cassandra cluster in production: 99% writes, 1% reads, 0% deletes. The nodes have 32 GB RAM; C* runs with an 8 GB heap. Each node has an SSD for the commitlog and 2x4 TB spinning disks for data (SSTables). The schema uses key caching only. The C* version is 2.1.2.
It's predictable that the cluster will run out of free disk space before too long, so its storage capacity needs to be increased. The client prefers increasing disk size over adding more nodes, so the plan is to replace the 2x4 TB spinning disks in each node with 3x6 TB spinning disks.
Are there any obvious pitfalls/caveats to be aware of here? Like:
Can C* handle up to 18 TB data size per node with this amount of RAM?
Is it feasible to increase the disk size by mounting a new (larger) disk, copying all SSTables to it, and then mounting it on the same mount point as the original (smaller) disk (to replace it)?

I would recommend adding nodes instead of increasing the data size of your current nodes. Adding nodes takes advantage of Cassandra's distribution by keeping nodes small and easily replaceable.
Furthermore, the recommended data size for a single node with spinning disks is around 1 TB. Once you go much higher than that, I can only imagine that performance will decrease significantly.
Not to mention that if a node loses its data, it will take a long time to recover, since it has to stream a huge amount of data from the other nodes.
Can C* handle up to 18 TB data size per node with this amount of RAM?
This depends heavily on your workload.
Is it feasible to increase the disk size by mounting a new (larger) disk, copying all SSTables to it, and then mounting it on the same mount point as the original (smaller) disk (to replace it)?
I don't see a reason why it would not work, as long as Cassandra is stopped (or at least drained) on that node while the SSTables are copied, and file ownership and permissions are preserved.

It's an anti-pattern in Cassandra. Distribution across many small nodes is a key feature of Cassandra.

Related

How to rebalance and reclaim disk space after adding a Cassandra node

I have a 12-node Cassandra cluster with a high data load, and disk space is nearing full capacity. I have expanded the cluster by adding 1 node and plan to add a couple more.
I can see that the data load per node was reduced after adding the new node. However, the disk space used has not gone down.
I'm wary of running nodetool repair, as it may require additional disk space and the available space may not be sufficient.
There are suggestions to use nodetool cleanup, but it looks like this will also cause a temporary increase in disk usage.
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/tools/toolsCleanup.html
Please suggest whether there are better ways to clean up old data from the other nodes to reclaim disk space.
Unfortunately, nodetool cleanup is the only way to evict data that a node no longer owns after new nodes are added to a cluster, and so to reclaim disk space.
Cleanup temporarily uses more space while it runs, since it needs to rewrite SSTables into new ones. This can be problematic if you have really large SSTables that are several GB in size and don't have a lot of disk space left.
You can work around this problem for large SSTables configured with SizeTieredCompactionStrategy by splitting them into smaller files on another server using the sstablesplit tool. I've documented the instructions in https://community.datastax.com/questions/6415/. Cheers!

Cassandra compaction: does replication factor have any influence?

Let's assume that the total disk usage of all keyspaces is 100 GB before replication and the replication factor is 3, making the total physical disk usage 100 GB x 3 = 300 GB.
We use the default compaction strategy (size-tiered), and let's assume the worst case where Cassandra needs as much free space as the data itself to complete a compaction. Does Cassandra need 100 GB (before replication) or 300 GB (100 GB x 3 with replication)?
In other words, when Cassandra needs free disk space to perform a compaction, does the replication factor have any influence?
Compaction in Cassandra is local to a Node.
Now let's say you have a 3-node cluster, the replication factor is also 3, and the original data size is 100 GB. This means that each node holds 100 GB worth of data.
Hence, on each node, you will need 100 GB of free space to compact the data present on that node.
TL;DR: the free space required for compaction is equal to the total data present on the node.
Because the data is replicated between the nodes, every node will need up to 100 GB of free space - so that's 300 GB in total across the cluster, but never 300 GB on any one node...
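To make the arithmetic explicit, here is a minimal sketch of that worst-case calculation using the numbers from the question (the variable names are illustrative only):
// Numbers from the question: 100 GB of raw data, RF = 3, 3 nodes.
val rawDataGb = 100.0
val replicationFactor = 3
val nodeCount = 3
// With RF equal to the node count, each node holds a full copy: 100 GB * 3 / 3 = 100 GB.
val perNodeDataGb = rawDataGb * replicationFactor / nodeCount
// Size-tiered worst case: a node needs free space roughly equal to its own data.
val perNodeHeadroomGb = perNodeDataGb                      // 100 GB on each node
val clusterWideHeadroomGb = perNodeHeadroomGb * nodeCount  // 300 GB in total, but never on one node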

Will Spark load data into memory if the data is 10 GB and RAM is 1 GB?

If I have a cluster of 5 nodes, each node having 1 GB of RAM, and my data file is 10 GB distributed across all 5 nodes, say 2 GB on each node, what happens if I trigger
val rdd = sc.textFile("filepath")
rdd.collect
Will Spark load the data into RAM? How will Spark deal with this scenario - will it refuse straight away, or will it process it?
Let's understand the question first, #intellect_dp. You have a cluster of 5 nodes (by "node" I mean a machine that generally includes a hard disk, RAM, a 4-core CPU, etc.), each node has 1 GB of RAM, and you have a 10 GB data file distributed such that 2 GB of data resides on the hard disk of each node. Let's assume you are using HDFS and that your block size at each node is 2 GB.
Now let's break this down:
block size at each node = 2 GB
RAM size of each node = 1 GB
Due to lazy evaluation in Spark, data is loaded into RAM and processed only when an action is triggered.
Here you are using collect as the action. The problem is that the RAM size is less than your block size, so if you process with Spark's default configuration (1 block = 1 partition) and no further nodes are added, it will give you an out-of-memory exception.
Now the question: is there any way Spark can handle this kind of large data with the given hardware provisioning?
Answer: yes. First you need to raise the minimum number of partitions:
val rdd = sc.textFile("filepath",n)
Here n is the minimum number of partitions for the whole file. Since we only have 1 GB of RAM per node, each partition has to stay well below 1 GB. With 10 GB of input, n = 20 gives:
each partition size = 10 GB / 20 = 500 MB
Spark will then process these 500 MB partitions a few at a time; when partitions no longer fit in memory, they are spilled to the hard disk (given that you have persisted the RDD with the storage level MEMORY_AND_DISK).
In this way it will process your whole 10 GB data file with the given cluster hardware configuration, as sketched below.
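Putting this together, a minimal spark-shell sketch might look like the following (the file path and the partition count of 20 are placeholders, assuming roughly 10 GB of input and about 1 GB of RAM per node):
import org.apache.spark.storage.StorageLevel
// Read the file with at least 20 partitions so each task handles roughly 500 MB.
val rdd = sc.textFile("filepath", 20)
// Persist with MEMORY_AND_DISK so partitions that do not fit in RAM spill to disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
// An aggregation such as count() never gathers the whole dataset in one place.
val lineCount = rdd.count()
// rdd.collect() would still try to copy all 10 GB into the driver and fail on a small driver heap.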
Now, I personally would not recommend the given hardware provisioning for such a case. It will process the data, but there are a few disadvantages:
Firstly, it involves many I/O operations, making the whole process very slow.
Secondly, if anything goes wrong while reading from or writing to the hard disk, the whole job can fail, which gets frustrating with such a hardware configuration. In addition, you can never be sure that Spark will be able to process your data and produce a result as the data grows.
So try to keep I/O operations to a minimum, and utilize Spark's in-memory computation power, with the addition of a few more resources, for faster performance.
When you use collect, all the data is gathered as an array on the driver node only.
From that point on, Spark's distribution and the other nodes don't play a part; you can think of it as a pure Java application on a single machine.
You can set the driver's memory with spark.driver.memory and ask for 10 GB.
From that moment, if you do not have enough memory for the array, you will probably get an OutOfMemoryError.
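As a rough sketch of why only the driver's heap matters here (the file path and the 100-row sample size are illustrative; note that spark.driver.memory has to be supplied when the application is launched, for example via spark-submit's --driver-memory option, because the driver JVM is already running by the time your code executes):
// collect() returns an Array holding every record, so the whole dataset must fit in the driver heap.
val rdd = sc.textFile("filepath")
// Alternatives that avoid pulling the full 10 GB into the driver:
val sample = rdd.take(100)   // fetches only the first 100 lines
val total  = rdd.count()     // aggregates on the executors, returns a single number
// rdd.collect() with a 1 GB driver and 10 GB of data will throw an OutOfMemoryError.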
On the other hand, if we do so, performance will be impacted and we will not get the speed we want.
Also, Spark keeps the data in RDDs, so what reaches the driver may not be the complete data; in the worst case, if we do a select * from tablename, it will return data in chunks, as much as it can afford...

How does Dataproc work with Google Cloud Storage?

I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc; data is read from and written to GCS, but I am unable to figure out the best machine types for my use case. Questions:
1) Does Spark on Dataproc copy data to local disk? E.g., if I am processing 2 TB of data, is it OK to use 4 worker nodes with a 200 GB HDD each, or should I at least provide disks that can hold the input data?
2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?
3) If the local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I mean, is it good to use SSDs?
Thanks
Manish
Spark will read data directly into memory and/or disk, depending on whether you use RDDs or DataFrames. You should have at least enough disk to hold all of the data. If you are performing joins, the amount of disk necessary grows to handle shuffle spill.
This equation changes if you discard a significant amount of data through filtering.
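For example (shown in Scala here, though the idea is the same in PySpark; the bucket and paths are hypothetical), Spark on Dataproc reads gs:// paths directly through the preinstalled GCS connector, so there is no separate copy-to-local-disk step, while shuffles from joins or wide aggregations spill to the workers' local disks:
// Read directly from GCS; the gs:// scheme is handled by the GCS connector on Dataproc.
val logs = spark.read.text("gs://my-bucket/input/")   // bucket and path are placeholders
// A wide aggregation shuffles data, and shuffle spill lands on the workers' local disks.
val counted = logs.groupBy("value").count()
// Results are written straight back to GCS, again without a manual copy step.
counted.write.parquet("gs://my-bucket/output/")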
Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and to whether your application is CPU- or IO-bound.
Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.
Same advice goes for network IO: more CPUs = more bandwidth.
Finally, default Dataproc settings are a reasonable place to start experimenting and tweaking your settings.
Source: https://cloud.google.com/compute/docs/disks/performance

Cassandra didn't read data from the SSTables

I'm testing Cassandra with YCSB, using workloadc (100% reads), and iostat always shows 0 reads.
Configurations:
data is on sdb, 24 GB of data, 8 GB heap size, default memtable size,
row cache and key cache disabled.
My thinking was that uniform requests would miss the memtables and have to look up the data in the SSTables, so iostat on the data directory should not be zero.
How could the memtables in an 8 GB heap hold all 24 GB of data?
Anybody hit the same problem?
There's no magic going on here. Your request workload must not be as random as you thought.
I happen to have a copy of YCSB checked out, and workloadc uses requestdistribution=zipfian, which is NOT uniform.
How much total memory is on the machine? If you have 32 GB or more of RAM, it could also be the OS page cache, which sits outside of the Cassandra process (i.e. not the heap). In scenarios like that, the OS (assuming it's Linux) will wind up caching the entire 24 GB in memory and you'll see little disk activity.
