Cassandra upgrade to 6.0 version from 3.2.4 - cassandra

We have a production environment with a 4-node Cassandra cluster. The data has a TTL of 730 days and has accumulated to a very large volume (14 TB). We know this is not ideal. We have a Spring-based Java application using JDBC. The write rate is around 1000 records/sec.
As part of maintenance, we want to upgrade Cassandra from 3.2.4 to 6.0, and in the new cluster we want to follow the recommended Cassandra node configuration of about 1 TB of data per node. What would be the ideal way to migrate to Cassandra 6.0 without affecting application latency, and with zero downtime (ZDT) in Cassandra? 12 TB is a huge amount of data and compaction is a daunting task. We want to rectify this.
One solution we came up with is an offline/online model, where the old 3.2.4 database would remain in place and the new Cassandra 6.0 cluster would use a smaller TTL. Our only concern is avoiding added latency in the application. Can replication across DCs running different versions of Cassandra help?
We don't know the design decisions made during the development phase, but we want to rectify this as part of maintenance.
Correct me if our understanding is wrong.
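As a rough back-of-the-envelope check of the numbers in the question, the sketch below estimates how many nodes a 1 TB-per-node target implies and how many records stay live under a given TTL. It assumes the 14 TB figure is the total on-disk size across the cluster and that the write rate is constant; the function names are hypothetical and only for illustration.

```python
import math

SECONDS_PER_DAY = 86_400


def nodes_needed(total_data_tb: float, target_tb_per_node: float) -> int:
    """Nodes required so that no node holds more than the target (rounded up)."""
    return math.ceil(total_data_tb / target_tb_per_node)


def records_retained(writes_per_sec: int, ttl_days: int) -> int:
    """Records still alive at steady state for a constant write rate and TTL."""
    return writes_per_sec * ttl_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumption: 14 TB is the cluster-wide on-disk total from the question.
    print("nodes at 1 TB/node:", nodes_needed(14, 1))
    # 1000 records/sec with the current 730-day TTL.
    print("records under 730-day TTL:", records_retained(1000, 730))
    # A shorter TTL (365 days here is an arbitrary example) halves that count.
    print("records under 365-day TTL:", records_retained(1000, 365))
```

Under these assumptions, retained data scales linearly with the TTL, so a smaller TTL on the new cluster reduces the steady-state volume in the same proportion.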

Related

Migrate Data from one Riak cluster to another

I have a situation where we need to migrate data from one Riak cluster to another and then remove the old cluster. The ring size will be the same, and so will the region. We need to do this to upgrade the instances to AL2. Is there a clean approach to do so on Prod, without realtime data loss?
The answer to this may be tied to your version of Riak KV. If you have the open source version of Riak KV 2.2.3 or earlier, this will require an in-situ upgrade to Riak KV 2.2.6 before progressing. See https://www.tiot.jp/riak-docs/riak/kv/2.2.6/setup/upgrading/version/ with packages at https://files.tiot.jp/riak/kv/2.2/2.2.6/
For the Enterprise Edition of Riak KV 2.2.3 and earlier, or the open source edition of Riak KV 2.2.6 or higher, you can use multi-data centre replication (MDC).
Use both of these at the same time for proper replication and to prevent data loss:
fullsync replication will copy across all stored data on its first run and then any missing data on subsequent runs.
realtime replication will replicate all transactions in almost realtime.
If you then set this up as bidirectional replication (get each cluster to replicate to the other for both fullsync and realtime), you will be able to seamlessly switch your production environment from one cluster to the other without any issues. Once you are happy everything is working as expected, you can kill the old cluster.
Please see the documentation for replication at https://www.tiot.jp/riak-docs/riak/kv/2.2.6/using/cluster-operations/v3-multi-datacenter/
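The bidirectional setup described above can be sketched as a list of riak-repl commands to run on each cluster. This is only an illustration: the cluster names and hostnames are hypothetical, port 9080 is assumed to be the cluster manager port, and the exact command sequence should be checked against the linked replication documentation for your version. It is likely safest to name both clusters before connecting them.

```python
def replication_commands(local_name: str, remote_name: str,
                         remote_host: str, port: int = 9080) -> list[str]:
    """Commands to run on the local cluster so it replicates to the remote one.

    Assumption: v3 multi-datacenter replication, with realtime and fullsync
    both enabled toward the remote (sink) cluster.
    """
    return [
        f"riak-repl clustername {local_name}",
        f"riak-repl connect {remote_host}:{port}",
        f"riak-repl realtime enable {remote_name}",
        f"riak-repl realtime start {remote_name}",
        f"riak-repl fullsync enable {remote_name}",
        f"riak-repl fullsync start {remote_name}",
    ]


if __name__ == "__main__":
    # Hypothetical names and hosts; run each list on the cluster it belongs to.
    for cmd in replication_commands("old_cluster", "new_cluster", "new-riak-1.example.com"):
        print("[old]", cmd)
    for cmd in replication_commands("new_cluster", "old_cluster", "old-riak-1.example.com"):
        print("[new]", cmd)
```

Generating both directions from one helper keeps the two clusters' configurations symmetric, which is what makes the later switchover reversible.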

Cassandra(with Hadoop) performance with Spark

We are running Spark/Hadoop on a different set of nodes than Cassandra. We have 10 Cassandra nodes and multiple Spark cores, but Cassandra is not running on Hadoop. Performance when fetching data from Cassandra through Spark (in YARN client mode) is not very good, and bulk data reads from HDFS are faster (6 mins from Cassandra vs. 2 mins from HDFS). Changing Spark-Cassandra parameters is not helping much either.
Will deploying Hadoop on top of Cassandra solve this issue and have a major impact on read performance?
Without looking at your code: bulk reads in an analytics/Spark capacity are always going to be faster when going directly to the file than when reading from a database. The database offers other advantages such as schema enforcement, availability, distribution control, etc., but I think the performance differences you're seeing are normal.

Cassandra cluster planning

I would like to know about hardware limitations in cluster planning (in TBs) specific to my use case. I have read a few threads and documents related to it, but some of the content seems to be over 5 years old. I thought I would give it a shot again:
Use case: Building a time-series Cassandra cluster that is bulk loaded from time to time from data sources that are gigabytes in size. However, end users will mostly be reading data from the cluster. Updates or deletes on the rows will be quite rare.
I have an initial hardware configuration with me to setup Cassandra cluster:
2*12 Cores
128 GB RAM
HDD SAS 3.27 TB
This is the initial plan that I came up with:
Now that I think over the setup, and after reading the post:
should I further divide my nodes into ones with less RAM, fewer vCPUs and smaller HDDs?
If yes, what would be a good fit for my case?

planning for graphite components for big cassandra cluster monitoring

I am planning to set up an 80-node Cassandra cluster (currently version 2.1, but it will be upgraded to 3 in the future).
I have gone through http://graphite.readthedocs.io/en/latest/tools.html which has a list of the tools that Graphite supports.
I want to decide which tools to choose as listener and storage so that the setup can scale.
As a listener, should I use the default carbon or should I choose graphite-ng?
As for the storage component, I am not sure whether the default whisper is enough. Or should I look at other options (like InfluxData, Cyanite or some RDBMS such as Postgres/MySQL)?
As the GUI component, I have settled on Grafana for better visualization.
I think Datadog + Grafana will work fine, but Datadog is not open source. So please suggest an open source alternative that can scale up to 100 Cassandra nodes.
I have 35 Cassandra nodes (different clusters) monitored without any problems with graphite + carbon + whisper + grafana. But I have to say that re-configuring collection and aggregation windows with whisper is a pain.
There are many alternatives today for this job; you can use the influxdb (+ telegraf) stack, for example.
Also, with Datadog you don't need Grafana, since it is a visualization platform as well. I worked with it some time ago, and it had some misleading names for some metrics in its plugin, and some metrics were just missing. On the plus side, it's really easy to install and use.
We have a Cassandra cluster of 36 nodes in production right now (we had 51, but we have since migrated to a different instance type, so we need fewer C* servers now), monitored using a single Graphite server. We are also saving data for 30 days, but at a 60s resolution. We excluded the internode metrics (e.g. open connections from a to b) because of how the metric count scales, but keep all others. This totals ~510k metrics, each whisper file being ~500kb in size => ~250GB. iostat tells me that we have write peaks of up to ~70k writes/s. This is all done on a single AWS i3.2xlarge instance, which includes 1.9TB of NVMe instance storage and 61GB of RAM. To fully utilize the power of this disk type, we increased the number of carbon caches. The CPU usage is very low (<20%) and so is the iowait (<1%).
I guess we could get away with a less beefy machine, but this gives us a lot of headroom for growing the cluster, and we are constantly adding new servers. For the monitoring: be prepared for AWS to terminate these machines more often than others, so backup and restore are more likely to be a regular operation.
I hope this little insight helped you.
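The ~500kb per whisper file and ~250GB total quoted above can be sanity-checked from the retention settings. The sketch below assumes a single archive per metric (60s resolution kept for 30 days) and uses the struct layouts from whisper's on-disk format (a 16-byte metadata header, 12 bytes of archive info, 12 bytes per data point); the function name is hypothetical.

```python
import struct

# Sizes taken from whisper's struct formats (network byte order, no padding).
METADATA_SIZE = struct.calcsize("!2LfL")     # file header
ARCHIVE_INFO_SIZE = struct.calcsize("!3L")   # one entry per archive
POINT_SIZE = struct.calcsize("!Ld")          # timestamp + float64 value


def whisper_file_bytes(resolution_sec: int, retention_sec: int) -> int:
    """Size of a whisper file with one archive (assumption: no extra archives)."""
    points = retention_sec // resolution_sec
    return METADATA_SIZE + ARCHIVE_INFO_SIZE + points * POINT_SIZE


if __name__ == "__main__":
    per_file = whisper_file_bytes(60, 30 * 86_400)
    total = per_file * 510_000  # ~510k metrics, as stated in the answer
    print(f"per file: {per_file / 1024:.0f} KiB")
    print(f"total:    {total / 1e9:.0f} GB ({total / 2**30:.0f} GiB)")
```

Because whisper preallocates every file at its full size, this total is used from day one regardless of how many points have actually been written.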

Configure cassandra to use different network interfaces for data streaming and client connection?

I have a Cassandra cluster deployed with 3 nodes and a replication factor of 3. I have a lot of data being written to Cassandra on a daily basis (10-15GB). I have provisioned these nodes on commodity hardware, as suggested by the "Big data community", and I am expecting the nodes to go down frequently, which is handled by the redundancy Cassandra provides.
My problem is that I have observed Cassandra writes slowing down when a new node is provisioned and data is being streamed to it while bootstrapping. To overcome this hurdle, we have decided to use one network interface for inter-node communication and a separate one for the client application writing data to Cassandra. My question is: how can this be configured, if it is possible at all?
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.
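As a minimal sketch of that last suggestion, the helper below renders the two relevant cassandra.yaml settings for a node that has one IP on the inter-node network and another on the client-facing network. The IP addresses and the function name are hypothetical; only listen_address and rpc_address come from the answer above.

```python
def interface_settings(internode_ip: str, client_ip: str) -> str:
    """Render the cassandra.yaml lines that split traffic across two interfaces.

    listen_address: used for inter-node traffic (gossip, streaming).
    rpc_address:    used for client connections.
    """
    lines = [
        f"listen_address: {internode_ip}",
        f"rpc_address: {client_ip}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical addresses: 10.0.0.x for inter-node, 192.168.1.x for clients.
    print(interface_settings("10.0.0.11", "192.168.1.11"))
```

Each node needs its own pair of values, and the client application then connects to the client-facing addresses only.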
