How to change cassandra standalone mode to distributed - cassandra

I have installed Cassandra 2.1 in standalone mode on two nodes separately.
Is there any way to change both to distributed mode, or to make both nodes part of one cluster?
Please help.

This is what you're looking for: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html
I also suggest taking a look at this hands-on training course: https://academy.datastax.com/courses/ds201-cassandra-core-concepts
It's free and definitely worth your time if you're thinking about using Cassandra in production.
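For reference, joining two standalone 2.1 nodes into one cluster mostly comes down to a few cassandra.yaml settings. A rough sketch, with placeholder IPs (10.0.0.1 and 10.0.0.2) and service commands standing in for your actual setup:

```shell
# On each node, stop Cassandra, then edit conf/cassandra.yaml so that both
# nodes agree on the cluster name and share a seed. Placeholder values:
#   cluster_name: 'MyCluster'        # must be identical on both nodes
#   - seeds: "10.0.0.1"              # pick one node as the seed
#   listen_address: <this node's IP>
#   rpc_address: <this node's IP>
sudo service cassandra stop
# ... apply the cassandra.yaml changes above ...
sudo service cassandra start   # start the seed node first, then the other
nodetool status                # both nodes should show as UN (Up/Normal)
```

If both nodes already hold data, the linked DataStax procedure is the safer path, since changing `cluster_name` on a node with existing system tables needs extra care.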

Related

How to add sstablesplit to an existing Cassandra cluster?

I am a beginner in Cassandra and currently have a small cluster with replication factor 3 and most of the parameters being set to default.
What I noticed the other day is that the SSTables have become absolutely massive (>1TB) and the logs are now starting to complain that they cannot perform a compaction anymore. I've looked into it and decided to switch to the LevelCompactionStrategy, as well as performing an sstablesplit on my existing SSTables.
However, at that point I noticed that sstablesplit did not come with my installation of Cassandra. Is there a way of installing just that tool? All the guides I've seen talk about installing the entire Datastax tech stack, which would probably invalidate my existing cluster or require a great deal of reinstalling which at the moment I cannot do. The Cassandra installation was not set up by me.
At the same time, LCS is complaining that it cannot perform re-compaction because it is trying to recompact all SSTables at once, and since they now take up slightly more than 50% of the hard drive space, it cannot find enough free space to do so.
If sstablesplit is impossible (or inadvisable), is there any other way to resolve my issue of having several SSTables which are too massive to be re-compacted into more manageable chunks?
Thanks!
sstablesplit is part of the Cassandra codebase, so you can use it even if it wasn't packaged with your installation. Adding the cassandra-all jar and the jars in lib/ to the classpath gives you everything needed to run it; that is all the sstablesplit script does: https://github.com/apache/cassandra/blob/trunk/tools/bin/sstablesplit.
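As a rough sketch of what that amounts to (the install path, SSTable path, and size threshold below are illustrative; `StandaloneSplitter` is the class the script invokes, and the node should be stopped, or the files copied aside, before splitting):

```shell
# Run the splitter directly from an existing install's jars and config.
CASSANDRA_HOME=/opt/cassandra   # adjust to your install location
java -cp "$CASSANDRA_HOME/conf:$CASSANDRA_HOME/lib/*" \
     org.apache.cassandra.tools.StandaloneSplitter \
     --size 50 \
     /var/lib/cassandra/data/mykeyspace/mytable/mykeyspace-mytable-ka-1-Data.db
```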
Is this in AWS or some cloud platform where you can get larger hosts temporarily? The easiest approach is to replace the hosts with new hosts that have 2x the disk space, migrate to LCS, then switch back to save costs.

YCSB for Cassandra 3.0 Benchmarking

I have a Cassandra virtual cluster on Ubuntu and need to benchmark it.
I am trying to do it with Yahoo's YCSB (without using Maven, if possible).
I use Cassandra 3.0.1 but I can't find a suitable version of YCSB.
I don't want to change to an older version of Cassandra (YCSB's latest cassandra-binding is for Cassandra 2.x).
What should I do?
As suggested here, even though Cassandra 3.x is not officially supported, you can use the cassandra-cql binding.
For instance:
/bin/ycsb load cassandra-cql -threads 4 -P workloads/workloada
I just tested it on Cassandra 3.11.0 and it works for both load and run.
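For completeness, the matching run phase is the same command with `run` in place of `load` (the `hosts` property is how the cassandra-cql binding finds your nodes; 127.0.0.1 is a placeholder address):

```shell
# Load, then run, workload A against a node at 127.0.0.1 (placeholder).
bin/ycsb load cassandra-cql -threads 4 -P workloads/workloada -p hosts=127.0.0.1
bin/ycsb run  cassandra-cql -threads 4 -P workloads/workloada -p hosts=127.0.0.1
```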
That said, the benchmarking tool to use depends on what you want to test. If you want to benchmark only Cassandra, then @gsteiner's solution might be the best. If you want to benchmark different databases using the same tool to avoid variability, then YCSB is the right one.
I would recommend using Cassandra-stress to perform a load/performance test on your Cassandra cluster. It is very customizable, to the point that you can test distributions with different data models as well as specify how hard you want to push your cluster.
Here is a link to the Datastax documentation for it that goes into how to use the tool in depth.
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCStress_t.html
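As a minimal illustration (the operation counts, thread count, and node address below are placeholders, not tuned values):

```shell
# Write 100k rows, then read them back, against a node at 127.0.0.1.
# cassandra-stress ships in the tools/bin directory of the distribution.
cassandra-stress write n=100000 -rate threads=50 -node 127.0.0.1
cassandra-stress read  n=100000 -rate threads=50 -node 127.0.0.1
```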

does Cassandra's CCM tool only support one keyspace?

I am working with a cluster I created with ccm. We are using 3 tables in 2 keyspaces, so 6 tables in total. I was having a problem where it let me create one table in one keyspace and two in the other, but even when I removed my
IF NOT EXISTS
check, it would still give me an error saying the table already exists. It seems that the create is ignoring the fact that these are supposed to be in 2 separate keyspaces.
These are the same CQL script files that we run against our dev cloud Cassandra cluster, so I know it's not an issue with the scripts. That, and the create statements are pretty simple and straightforward.
So does CCM only support one keyspace? If so, that seems like a pretty big limitation and makes it much much less useful, if we can even use it at all for our local dev and testing purposes.
Thanks!
The answer to your question is: No, CCM doesn't support only one keyspace.
CCM doesn't have any restrictions at all built into it. Under the covers, it is just a set of Python scripts for configuring and launching a Cassandra cluster on a single machine.
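As a quick sanity check that identically named tables in two keyspaces coexist fine on a ccm cluster (the cluster, keyspace, and table names here are made up; ccm nodes listen on the 127.0.0.x loopback addresses):

```shell
# Create and start a 3-node local cluster, then create the same table
# name in two different keyspaces via cqlsh.
ccm create test -v 2.1.19 -n 3 -s
cqlsh 127.0.0.1 -e "
  CREATE KEYSPACE ks1 WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
  CREATE KEYSPACE ks2 WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
  CREATE TABLE ks1.users (id int PRIMARY KEY, name text);
  CREATE TABLE ks2.users (id int PRIMARY KEY, name text);"
```

If the same script fails under ccm but works on your dev cluster, it is more likely a session issue (for example a stray USE statement or a missing keyspace qualifier) than a ccm limitation.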

Using Spark in conjunction with Cassandra?

In our current infrastructure we use a Cassandra cluster as our backend database, and via Solr we use a web UI for our customers to perform read queries on our database as necessary.
I've been asked to look into Spark as something that we could implement in the future, but I'm having trouble understanding how it will improve what we currently do.
So my basic questions are:
1) Is Spark something that would replace Solr for querying the database, like when a user is looking something up on our site?
2) Just a general idea: what type of infrastructure would be necessary to improve our current situation (5 Cassandra nodes, all of which also run Solr)?
In other words, would we simply be looking at building another cluster of just Spark nodes?
3) Can Spark nodes run on the same physical machine as Cassandra? I'm guessing it would be a bad idea due to memory constraints as my very basic understanding of Spark is that it does everything in memory.
4) Any good quick/basic resources I can use to start figuring out how Spark might benefit us? I have access to Datastax Academy courses so I'm going through those, just wondering if there is anything else to help with my research.
Basically once I figure out what it is, and more importantly how/if it is something we can use to our advantage I'll start playing with some test instances, but I should probably familiarize myself with the basics first.
1) No. Spark is a batch processing system and Solr is a live indexing solution. Latency on Solr is going to be sub-second, while Spark jobs are meant to take minutes (or more). There should really be no situation where Spark can be a drop-in replacement for Solr.
2) I generally recommend a second datacenter running both C* and Spark on the same machines. It will have the data from the first datacenter via replication.
3) Spark does not do everything in memory. Depending on your use case, it can be a great idea to run it on the same machines as C*. This can allow for data locality when reading from C* and help out significantly on table scan times. I usually recommend colocating Spark executors and C* nodes.
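To make (3) concrete, a colocated setup is typically just a matter of pointing the connector at the local node. A sketch using the open-source spark-cassandra-connector (the package coordinates, contact point, and job script name are illustrative assumptions, not a tested pairing with your versions):

```shell
# Submit a job that reads C* tables via the DataStax spark-cassandra-connector.
# With Spark workers colocated on the C* nodes, the connector can schedule
# partitions on the node that owns the data (locality).
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.2 \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  my_table_scan_job.py
```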
4) DS Academy 320 course is probably the best resource out there atm. https://academy.datastax.com/courses/getting-started-apache-spark

Regarding upgrade from 2.0.3 to 2.0.7

I am currently planning an upgrade to Cassandra version 2.0.7. My base version is 2.0.3. I have not done an upgrade so far and hence want to be absolutely sure about what I am doing. Can someone explain whether anything needs to be done apart from this:
1. Do a nodetool drain to stop all writes to the particular node.
2. Stop the Cassandra node (I have an 8-node, 2-data-center network topology; I am bringing down one node in DC1).
3. Change the cassandra.yaml accordingly in the new binary tarball.
4. Make the required changes for the new node (we use GossipingPropertyFileSnitch, so make the changes for that).
5. Start the new Cassandra binary (2.0.7).
The questions striking me the most:
1. Do I have to copy the data from 2.0.3 to 2.0.7?
2. Even if it's a rolling upgrade, I think the steps above will do (except moving from one version to another). Is my assumption right?
3. I am going to do this operation on a running application. I am planning to keep the application running while doing this, as I have enough replicas to satisfy reads and writes at LOCAL_QUORUM. Does this idea have any disadvantages? I love Cassandra for this kind of operation, but would like to know if there are any potential problems.
4. I will have the existing 2.0.3 on my running machine while doing this. If there is a problem with 2.0.7, I can just start the 2.0.3 version again, right? Just wanted to know whether there will be any data conflicts with other nodes in the cluster, or whether having a snapshot to recover the data is the best option.
5. Apart from this, is there anything else I have to bear in mind?
Do I have to copy the data from 2.0.3 to 2.0.7? Even if it's a rolling upgrade, I think the steps above will do (except moving from one version to another). Is my assumption right?
If you just upgrade the binaries, you can leave all of the data in place and it will use it automatically.
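Under that assumption, the per-node procedure from the question can be sketched as follows (service commands and paths vary by install; this assumes the 2.0.7 binaries point at the same data directories):

```shell
nodetool drain                 # flush memtables; node stops accepting writes
sudo service cassandra stop    # stop the 2.0.3 process
# unpack the 2.0.7 tarball and carry over your edited cassandra.yaml and
# GossipingPropertyFileSnitch config (cassandra-rackdc.properties)
sudo service cassandra start   # 2.0.7 starts against the existing data dirs
nodetool status                # wait for UN before moving to the next node
```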
I am going to do this operation on a running application. I am planning to keep the application running while doing this, as I have enough replicas to satisfy reads and writes at LOCAL_QUORUM. Does this idea have any disadvantages? I love Cassandra for this kind of operation, but would like to know if there are any potential problems.
Normal read and write operations are fine. While you are temporarily running a mixed-version cluster, it's best to avoid doing anything that involves streaming (repairs) or topology changes (bootstrapping or decommissioning nodes). They might work, but they're not officially supported and you're more likely to have problems.
I will have the existing 2.0.3 on my running machine while doing this. If there is a problem with 2.0.7, I can just start the 2.0.3 version again, right? Just wanted to know whether there will be any data conflicts with other nodes in the cluster, or whether having a snapshot to recover the data is the best option.
You want to have a snapshot to recover from. Newer versions of Cassandra may use new SSTable or commitlog formats which the older version will not be able to read.
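Taking that snapshot is a one-liner per node (the tag name is arbitrary):

```shell
# Snapshot every keyspace on this node before upgrading.
nodetool snapshot -t pre-2.0.7-upgrade
# Snapshots are hard links under each table's data directory, e.g.:
#   /var/lib/cassandra/data/<keyspace>/<table>/snapshots/pre-2.0.7-upgrade/
nodetool clearsnapshot -t pre-2.0.7-upgrade   # remove once the upgrade is verified
```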
