Cassandra keyspace for counters - cassandra

I am trying to create a table for keeping counters of hits to my different APIs. I am using Cassandra 2.0.6, and I am aware that there have been some performance improvements to counters starting with 2.1.0, but I can't upgrade at this moment.
The documentation I read on DataStax always starts with creating a separate keyspace, as in these examples:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_counter_t.html
http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use_counter_t.html
From documentation:
Create a keyspace on Linux for use in a single data center, single node cluster. Use the default data center name from the output of the nodetool status command, for example datacenter1.
CREATE KEYSPACE counterks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };
Question:
1) Does it mean that I should keep my counters in a separate keyspace?
2) If yes, should I declare the keyspace as defined in the documentation examples, or is that just an example and I can set my own replication strategy, specifically replicating across data centers?
Thanks

Sorry you had trouble with the instructions. They need to be changed to make it clear that this is just an example, and improved by changing the RF to 3, for example.
Using a keyspace for a single data center and single node cluster is not a requirement. You need to keep counters in separate tables, but not separate keyspaces; however, keeping tables in separate keyspaces gives you the flexibility to change the consistency and replication from table to table. Normally you have one keyspace per application. See the related single vs. multiple keyspace discussion at http://grokbase.com/t/cassandra/user/145bwd3va8/effect-of-number-of-keyspaces-on-write-throughput.
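For example (a sketch only: the keyspace, table, and data center names below are invented, and the DC names have to match what nodetool status reports for your cluster), you could give the counters their own keyspace replicated across two data centers:
-- Dedicated keyspace for counters, replicated to both (hypothetical) data centers
CREATE KEYSPACE api_counters WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };
-- Counter columns must live in their own table; every non-key column is a counter
CREATE TABLE api_counters.hits_by_api (
  api_name text,
  day      text,
  hits     counter,
  PRIMARY KEY (api_name, day)
);
-- Counters are modified with UPDATE rather than INSERT
UPDATE api_counters.hits_by_api SET hits = hits + 1
  WHERE api_name = 'get_user' AND day = '2014-05-01';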

Related

Replication without partitioning in Cassandra

In Mongo we can go for any of the below models:
1) Simple replication (without sharding, where one node works as the master and the others as slaves), or
2) Sharding (where data is distributed across different shards based on a partition key), or
3) Both 1 and 2
My question: can't we have Cassandra with just replication, without partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In the case of Cassandra, the data is distributed across multiple nodes based on the partition key.
From the above, it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing data between nodes, but this distribution may depend on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor of 3, then all nodes will get the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key is also called the row key, and it forms (at least part of) the primary key, so you have one anyway. The first case of your Mongo example is not doable in Cassandra, mainly because Cassandra does not have the concept of masters and slaves. If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and perform the operation on. To my knowledge, the default here would be round-robin load balancing between the two nodes, so each of them receives roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little trickier. The nice part, though, is that you can determine the set of nodes that hold your data in the client code, so you don't lose any performance by connecting to a "wrong" node.
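To make that concrete, here is a minimal sketch (the keyspace and table names are invented for illustration): on a 3-node cluster with a replication factor of 3, every node holds a copy of each partition, but the hash of the partition key still determines which replicas "own" each row and which node a token-aware client would contact first.
-- Replication factor 3 on a 3-node cluster: every node ends up with all the data
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- user_id is the partition key; its token decides the replica set for each row
CREATE TABLE demo.users (
  user_id uuid PRIMARY KEY,
  name    text,
  email   text
);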
One more thing about partitioning: you can configure some of this, but it would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to Cassandra's default mechanism.
And one word about the secondary index thing: use materialized views.
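For example, continuing the sketch above (materialized views require Cassandra 3.0 or later; the names are again invented), a view lets you query by a non-key column without a secondary index:
-- Cassandra maintains this view automatically as the base table changes
CREATE MATERIALIZED VIEW demo.users_by_email AS
  SELECT user_id, name, email
  FROM demo.users
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);
-- Lookups by email now hit the view directly
SELECT user_id, name FROM demo.users_by_email WHERE email = 'alice@example.com';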

Error during inserting data: NoHostAvailable:

I am trying to learn the basics of Apache Cassandra. I found this simple example application at https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_music_service_c.html
So I created a keyspace, then I created a table, and now I am trying to add some data to the database.
But when I try to insert data, I get an error: "NoHostAvailable:" That's it. No more information.
So far I've tried updating the Python driver (NoHostAvailable exception connecting to Cassandra from python), but it didn't work.
What am I doing wrong? Or is it a problem with cqlsh?
OK, I've found the answer. The NetworkTopologyStrategy is not suited for running on a single node. After changing the replication strategy to SimpleStrategy, everything started to work.
Just met the same problem. Check the keyspace's replication setting; if using NetworkTopologyStrategy, ensure the DC name is correct.
To change the replication strategy (from NetworkTopologyStrategy) to SimpleStrategy (which is appropriate for a single node), run the following query:
ALTER KEYSPACE yourkeyspaceName WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
For me, it happened because one of the instances went down. I restarted the second instance and the error was gone. But in the schema table, I am seeing the topology as Simple for my keyspace. That is confusing.
Let's clear the air here...
You ABSOLUTELY can use NetworkTopologyStrategy in a single-node configuration. I currently have five versions of Cassandra installed on my local machine, all configured that way, and they work just fine.
That said, it is not as simple as just using SimpleStrategy, so there are some steps that need to be taken:
Start by setting the GossipingPropertyFileSnitch in the cassandra.yaml:
endpoint_snitch: GossipingPropertyFileSnitch
That tells Cassandra to use the cassandra-rackdc.properties file to name logical data centers and racks:
$ cat conf/cassandra-rackdc.properties | grep -v "#"
dc=dc1
rack=rack1
If you have a new cluster, you can change those. If you have an existing cluster, leaving them as they are is the best idea. But take note of the DC name, because you'll need it in your keyspace definition.
Now, if you define your keyspace like this:
CREATE KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
With this configuration, NetworkTopologyStrategy can be used just fine.
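If you want to double-check that the DC name in the keyspace matches what the snitch reports, a quick sanity check from cqlsh looks like this (the system_schema tables exist on Cassandra 3.0+; older versions expose the same information under system.schema_keyspaces):
-- Show the keyspace definition, including the replication map
DESCRIBE KEYSPACE stackoverflow;
-- On Cassandra 3.0+, the same information is queryable from the schema tables
SELECT keyspace_name, replication FROM system_schema.keyspaces
  WHERE keyspace_name = 'stackoverflow';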
Opinions will differ on this, but I do not recommend using SimpleStrategy. It's a good idea to practice getting used to NetworkTopologyStrategy on your local machine. I say this because I have seen the opposite happen: folks accidentally deploy a SimpleStrategy keyspace into a multi-DC HA (MDHA) cluster in production, and then wonder why their configured application consistency cannot be met.
It sometimes happens when the node that owns the partition ranges needed for that insert is down. Please check all of your nodes with the commands below and try to run the query again.
nodetool status
nodetool describecluster

Cassandra Error: "Unable to complete request: one or more nodes were unavailable."

I am a complete newbie at Cassandra and am just setting it up and playing around with it and testing different scenarios using cqlsh.
I currently have 4 nodes in 2 datacenters, something like this (with proper IPs of course):
a.b.c.d=DC1:RACK1
a.b.c.d=DC1:RACK1
a.b.c.d=DC2:RACK1
a.b.c.d=DC2:RACK1
default=DCX:RACKX
Everything seems to make sense so far, except that I brought down a node on purpose just to see the resulting behaviour, and I noticed that I can no longer query/insert data on the remaining nodes, as it results in "Unable to complete request: one or more nodes were unavailable."
I get that a node is unavailable (I did that on purpose), but isn't one of the main points of a distributed DB to continue to function even when some nodes go down? Why does bringing one node down put a complete stop to everything?
What am I missing?
Any help would be greatly appreciated!!
You're correct in assuming that one node down should still allow you to query the cluster, but there are a few things to consider.
I'm assuming that "nodetool status" returns the expected results for that DC (i.e. "UN" for the UP node, "DN" for the DOWNed node)
Check the following:
Connection's Consistency level (default is ONE)
Keyspace replication strategy and factor (the default is SimpleStrategy, which is rack/DC unaware)
In cqlsh, run "describe keyspace <keyspace_name>" to check both
Note that if you've been playing around with the replication factor, you'll need to run a "nodetool repair" on the nodes.
More reading here
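A quick way to check the first two items from cqlsh (replace mykeyspace with your own keyspace name):
-- Show the consistency level of the current cqlsh session (ONE by default), and set it if needed
CONSISTENCY;
CONSISTENCY ONE;
-- Show the keyspace definition, including its replication strategy and factor
DESCRIBE KEYSPACE mykeyspace;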
Is it possible that you did not set the replication factor on your keyspace with a value greater than 1? For example:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};
This will configure your keyspace such that data is replicated to 2 nodes in each of the dc1 and dc2 datacenters.
If your replication factor is 1 and a node that owns the data you are querying goes down, you will not be able to retrieve the data, and C* will fail fast with an unavailable error. In general, if C* detects that the consistency level cannot be met to service your query, it will fail fast.

Placing data in specific nodes in Cassandra

In Cassandra, can we "fix" the node in which a specific partition key resides to optimize fetches?
This is an optimization for a specific keyspace and table where data written in one data center is never read by clients in a different data center. If a particular partition key will be queried only in a specific data center, is it possible to avoid network delays by "fixing" it to the nodes of the same data center where it was written?
In other words, this is a use case where the schema is common across all data centers, but the data is never accessed across data centers. One way of doing this is to make the data center id the partition key. However, a specific data center's data need not (and should not) be placed in other data centers. Can we optimize by somehow telling Cassandra the partition-key-to-data-center mapping?
Is a custom Partitioner the solution for this kind of use case?
You should be able to use Cassandra's "datacenter awareness" to solve this. You won't be able to get it to enforce that awareness at the row level, but you can do it at the keyspace level. So if you have certain keyspaces that you know will be accessed only by certain localities (and served by specific datacenters) you can configure your keyspace to replicate accordingly.
In the cassandra-topology.properties file you can define which of your nodes is in which rack and datacenter. Then, make sure that you are using a snitch (in your cassandra.yaml) that will respect the topology entries (e.g. PropertyFileSnitch).
Then when you create your keyspace, you can define the replication factor on a per-datacenter basis:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
To get your client applications to only access certain datacenters, you can specify a LOCAL read consistency (ex: LOCAL_ONE or LOCAL_QUORUM). This way, your client apps in one area will only read from a particular datacenter.
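For instance, from cqlsh (drivers expose the same consistency setting; the users table here is purely illustrative):
-- Only replicas in the local data center need to acknowledge reads and writes
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM "Excalibur".users WHERE user_id = 42;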
a specific data center's data need not (and should not) be placed in other data centers.
While this solution won't solve that part of your question, unless you have disk space concerns (which, in this day and age, you shouldn't), having extra replicas of your data can save you in an emergency. If you lose one or all nodes in a particular datacenter and have to rebuild them, a cluster-wide repair will restore your data. Otherwise, if keeping the data separate is really that important, you may want to look into splitting the datacenters into separate clusters.
Cassandra determines the node on which to store a row using a partitioner strategy. Normally you use a partitioner, such as the Murmur3 partitioner, that distributes rows effectively randomly and thus uniformly. You can write and use your own partitioner, in Java. That said, you should be cautious about doing this: do you really want to assign a row to a specific node?
The data is too voluminous to be replicated across all data centers. Hence, I am resorting to creating a keyspace per data center.
CREATE KEYSPACE "MyLocalData_dc1"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1, 'dc3' : 0, 'dc4' : 0};
CREATE KEYSPACE "MyLocalData_dc2"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 0, 'dc2' : 3, 'dc3' : 1, 'dc4' : 0};
This way, MyLocalData generated by datacenter 1 has one backup in datacenter 2, and data generated by datacenter 2 is backed up in datacenter 3. Data is "fixed" in the data center it is written in and accessed from, and network latencies are avoided.

Is it possible to have a Cassandra table that doesn't replicate?

There is one particular table in the system that I actually want to stay unique on a per-server basis.
i.e. http://server1.example.com/some_stuff.html and http://server2.example.com/some_stuff.html should store and show data unique to that particular server. If one of the servers dies, that table and its data go with it.
I think CQL does not support table-level replication factors (see the available CREATE TABLE options). One alternative is to create a keyspace with a replication factor of 1:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'<strategy>' [,'<option>':<val>]};
Example:
To create a keyspace with SimpleStrategy and a "replication_factor" option with a value of "1", you would use this statement:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
then all tables created in that keyspace will have no replication (only a single copy of each row).
If you would like to have a table for each node, I think Cassandra does not directly support that. One workaround is to start an additional Cassandra cluster for each node, where each cluster has only one node.
I am not sure why you would want to do that, but here is my take on this:
Distribution of your actual data among the nodes in your Cassandra cluster is determined by the row key.
Just setting the replication factor to 1 will not put all data from one column family/table on 1 node. The data will still be split/distributed according to your row key.
Exactly WHERE your data will be stored is determined by the row key along with the partitioner as documented here. This is an inherent part of a DDBS and there is no easy way to force this.
The only way I can think of to have all the data for one server physically in one table on one node is to:
use one row key per server and create (very) wide rows (maybe using composite column keys),
and trick your token selection so that all row key tokens map to the node you are expecting (http://wiki.apache.org/cassandra/Operations#Token_selection)
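A sketch of what such a schema could look like (names are illustrative only; the table would live in the replication-factor-1 keyspace, and the token trickery itself happens outside CQL):
-- One partition (wide row) per server; everything for a server hashes to a single token
CREATE TABLE per_server_data (
  server_id text,        -- e.g. 'server1.example.com'
  item_id   timeuuid,    -- clustering column: many entries inside one wide row
  payload   text,
  PRIMARY KEY (server_id, item_id)
);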
