I am trying to understand data replication in Cassandra. In my case, I have to store a huge number of records into a single table, partitioned by a yymmddhh primary key.
I have two data centers (DC1 and DC2), and I created a keyspace using the CQL below.
CREATE KEYSPACE db1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1 };
Then I created a new table tbl_data with the CQL below:
CREATE TABLE db1.tbl_data (
yymmddhh varchar,
other_details text,
PRIMARY KEY (yymmddhh)
) WITH read_repair_chance = 0.0;
Now I can see that the keyspace "db1" and the table "tbl_data" were created successfully. I have a few million rows to insert, and I assume all rows will be stored on both data centers, i.e. DC1 and DC2, since the replication factor is 1 for each data center.
Suppose that after some time I need to add more nodes, since the number of records can grow to billions, and one data center can't hold that many records due to disk space limitations.
a) So, how can I divide the data across different nodes and add new nodes on demand?
b) Do I need to alter keyspace "db1" to put the names of new data centers in the list?
c) How will the current system scale horizontally?
d) I am connecting to Cassandra using the nodejs driver with the code below. Do I need to put the IP addresses of all nodes in the code? If I keep increasing the number of nodes on demand, do I need to change the code every time?
var client = new cassandra.Client({ contactPoints: ['ipaddress_of_node1'], keyspace: 'db1' });
From the examples above you can see that my basic requirement is to store a huge number of records in a single table, spreading the data across different servers, and to be able to add new servers if the data volume increases.
a) If you add new nodes to the data center, the data will be automatically shared between the nodes. With replication factor 1, two nodes in the data center, and default settings, it should be ~50% on each node, though it might take a while to redistribute data between the nodes after adding a new one. 'nodetool status <keyspace>' can show you which node owns how much of that keyspace.
b) Yes, I do believe you have to (though I'm not 100% sure on this).
c) With your setup it will scale horizontally and roughly linearly (assuming the machines are equal and have the same num_tokens value), distributing data as 1 divided by the number of nodes (1 node = 100%, 2 = 50%, 3 = 33%, etc.); both throughput and storage capacity will scale.
d) No. Assuming the nodejs driver works like Cassandra's C++ and Python drivers (it should!), after connecting to the cluster it will become aware of the other nodes automatically.
The answer by dbrats covers most of your concerns.
Do I need to alter keyspace "db1" to put name of new data centers in the list?
Not needed. You only want to alter it if you add a new data center or change the replication factor.
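For example, if a hypothetical new data center DC3 joined the cluster, the keyspace from the question could be extended like this (DC3 is an assumed name):

```sql
-- hypothetical: extend replication to a newly added data center DC3
ALTER KEYSPACE db1
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
                         'DC1' : 1, 'DC2' : 1, 'DC3' : 1 };
```

After the alter, running nodetool rebuild on the new data center's nodes streams the existing data to them.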
Do I need to put ip address of all nodes here in code?
Not needed, but adding more than one contact point ensures higher availability: if one contact point is down, the driver can connect to another. Once connected, it discovers the full list of nodes.
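As a sketch, the client from the question could list several contact points (the extra addresses are placeholders; localDataCenter is required by recent driver versions):

```javascript
var cassandra = require('cassandra-driver');
// several contact points: if the first is down, the driver tries the next
var client = new cassandra.Client({
    contactPoints: ['ipaddress_of_node1', 'ipaddress_of_node2', 'ipaddress_of_node3'],
    localDataCenter: 'DC1',  // newer driver versions require this
    keyspace: 'db1'
});
```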
Related
In Mongo we can go for any of the below models:
1. Simple replication (without sharding, where one node acts as master and the others as slaves), or
2. Sharding (where data is distributed across different shards based on a partition key), or
3. Both 1 and 2.
My question: can't we have Cassandra with just replication, without partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have, and the partition key is responsible for distributing data between nodes. But this distribution also depends on how many nodes you have. For example, if you have a 3-node cluster and a replication factor of 3, then every node will get all the data anyway.
Basically your intuition is right: the data is always distributed based on the partition key. The partition key is the first component of the primary key (historically also called the row key), so you have one anyway. The first case of your Mongo example is not doable in Cassandra, mainly because Cassandra does not have the concept of masters and slaves.

If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and performs the operation there. To my knowledge, the default is round-robin load balancing between the two nodes, so each receives roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little trickier. The nice part, though, is that the client can determine which set of nodes holds your data, so you don't lose any performance by connecting to a "wrong" node.
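If what you want is merely "every node holds everything" (the closest analogue of Mongo's model 1), you can set the replication factor equal to the cluster size: the data is still partitioned logically, but every node owns a full copy. A sketch for a hypothetical 2-node, single-data-center cluster:

```sql
-- with RF = number of nodes, every node stores a full copy
CREATE KEYSPACE fullcopy
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
```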
One more thing about partitions: you can configure some of this, but it would be per server and not per table. I've never used it, and personally I wouldn't recommend doing so. Just stick to Cassandra's default mechanism.
And one word about the secondary index question: use materialized views.
Is there any cloud storage system (e.g. Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, let's say we have found hotspot objects in the system, so we want to increase their replication factor as a solution.
Thanks
In Cassandra the replication factor is controlled per keyspace. You first define a keyspace by specifying the replication factor it should have in each of your data centers. Within the keyspace you then create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are stored as rows in a table, addressed by a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
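For instance, raising a keyspace (name hypothetical) to two replicas per data center could look like this, followed by "nodetool repair" on each node:

```sql
-- raise the replication factor, then repair each node
ALTER KEYSPACE mykeyspace
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
                         'DC1' : 2, 'DC2' : 2 };
```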
Then, if you use for example the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as a round-robin or token-aware policy. So if you have multiple replicas of the table holding the objects, the load of accessing an object can be round-robined across just the nodes that have a copy of the row you are accessing. If you use a read consistency level of ONE, this spreads out the read load.
So the granularity of this is not at the object level but at the keyspace level. If all your objects are stored in one table, changing the replication factor changes it for every object in that table, not just one. You could, however, have multiple keyspaces with different replication factors, keeping high-demand objects in a keyspace with a high RF and less frequently accessed objects in a keyspace with a low RF.
Another way to reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. Rows are assigned to nodes by hashing the compound partition key, so one field of the partition key could be a "copy_number" value. When you read the object, you pick a random copy_number (from 0 to the number of copy rows minus 1), so the read load will likely hit a different node on each read. This approach gives you per-object granularity, unlike changing the replication factor for a whole table, at the cost of more programming work to write and randomly read the extra rows.
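A sketch of this pattern (the table and column names are made up for illustration):

```sql
-- hypothetical schema: several copy rows per hot object
CREATE TABLE hot_objects (
    object_id   text,
    copy_number int,
    payload     blob,
    PRIMARY KEY ((object_id, copy_number))  -- compound partition key
);

-- the writer stores the same payload under each copy_number 0..N-1
INSERT INTO hot_objects (object_id, copy_number, payload) VALUES ('obj42', 0, 0xCAFE);
INSERT INTO hot_objects (object_id, copy_number, payload) VALUES ('obj42', 1, 0xCAFE);

-- the reader picks a random copy_number in [0, N) for each read
SELECT payload FROM hot_objects WHERE object_id = 'obj42' AND copy_number = 1;
```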
In Infinispan you can also set the number of owners (replicas) per cache (the equivalent of Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (a.k.a. the consistent hash) does not contain all keys, but instead splits the 32-bit hashCode() range into a variable number of segments and then specifies the distribution only for those segments, there is no way to specify the number of replicas per entry.
Theoretically, with specially forged keys and a custom consistent-hash factory, you could achieve something similar even within one cache (certain kinds of keys would be replicated a different number of times), but that would require coding with a deep understanding of the system.
In any case, the reader would have to know the number of replicas in advance, since it would be part of the routing information (the cache in the simple case, the special keys as described above), so it's not really practical.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation uses a master/slave(s) setup, so by default all reads go through the master. But there is a special setting that also allows reading from backups. So if you have a 10-node cluster and a backup count of 5, a total of 6 members hold the entry: the master and the 5 backup holders can read locally, while the remaining members have to hit the master.
There is also a fully replicated map, where every entry is sent to every machine. So in a 10-node cluster, all reads are local since every machine has the same data.
In the case of IMap, we don't provide control over the number of backups at the key/value level; the whole map is configured with a single backup-count.
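In Hazelcast's XML configuration, that map-level setting looks roughly like this (the map name is hypothetical):

```xml
<map name="hotMap">
    <!-- 5 synchronous backups of every entry -->
    <backup-count>5</backup-count>
    <!-- allow members to serve reads from a locally held backup -->
    <read-backup-data>true</read-backup-data>
</map>
```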
In Cassandra, can we "fix" the node in which a specific partition key resides to optimize fetches?
This is an optimization for a specific keyspace and table where data written in one data center is never read by clients in a different data center. If a particular partition key will be queried only in a specific data center, is it possible to avoid network delays by "fixing" it to the nodes of the data center where it was written?
In other words, this is a use case where the schema is common across all data centers, but the data is never accessed across data centers. One way of doing this is to make the data center id the partition key. However, a specific data center's data need not (and should not) be placed in other data centers. Can we optimize by somehow telling Cassandra the partition-key-to-data-center mapping?
Is a custom Partitioner the solution for this kind of use case?
You should be able to use Cassandra's "datacenter awareness" to solve this. You won't be able to get it to enforce that awareness at the row level, but you can do it at the keyspace level. So if you have certain keyspaces that you know will be accessed only by certain localities (and served by specific datacenters) you can configure your keyspace to replicate accordingly.
In the cassandra-topology.properties file you can define which of your nodes is in which rack and data center. Then make sure that you are using a snitch (in your cassandra.yaml) that respects the topology entries (e.g. PropertyFileSnitch).
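Entries in cassandra-topology.properties map node IP addresses to a data center and rack; a minimal sketch with placeholder addresses:

```properties
# format: IP=datacenter:rack (addresses below are placeholders)
192.168.1.10=dc1:rack1
192.168.1.11=dc1:rack2
192.168.2.10=dc2:rack1
# fallback for nodes not listed explicitly
default=dc1:rack1
```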
Then when you create your keyspace, you can define the replication factor on a per-datacenter basis:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
To get your client applications to only access certain datacenters, you can specify a LOCAL read consistency (ex: LOCAL_ONE or LOCAL_QUORUM). This way, your client apps in one area will only read from a particular datacenter.
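In cqlsh, for example, you can pin the session to a data-center-local consistency level (the table and key here are hypothetical):

```sql
-- subsequent reads are served by replicas in the local data center only
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM "Excalibur".user_data WHERE user_id = 'alice';
```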
a specific data center's data need/should not be placed in other data centers.
While this solution won't solve this part of your question, unless you have disk space concerns (which in this day and age, you shouldn't) having extra replicas of your data can save you in an emergency. If you should lose one or all nodes in a particular datacenter and have to rebuild them, a cluster-wide repair will restore your data. Otherwise if keeping the data separate is really that important, you may want to look into splitting the datacenters into separate clusters.
Cassandra determines which node stores a row using a partitioner strategy. Normally you use a partitioner, such as the Murmur3 partitioner, that distributes rows effectively randomly and thus uniformly. You can write and use your own partitioner, in Java. That said, you should be cautious about doing this: do you really want to tie a row to a specific node?
The data is too voluminous to be replicated across all data centers, hence I am resorting to creating a keyspace per data center.
CREATE KEYSPACE "MyLocalData_dc1"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1, 'dc3' : 0, 'dc4' : 0};
CREATE KEYSPACE "MyLocalData_dc2"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 0, 'dc2' : 3, 'dc3' : 1, 'dc4' : 0};
This way, data in MyLocalData_dc1 generated by data center 1 has one backup in data center 2, and data generated by data center 2 is backed up in data center 3. Data stays "fixed" in the data center where it is written and accessed, and cross-data-center network latencies are avoided.
This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC stores roughly 25% of the key space, and that 25% is replicated to each of the other three DCs, meaning that every single row has 4 copies overall.
For instance, if my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each portion of the data is replicated only once or twice, instead of to all of the other DCs (x3)?
For example: the 25% of the key space that is stored on DC1, I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if DC1's 25% is split over multiple DCs; I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible?
Thank you
Best Regards
Your keyspace command shows that each DC holds 1 copy of the data. This means that if you have 1 node in each DC, each node will have 100% of your data. So I am not sure how you concluded that each DC stores only 25% of the keys, when they are in fact storing 100%. Chances are that when you run nodetool you are not specifying the keyspace, so the command shows a load based on the token range assigned to each node, which is misleading for a NetworkTopologyStrategy setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted an even distribution and had an equal number of nodes in each DC with initial tokens spaced evenly, you could use SimpleSnitch to achieve what you want: change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change, previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopologyStrategy works is that if you specify DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across those 2 nodes, leading to an effective load of 50% on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can discard one DC and save money. I am saying this because I suspect these DCs are virtual in your NetworkTopology rather than real physical DCs; no one would want only 25% of the data in one DC, as it would not be an available setup. So, if your nodes are grouped into virtual DCs, I recommend grouping them into 4 racks instead and maintaining 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a, b, and c will have 100% of your data (each node in each rack 25%).
There is one particular table in the system that I actually want to stay unique on a per server basis.
i.e. http://server1.example.com/some_stuff.html and http://server2.example.com/some_stuff.html should store and show data unique to that particular server. If one of the server dies, that table and its data goes with it.
I think CQL does not support table-level replication factors (see the available CREATE TABLE options). One alternative is to create a keyspace with replication factor 1:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'<strategy>' [,'<option>':<val>]};
Example: to create a keyspace with SimpleStrategy and a "replication_factor" of 1, you would use this statement:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
then all tables created in that keyspace will have a single copy of each row (no extra replicas).
If you would like a separate table per node, I don't think Cassandra directly supports that. One workaround is to run an additional single-node Cassandra cluster for each server.
I am not sure why you would want to do that, but here is my take on this:
Distribution of your actual data among the nodes in your Cassandra cluster is determined by the row key.
Just setting the replication factor to 1 will not put all the data from one column family/table on one node; the data will still be split and distributed according to your row key.
Exactly where your data is stored is determined by the row key together with the partitioner, as documented here. This is an inherent part of a distributed database, and there is no easy way to force it.
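You can observe this mapping with the token() function; a sketch against a hypothetical table whose row key is server_name:

```sql
-- shows the partitioner token each row key hashes to
SELECT token(server_name), server_name FROM server_stuff;
```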
The only way I can think of to have all the data for one server physically in one table on one node is to:
use one row key per server and create (very) wide rows (maybe using composite column keys),
and trick your token selection so that all row-key tokens map to the node you expect (http://wiki.apache.org/cassandra/Operations#Token_selection).