Requirement:
We have a particular transaction table retail_mapping in the Cassandra keyspace "account".
We have another keyspace "dp" where the exact same table "retail_mapping" and its data need to be replicated and accessed by micro-services.
1) Is there any way we can create a mirror table retail_mapping in the dp keyspace, fed from the account keyspace?
2) Any data persisted in the account keyspace also needs to be copied into the dp keyspace immediately.
For your first question, you could create a snapshot (to be done on each node) and copy the data files into the other table's data directory, followed by a nodetool refresh.
For your second question, the best approach is to handle it at the application layer by writing to both keyspaces. If that's not an option, then your best bet is to look at Cassandra triggers.
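As a minimal sketch of the application-layer approach with the DataStax Java driver (the (id, data) columns are made up for illustration; use the real retail_mapping schema):

import com.datastax.driver.core.*;

public class DualWrite {
    // Hypothetical schema: retail_mapping(id text PRIMARY KEY, data text).
    static void saveRetailMapping(Session session, String id, String data) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(new SimpleStatement(
                "INSERT INTO account.retail_mapping (id, data) VALUES (?, ?)", id, data));
        batch.add(new SimpleStatement(
                "INSERT INTO dp.retail_mapping (id, data) VALUES (?, ?)", id, data));
        session.execute(batch);  // both keyspaces receive the write together
    }
}

A logged batch keeps the two inserts together even if a node fails mid-write, at the cost of a small coordination overhead.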
I am a newbie to Cassandra. I have created a keyspace in Cassandra with NetworkTopologyStrategy and 2 replicas in one datacenter. Is there a CQL command or some other way to view my data in the two replicas?
Like SELECT * FROM tablename in replica1 / replica2.
Is there another way such that I can visually see the data in the two replicas?
Thanks in advance.
Your question isn't entirely clear ("see the data in 2 replicas"), but if you ever want to validate your data, you can run some commands to visually see things.
The first thing you'd want to do is log onto the node you want to investigate. Go to the data directory of the table in question -> DataDir/keyspace/table. In there you'll see one or more files that look like *Data.db. Those are your sstables. Data in memory is flushed to sstables in certain scenarios. You want to be sure your data is flushed from memory to disk if you're validating (as you may not find what you're looking for otherwise). To do that, you issue a "nodetool flush" command (you can use the keyspace and table as parameters if you only want to flush the specific table).
Like I said, after that, everything in memory would be flushed to disk. So you'd be able to see your sstables (again, *Data.db) files. Once you have those sstables, you can run the "sstabledump" command on each sstable to see the data that resides in them, thus validating your data.
If you have only a few rows to validate and a lot of nodes, you can find which nodes the rows reside on by running "nodetool getendpoints" with the keyspace, table, and partition key. That will tell you every node that holds the data, so you're not guessing which node the row(s) should be on. Unfortunately, there is no way to know which sstable the rows exist in (and it could be more than one if updates/deletes, etc. occurred), so you'll have to go through each sstable on the specific node(s).
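If you'd rather do that lookup from client code than with nodetool getendpoints, the Java driver's cluster metadata exposes the same information; a rough sketch assuming a single text partition key and driver 3.x (composite partition keys need more careful serialization):

import java.nio.ByteBuffer;
import com.datastax.driver.core.*;

public class FindReplicas {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        // Serialize the partition key value the same way the driver would.
        ByteBuffer key = TypeCodec.varchar().serialize("my-partition-key",
                ProtocolVersion.NEWEST_SUPPORTED);
        // Prints every node that owns a replica of this partition.
        for (Host host : cluster.getMetadata().getReplicas("mykeyspace", key)) {
            System.out.println("Replica: " + host.getAddress());
        }
        cluster.close();
    }
}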
Hope that helps answer your question?
Good luck.
-Jim
You can for a specific partition. If you are sure host1 is a replica (from nodetool getendpoints or from a query trace), then if you make your query with CL.ONE and send it explicitly to that host, the coordinator will always pick its local replica first. So:
Statement q = new SimpleStatement("SELECT * FROM tablename WHERE key = X");
q.setHost("host1")
Where host1 owns X.
For SELECT * FROM tablename it's a bit harder, because you are reading over the entire data set and the coordinator will send out a query for each part of the ring. If you run it with CL.ONE it will still only go to one node for each part of that range, so if you set q.enableTracing() you can see which node answered for each range. You have no control over which replica the coordinator picks, so it may take a few queries.
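A rough illustration of that tracing approach with the Java driver (keyspace and table names are placeholders):

import com.datastax.driver.core.*;

public class TraceRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        Statement q = new SimpleStatement("SELECT * FROM tablename")
                .setConsistencyLevel(ConsistencyLevel.ONE)
                .enableTracing();
        ResultSet rs = session.execute(q);

        // The coordinator that served the request; the full trace also lists
        // the replicas contacted for each token range.
        System.out.println("Queried host: " + rs.getExecutionInfo().getQueriedHost());
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        System.out.println("Trace coordinator: " + trace.getCoordinator());

        cluster.close();
    }
}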
If you just want to see if there are differences between the replicas, you can use a preview repair: nodetool repair --preview --full.
Is there any cloud storage system (e.g. Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, let's say we have identified hotspot objects in the system; could we increase their replication factor as a solution?
Thanks
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
Then if you use, for example, the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin or a token-aware policy. So if you have multiple replicas of the table holding the objects, the load of accessing an object can be spread round robin over just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, this would spread out the read load.
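A minimal driver-configuration sketch of that idea (contact point is a placeholder):

import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class BalancedReads {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                // Route each request to a replica of the requested row,
                // cycling round robin over those replicas.
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                // CL.ONE lets any single replica answer, spreading read load.
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.ONE))
                .build();
        Session session = cluster.connect();
        // ... run queries ...
        cluster.close();
    }
}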
So the granularity of this is not at the object level, but at the table level. If you had all your objects stored in one table, then changing the replication factor would change it for all objects in that table and not just one. You could have multiple keyspaces with different replication factors and keep high demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. The rows are accessed on nodes by the compound partition key, so one field of the partition key could be a "copy_number" value, and when you go to read the object, you randomly set a copy_number value (from 0 to the number of copy rows you have) so that the load of reading the object will likely hit a different node for each read (since rows are hashed across the cluster based on the partition key). This approach would give you more granularity at the object level compared to changing the replication factor for the whole table, at the cost of more programming work to manage randomly reading different rows.
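A rough sketch of that copy-row trick (keyspace, table, and column names are made up; the table is assumed to have PRIMARY KEY ((object_id, copy_number))):

import java.util.concurrent.ThreadLocalRandom;
import com.datastax.driver.core.*;

public class HotObjectCopies {
    static final int NUM_COPIES = 4;  // how many duplicate rows we keep per object

    // Write the same object under every copy_number so each copy hashes
    // to a (likely) different set of nodes.
    static void writeObject(Session session, String objectId, String payload) {
        for (int copy = 0; copy < NUM_COPIES; copy++) {
            session.execute(new SimpleStatement(
                    "INSERT INTO myks.objects (object_id, copy_number, payload) VALUES (?, ?, ?)",
                    objectId, copy, payload));
        }
    }

    // Read a random copy so repeated reads of a hot object spread across nodes.
    static Row readObject(Session session, String objectId) {
        int copy = ThreadLocalRandom.current().nextInt(NUM_COPIES);
        return session.execute(new SimpleStatement(
                "SELECT payload FROM myks.objects WHERE object_id = ? AND copy_number = ?",
                objectId, copy)).one();
    }
}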
In Infinispan, you can also set the number of owners (replicas) on each cache (the equivalent of Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka the consistent hash table) does not contain all keys but splits the 32-bit hashCode() range into a variable number of segments, and then specifies the distribution only for these segments, there's no way to specify the number of replicas per entry.
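For reference, a minimal sketch of that per-cache setting with Infinispan's programmatic configuration API (exact builder names may vary slightly between versions):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class OwnersConfig {
    static Configuration distributedWithThreeOwners() {
        return new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)  // distributed cache
                    .hash().numOwners(3)             // three copies of every entry
                .build();
    }
}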
Theoretically, with specially forged keys and a custom consistent hash table factory, you could achieve something similar even in one cache (certain sorts of keys would be replicated a different number of times), but that would require coding with a deep understanding of the system.
Anyway, the reader would have to know the number of replicas in advance, as this is part of the routing information (the cache in the simple case, the special keys as described above), so it's not really practical unless the reader can know that.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation uses a master/slave(s) setup, so all reads will go through the master. But there is a special setting available that also allows you to read from backups. So if you have a 10 node cluster and a backup count of 5, there will be 6 members in total that have the information stored. 5 members in the cluster will hit the master, and 5 members in the cluster will hit a backup (since they have the backup locally available).
There is also a fully replicated map available, where every item is sent to every machine. So in a 10 node cluster, all reads will be local since every machine has the same data.
In case of the IMap, we don't provide control on the number of backups on the key/value level. So the whole map is configured with a certain backup-count.
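A minimal Hazelcast configuration sketch of those two options (the map name is a placeholder):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class BackupReads {
    public static void main(String[] args) {
        Config config = new Config();
        MapConfig mapConfig = config.getMapConfig("hotObjects");
        mapConfig.setBackupCount(5);        // 5 synchronous backups per entry
        mapConfig.setReadBackupData(true);  // members may serve reads from their local backup copy
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        // Fully replicated alternative: every member holds every entry.
        // hz.getReplicatedMap("hotObjects");
    }
}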
I am trying to create a table for keeping counters of hits on my APIs. I am using Cassandra 2.0.6, and am aware that there have been some performance improvements to counters starting with 2.1.0, but I can't upgrade at this moment.
The documentation I read on DataStax always starts with creating a separate keyspace, like these:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_counter_t.html
http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use_counter_t.html
From documentation:
Create a keyspace on Linux for use in a single data center, single node cluster. Use the default data center name from the output of the nodetool status command, for example datacenter1.
CREATE KEYSPACE counterks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };
Question:
1) Does it mean that I should keep my counters in a separate keyspace?
2) If yes, should I declare the keyspace as defined in the documentation examples, or is that just an example and I can set my own replication strategy - specifically, replicating across data centers?
Thanks
Sorry you had trouble with the instructions. The instructions need to be changed to make it clear that this is just an example and improved by changing RF to 3, for example.
Using a keyspace for a single data center and single node cluster is not a requirement. You need to keep counters in separate tables, but not in separate keyspaces; however, keeping tables in separate keyspaces gives you the flexibility to change the consistency and replication from table to table. Normally you have one keyspace per application. See the related single vs multiple keyspace discussion at http://grokbase.com/t/cassandra/user/145bwd3va8/effect-of-number-of-keyspaces-on-write-throughput.
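For example, a counter table can live in whatever keyspace (and replication strategy) suits you; a small sketch with a hypothetical keyspace and table (Java driver, assuming the keyspace "myapp" already exists):

import com.datastax.driver.core.*;

public class ApiHitCounters {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Counter columns must live in their own table: every non-key column
        // has to be a counter.
        session.execute("CREATE TABLE IF NOT EXISTS myapp.api_hits ("
                + "api_name text PRIMARY KEY, hits counter)");

        // Counters are only ever incremented/decremented, never INSERTed.
        session.execute(new SimpleStatement(
                "UPDATE myapp.api_hits SET hits = hits + 1 WHERE api_name = ?", "GET /orders"));

        cluster.close();
    }
}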
In Cassandra, can we "fix" the node in which a specific partition key resides to optimize fetches?
This is an optimization for a specific keyspace and table where data written by one data center is never read by clients in a different data center. If a particular partition key will be queried only in a specific data center, is it possible to avoid network delays by "fixing" it to nodes of the same data center where it was written?
In other words, this is a use case where the schema is common across all data centers, but the data is never accessed across data centers. One way of doing this is to make the data center id the partition key. However, a specific data center's data need/should not be placed in other data centers. Can we optimize by somehow telling Cassandra the partition-key-to-data-center mapping?
Is a custom Partitioner the solution for this kind of use case?
You should be able to use Cassandra's "datacenter awareness" to solve this. You won't be able to get it to enforce that awareness at the row level, but you can do it at the keyspace level. So if you have certain keyspaces that you know will be accessed only by certain localities (and served by specific datacenters) you can configure your keyspace to replicate accordingly.
In the cassandra-topology.properties file you can define which of your nodes is in which rack and datacenter. Then, make sure that you are using a snitch (in your cassandra.yaml) that will respect the topology entries (ex: PropertyFileSnitch).
Then when you create your keyspace, you can define the replication factor on a per-datacenter basis:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
To get your client applications to only access certain datacenters, you can specify a LOCAL read consistency (ex: LOCAL_ONE or LOCAL_QUORUM). This way, your client apps in one area will only read from a particular datacenter.
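A small driver-side sketch of that (DC name and contact point are placeholders):

import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class LocalDcClient {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                // Only route requests to nodes in this client's own datacenter.
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc("dc1").build()))
                // LOCAL_QUORUM / LOCAL_ONE never wait on remote-DC replicas.
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
        Session session = cluster.connect();
        // ... reads and writes stay within dc1 ...
        cluster.close();
    }
}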
"a specific data center's data need/should not be placed in other data centers."
While this solution won't solve this part of your question, unless you have disk space concerns (which in this day and age, you shouldn't) having extra replicas of your data can save you in an emergency. If you should lose one or all nodes in a particular datacenter and have to rebuild them, a cluster-wide repair will restore your data. Otherwise if keeping the data separate is really that important, you may want to look into splitting the datacenters into separate clusters.
Cassandra determines which node stores a row using a partitioner strategy. Normally you use a partitioner, such as the Murmur3 partitioner, that distributes rows effectively randomly and thus uniformly. You can write and use your own partitioner, in Java. That said, you should be cautious about doing this. Do you really want to assign a row to a specific node?
Data is too voluminous to be replicated across all data centers. Hence I am resorting to creating a keyspace per data center.
CREATE KEYSPACE "MyLocalData_dc1"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1};
CREATE KEYSPACE "MyLocalData_dc2"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc2' : 3, 'dc3' : 1};
Data centers that are not listed in the replication map get no replicas, so there is no need to spell out 'dc3' : 0 or 'dc4' : 0. This way, MyLocalData generated by data center 1 has one backup in data center 2, and data generated by data center 2 is backed up in data center 3. Data is "fixed" in the data center it is written in and accessed from, and network latencies are avoided.
There is one particular table in the system that I actually want to stay unique on a per server basis.
i.e. http://server1.example.com/some_stuff.html and http://server2.example.com/some_stuff.html should store and show data unique to that particular server. If one of the servers dies, that table and its data goes with it.
I think CQL does not support table-level replication factors (see the available CREATE TABLE options). One alternative is to create a keyspace with a replication factor of 1:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'<strategy>' [,'<option>':<val>]};
Example:
To create a keyspace with SimpleStrategy and a "replication_factor" option with a value of "1", you would use this statement:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
then all tables created in that keyspace will have only a single copy of their data, i.e. no replication.
If you would like to have a table for each node, then I think Cassandra does not directly support that. One work-around is to start an additional Cassandra cluster for each node, where each cluster has only one node.
I am not sure why you would want to do that, but here is my take on this:
The distribution of your actual data among the nodes in your Cassandra cluster is determined by the row key.
Just setting the replication factor to 1 will not put all data from one column family/table on 1 node. The data will still be split/distributed according to your row key.
Exactly WHERE your data will be stored is determined by the row key along with the partitioner as documented here. This is an inherent part of a DDBS and there is no easy way to force this.
The only way I can think of to have all the data for one server physically in one table on one node is:
use one row key per server and create (very) wide rows (maybe using composite column keys) - see the sketch after this list,
and trick your token selection so that all row key tokens map to the node you are expecting (http://wiki.apache.org/cassandra/Operations#Token_selection).
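To make the first point concrete, here is a sketch of a per-server wide partition (keyspace, table, and column names are invented; on its own this only groups a server's data under one partition key - it does not pin that partition to a particular node):

import com.datastax.driver.core.*;

public class PerServerRows {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // One partition per server: server_name is the partition key and
        // page_path the clustering column, so all of a server's rows form
        // one wide partition that lives together on the owning replica(s).
        session.execute("CREATE TABLE IF NOT EXISTS myks.server_stuff ("
                + "server_name text, page_path text, content text, "
                + "PRIMARY KEY (server_name, page_path))");

        session.execute(new SimpleStatement(
                "INSERT INTO myks.server_stuff (server_name, page_path, content) VALUES (?, ?, ?)",
                "server1.example.com", "/some_stuff.html", "data unique to server1"));

        cluster.close();
    }
}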