making cassandra store data on a local node - cassandra

What is a simple way of configuring a Cassandra cluster so that if I try to store a key in it, it will be stored on the local node to which I issue the set/write command?
I am looking at the IPartitioner, which allows me to specify how the key will be hashed, but it seems a bit heavyweight for something like the above.
Thanks!

If you were able to arbitrarily write keys to arbitrary nodes, then on lookup the system would not know where the data for that key lived. The system would have to do a full cluster lookup which would be super slow.
By design, Cassandra spreads the data around in a known way so that lookups are quick.
Check out this post by Jonathan Ellis, the primary maintainer of Cassandra.
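
To make the "known way" concrete, here is a minimal sketch of a token ring in Python. It is not Cassandra's implementation: MD5 stands in for the real partitioner hash, and the node names and token values are made up for illustration. The point is only that any node can compute the owner of a key from the key alone, without asking the rest of the cluster.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a ring position (MD5 is a stand-in for the real partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 0 .. 2**64 - 1


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; each node owns the range ending at its token.
        self._entries = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner(self, key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self._tokens, token_for(key))
        return self._entries[idx % len(self._entries)][1]


# Hypothetical three-node ring.
ring = Ring({"node-a": 2**62, "node-b": 2**63, "node-c": 2**64 - 1})
print(ring.owner("user:42"))
```

Because the owner is a pure function of the key and the ring layout, a read issued to any node can be routed straight to the right place. Writing to "whatever node I happen to be talking to" would break that property.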

Related

Inject Custom Sharding in Cassandra or Couchbase

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, I might want to pin data to shards by one of the data's properties.
Couchbase hashes the key of the document to decide which shard (vBucket) the document should be associated with. The SDK uses the same algorithm to find out in which shard the document is located when you want to retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
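
As a rough illustration of the idea (not the exact algorithm the Couchbase SDKs use), a key can be mapped to a vBucket by hashing it and reducing the hash modulo the number of vBuckets. The CRC32-then-modulo step and the count of 1024 are simplifying assumptions here.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed bucket count for this sketch


def vbucket_for(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Simplified key-to-vBucket mapping: hash the key, then reduce modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


# Client and server compute the same bucket from the key alone.
print(vbucket_for("user::1001"))
```

Since the mapping depends only on the key, the way to influence placement is through the key you choose, not through a pluggable algorithm.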
Cassandra decides where the data goes by the partition key. So if you use the property you want to "pin" by as the partition key, I think it will accomplish what you're asking for. However, you don't pick the replicas explicitly, and they can change as hosts are removed from and added to the cluster.

Is Cassandra really suitable for storing log messages?

I am looking for a NoSQL database to store firewall traffic log messages from thousands of firewall devices. I hope the NoSQL database can achieve 100% uptime with built-in support for multiple data centers; Cassandra seems a good choice to me.
The firewall log messages I would like to store in Cassandra look like this:
date=2014-07-04 time=14:26:59 type=traffic subtype=local
level=notice vd=vdom1 srcip=10.6.30.254 srcport=54705 srcintf="mgmt1"
dstip=10.6.30.1 dstport=80 dstintf="vdom1" sessionid=350696 status=close
policyid=0 dstcountry="Reserved" srccountry="Reserved" trandisp=noop service=HTTP
When I tried to create a table (column family) with columns corresponding to the key-value pairs in the log message above, I found it hard to define the table's primary key/composite key, because hundreds of log messages similar to the example can be generated within the same second!
In order to uniquely identify each row, the primary key would probably need to include almost all the columns, which feels odd to me.
Is Cassandra really good for storing time-series log messages, or should I consider another NoSQL database like MongoDB? Thanks.
Regards
Ro
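
One commonly suggested way around the "primary key needs every column" problem is to partition by device and day and add a generated unique ID as the last key component. The Python sketch below only illustrates how such a key could be derived from a log line like the one above. The `device_id` value, the day-sized bucket, and the use of a random UUID as tiebreaker are assumptions, not something the log format itself provides.

```python
import shlex
import uuid
from datetime import datetime


def parse_log_line(line: str) -> dict:
    """Split key=value pairs; shlex keeps quoted values together and strips the quotes."""
    return dict(token.split("=", 1) for token in shlex.split(line))


def row_key(device_id: str, fields: dict):
    """Build (partition key, clustering key) for one log message."""
    ts = datetime.strptime(f"{fields['date']} {fields['time']}", "%Y-%m-%d %H:%M:%S")
    partition = (device_id, ts.strftime("%Y-%m-%d"))  # one partition per device per day
    clustering = (ts, uuid.uuid4())                   # UUID separates rows in the same second
    return partition, clustering


line = ('date=2014-07-04 time=14:26:59 type=traffic subtype=local '
        'srcip=10.6.30.254 srcport=54705 srcintf="mgmt1" dstcountry="Reserved"')
fields = parse_log_line(line)
print(row_key("fw-0001", fields))  # "fw-0001" is a hypothetical device identifier
```

With a key shaped like this, two messages from the same device in the same second share a partition and a timestamp but still get distinct rows.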

Force Cassandra to store particular key values on a specific node

How can I use the ByteOrderedPartitioner (BOP) to force specific key values to be partitioned according to a custom requirement? I want to force Cassandra to partition and replicate data according to custom requirements. Without introducing a custom partitioner, how far can I control this behavior, and how?
Overall: I want my data starting with a particular ID to be on a predefined node because I know the data will be accessed heavily from that node. I would also like the data to be replicated to nearby nodes.
I want my data starting with particular ID to be at a predefined node because I know data will be accessed from that node heavily.
It looks like you are talking about the data locality problem, which is really important in big-data computations (Spark, Hadoop, etc.). The general approach there isn't to pin data to a specific node, though, but to move your whole computation to the data itself.
Pinning data to a specific node may cause problems like:
what should you do if your node goes down?
how evenly will the data be distributed across the cluster? Will there be any hotspots/bottlenecks because of node over- or under-utilization?
how can you scale your cluster in the future?
Moving computation to the data has no issues with these questions, but the approach you are going to choose does.
Found the answer here...
http://www.mail-archive.com/user%40cassandra.apache.org/msg14997.html
By changing the "initial_token" setting in the cassandra.yaml file, we can divide the nodes into key ranges. The partitioner then chooses the node that stores the first replica of the data, and the SimpleStrategy replication strategy places the additional replicas on the following nodes in the ring. So by arranging the nodes the way you want, you can exploit the replication strategy.
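
The placement described in that answer can be sketched in a few lines of Python. This is a toy model, not Cassandra code: it assumes a byte-ordered ring where the key bytes themselves are the token, where each node owns the range ending at its initial token, and where additional replicas go to the next nodes clockwise, as SimpleStrategy does. Node names and tokens are invented.

```python
import bisect


class ByteOrderedRing:
    def __init__(self, initial_tokens):
        # initial_tokens: {node_name: token_bytes}, playing the role of
        # initial_token in each node's cassandra.yaml.
        self._entries = sorted((tok, name) for name, tok in initial_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def replicas(self, key: bytes, replication_factor: int):
        # First replica: first node whose token is >= the key (wrapping around).
        start = bisect.bisect_left(self._tokens, key)
        n = len(self._entries)
        count = min(replication_factor, n)
        # Further replicas: the following nodes on the ring.
        return [self._entries[(start + i) % n][1] for i in range(count)]


ring = ByteOrderedRing({"node-1": b"g", "node-2": b"n", "node-3": b"z"})
print(ring.replicas(b"eu:1001", 2))  # keys starting with "eu" sort before b"g"
print(ring.replicas(b"us:7", 2))     # keys starting with "us" fall between b"n" and b"z"
```

In this model, choosing tokens so that a given key prefix falls inside one node's range is what "pins" that prefix to the node, and the ring order decides which neighbours hold the copies. The trade-off is the one raised earlier: ordered placement makes hotspots easy to create.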

Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
Currently I am trying to find a strategy that would allow me to migrate data from an existing PostgreSQL database into a Cassandra cluster. The primary key in PostgreSQL is a decimal value with 25 digits. When I migrate the data, it would be nice if I could keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found out that a good practice is to use UUIDs in Cassandra. So now I have two possible solutions to my problem:
I can either create a transformation rule that would convert my current decimal primary keys from the PostgreSQL database into UUIDs for Cassandra. Every time someone requests access to some of the old data, I would have to reapply the transformation rule to the key and use the UUID to search for the data in Cassandra. The transformation would happen in an application server that manages all communication with Cassandra (so no client will talk to Cassandra directly). New data added to Cassandra would of course be stored with a UUID.
The other solution, which I have already implemented in Java, is to use a decimal value as the partition key in Cassandra. Since it is possible that multiple application servers will talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys from PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all nodes should also be fine. But I wonder whether there is any benefit in using a UUID over a decimal value as the partition key, or vice versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot determine whether either data type is to be preferred. I am using the Murmur3Partitioner for Cassandra, if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.
There are two benefits of UUIDs that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems since you often have multiple clients wanting to insert data with unique keys. In RDBMS we had the luxury of auto-incrementing fields to give uniqueness since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUIDs are fairly efficient in terms of storage: they have a fixed size of 16 bytes (128 bits).
As long as your old decimal values are unique, you should be able to use them as partition keys.
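
For what it's worth, the transformation between the two representations is lossless in both directions, because a 25-digit decimal fits easily into 128 bits. Here is a minimal Python sketch of such a rule (the asker's implementation is in Java; the function names are made up). Note that a UUID built this way carries no meaningful version field, and whether that matters for a particular driver or schema is not checked here.

```python
import uuid


def decimal_to_uuid(value: int) -> uuid.UUID:
    """Embed a non-negative integer key in the 128 bits of a UUID."""
    if not 0 <= value < 2**128:
        raise ValueError("value does not fit into 128 bits")
    return uuid.UUID(int=value)


def uuid_to_decimal(key: uuid.UUID) -> int:
    """Recover the integer value from a UUID."""
    return key.int


legacy_key = 1234567890123456789012345  # a 25-digit example key
as_uuid = decimal_to_uuid(legacy_key)
print(as_uuid, len(as_uuid.bytes))
print(uuid_to_decimal(as_uuid) == legacy_key)
```

The reverse direction, turning a freshly generated UUID into a decimal, can yield up to 39 digits, so new keys would be longer than the legacy 25-digit ones.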

Storing binary blobs in Cassandra

I am building a simple HTTP service that stores arbitrary binary objects. The service is backed by Cassandra. It is a simplified version of Amazon's S3. The system must withstand a heavy write load and should be highly available on the write and read paths.
The stored data is kind of immutable. It can be deleted, but it cannot be updated. Therefore, data inconsistency is not an issue. The datastore must be able to efficiently expire old data.
The service uses Netflix's Astyanax library, which provides a recipe for storing (large) binary objects in Cassandra.
I see two solutions to the problem, both of which have pros and cons. It is hard for me to estimate which one fits Cassandra better.
Single table with TTL
Astyanax automatically chunks large objects into small pieces and stores them in a single table. A TTL is assigned to each blob to expire it after a certain period of time. A compaction run removes blobs once their TTL has expired.
This solution works and is pretty straightforward to implement. I started out using the SizeTieredCompactionStrategy, but I think that DateTieredCompactionStrategy might be the better choice when dealing with TTL data.
My main concern is: can Cassandra's compaction keep up? Has anyone experience with a similar use case?
Sharding data by time
Another approach would be to shard the data by time. I could create a table for each day and store the chunks in that table. In this case I can drop the complete table to get rid of the expired data.
This solution requires a little more effort in the implementation, but simplifies and probably speeds up the deletion of expired data.
How performant is Cassandra in dropping a table?
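
The bookkeeping for the per-day approach is small. The Python sketch below shows one way to derive the table name for a given day and to pick out the tables that have aged past the retention period. The `blobs_` prefix and the function names are hypothetical, and actually creating or dropping the tables is left out.

```python
from datetime import date, timedelta

TABLE_PREFIX = "blobs_"  # hypothetical naming scheme: blobs_YYYYMMDD


def table_for(day: date) -> str:
    """Name of the table that holds chunks written on the given day."""
    return f"{TABLE_PREFIX}{day:%Y%m%d}"


def expired_tables(existing, today: date, retention_days: int):
    """Tables older than the retention period; these could be dropped wholesale."""
    cutoff = table_for(today - timedelta(days=retention_days))
    # YYYYMMDD names sort chronologically, so a string comparison is enough.
    return sorted(t for t in existing if t.startswith(TABLE_PREFIX) and t < cutoff)


existing = ["blobs_20140701", "blobs_20140703", "blobs_20140704"]
print(table_for(date(2014, 7, 4)))
print(expired_tables(existing, date(2014, 7, 4), retention_days=2))
```

Reads then need the write date (or a scan over the retained days) to find the right table, which is part of the extra implementation effort mentioned above.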
The correct option for your scenario is to use DateTieredCompactionStrategy and assign a TTL to each blob.
Refer:
http://www.datastax.com/dev/blog/datetieredcompactionstrategy
