Representing Network Drive as Cassandra Data Directory - cassandra

I am new to Cassandra. In Cassandra, in order to store data we specify a local directory on the machine where Cassandra is installed, using the data_file_directories property in the cassandra.yaml configuration file. My need is to define data_file_directories as a network directory (something like 192.x.x.x/data/files/). I am using only a single-node cluster for rapid data writes (for logging activities). As I don't rely on replication, my replication factor is 1. Can anyone help with defining a network directory as the Cassandra data directory?
Thanks in advance.

1) I have stored Cassandra's data on an Amazon EBS volume (a network volume), but in the EC2 case this is simple because you can mount an EBS volume on a machine as if it were a local disk.
2) In other cases you will have to use NFS to configure the network directory. I have never done this, but it looks straightforward.
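To make the NFS suggestion concrete: Cassandra cannot open a remote path like 192.x.x.x/data/files directly; the OS has to expose the share as a local directory first. A minimal sketch of cassandra.yaml, assuming the share has already been NFS-mounted at a hypothetical local mount point such as /mnt/cassandra-nfs:

```yaml
# cassandra.yaml - minimal sketch, assuming the network share
# (e.g. the 192.x.x.x export from the question) is already NFS-mounted
# at the hypothetical local mount point /mnt/cassandra-nfs
data_file_directories:
    - /mnt/cassandra-nfs/data
commitlog_directory: /mnt/cassandra-nfs/commitlog
saved_caches_directory: /mnt/cassandra-nfs/saved_caches
```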

Cassandra is firmly designed around using local storage rather than EBS or other network-mounted storage. This gives you better performance, better reliability, and better cost-effectiveness.

Related

Local Persistent Volume for Cassandra Hosted in Kubernetes

We are trying to deploy Cassandra within Kubernetes. Thinking about storage and how to make it as fast as possible at each datacenter, without the expense of implementing network-attached storage at each datacenter, it seems reasonable to use a Local Persistent Volume at each datacenter and let Cassandra handle the cross-datacenter replication.
Am I thinking about this problem correctly? Is there a better way to implement Cassandra in each of our datacenters so that our application runs its fastest by connecting to the nearest datacenter?
@Simon Fontana Oscarsson is right.
I just want to add a bit more detail about that feature for people who find this question, because it is a common case.
Local Persistent Volumes are available only from Kubernetes 1.7 in alpha and from 1.10 in beta.
They require pre-configured LVM on the nodes, and that should be done before you use them.
You can find configuration examples here.
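For readers landing here, a minimal sketch of what such a configuration might look like; the storage class name, node name, path, and capacity below are hypothetical placeholders, not values from the question:

```yaml
# StorageClass that delays binding until a pod is scheduled,
# so a PV on the right node gets chosen.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# A local PersistentVolume pinned to a specific node and disk path
# (hypothetical node name and mount point).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cassandra-data-node1
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1
```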

Cassandra directory setup of replica datacenter

I have a Cassandra cluster and plan to add a new datacenter to replicate data. There will be no writes on this node, only reads.
My questions are:
in this case is it still recommended to have separate drives for commit log and data?
if I know that my cluster will receive data only via hints (and lots of them), should I create a separate disk for the hints? I did not find any mention of this.
in this case is it still recommended to have separate drives for commit log and data?
So the whole idea of putting your commitlog on a separate mount point goes back to spinning disks being a chokepoint for I/O. If your cluster/nodes are backed by SSDs, you shouldn't need to do that.
if I know that my cluster will receive data only via hints (and lots of them), should I create a separate disk for the hints?
Hints only build up when a node is down. When your writes happen, the Snitch handles propagation to all of the required replicas. So no, I wouldn't worry about putting your hints directory on a separate mount point, either.
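For completeness, if you did decide to split them, these are the cassandra.yaml properties involved (hints_directory exists in Cassandra 3.0 and later); the mount points below are hypothetical:

```yaml
# cassandra.yaml - hypothetical mount points; only worth doing when the
# paths sit on genuinely separate physical devices (e.g. spinning disks)
data_file_directories:
    - /mnt/data-disk/cassandra/data
commitlog_directory: /mnt/commitlog-disk/cassandra/commitlog
hints_directory: /mnt/data-disk/cassandra/hints
```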

Local disk configuration in Spark

Hi, the official Spark documentation states:
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn’t fit in RAM, as well as to preserve intermediate output between stages. We recommend having 4-8 disks per node, configured without RAID (just as separate mount points). In Linux, mount the disks with the noatime option to reduce unnecessary writes. In Spark, configure the spark.local.dir variable to be a comma-separated list of the local disks. If you are running HDFS, it’s fine to use the same disks as HDFS.
I wonder what the purpose of 4-8 disks per node is. Is it for parallel writes? I am not sure I understand the reason, as it is not explained.
I also have no clue about this: "If you are running HDFS, it’s fine to use the same disks as HDFS."
Any idea what is meant here?
The purpose of RAID would be to mirror partitions, adding redundancy to prevent data loss in the case of a hardware fault. The documentation recommends 4-8 plain disks (no RAID) so that Spark can spread its scratch I/O across separate mount points in parallel. In the case of HDFS, the redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes, which is also why sharing the same disks with HDFS is fine.
Reference
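To make the spark.local.dir setting from the quoted documentation concrete, a minimal sketch of a spark-defaults.conf entry might look like this; the mount points are hypothetical, one directory per physical disk, each mounted with noatime:

```
# spark-defaults.conf - hypothetical example: one scratch directory per
# physical disk, given as a comma-separated list
spark.local.dir  /mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp,/mnt/disk3/spark-tmp,/mnt/disk4/spark-tmp
```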

Cassandra store Keyspace to new Disk

I just set up a fresh Windows server with a fresh DataStax installation, including Cassandra 1.2 and OpsCenter 2.1.3. I've tried finding solutions to these questions on the Cassandra wikis and the DataStax website, but I can only find Unix-specific information or DataStax API information.
Cassandra defaulted to using the C: drive (I was never asked to select a drive for Cassandra during install).
In the same Cassandra instance, can I have keyspaces on separate disks?
If not, how do I migrate the existing keyspace to the new drive? (Just reconfiguring cassandra.yaml to use a new directory would lose my OpsCenter data and may even break OpsCenter.)
If yes, how can I create a new keyspace on a separate drive? cassandra.yaml seems to have configuration options for only a single storage location.
Should I be creating a new cluster to store my data in? If I start adding new nodes to the default cluster, that will mean the DataStax OpsCenter data will be getting replicated - that seems like a bad idea.
If there is good documentation on this somewhere, please point me there.
Thanks,
Adam
You cannot get cassandra to split the keyspaces and store them in different directories. They are all stored under a common data directory that is specified in the cassandra.yaml file.
However, you can use NTFS mount points to mount different drives under the data directory on your server, but this will not be simple or easily expandable.
If you want to move where Cassandra stores its data, stop the Cassandra daemon/service, change cassandra.yaml to store the data at the new location, then copy/move the entire contents of the data directory to that new location. THEN start Cassandra back up and it will work fine with the data in the new location. I have done this quite a few times now and Cassandra comes back up without incident and with no lost data (if you do not move the data, it will "lose" it all and recreate the directory structure under the new location).
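As an illustration of that reconfiguration step on Windows, the cassandra.yaml change might look like the sketch below; the D: paths are hypothetical, and the old directory contents are copied over while Cassandra is stopped:

```yaml
# cassandra.yaml - hypothetical new locations on the D: drive;
# copy the existing data here before restarting the service
data_file_directories:
    - D:\cassandra\data
commitlog_directory: D:\cassandra\commitlog
saved_caches_directory: D:\cassandra\saved_caches
```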
Data getting replicated is not a bad thing - it is what cassandra was designed for. I don't know what replication factor opscenter uses, but it does not store a massive amount of data so replication is not a problem.

Dynamically adding new nodes in Cassandra

Is it possible to add new hosts to a Cassandra cluster dynamically?
What I'm trying to do is set up a program that can:
Set up a local version of the database for each user
Each user's machine will become part of the cluster (the machines will be hosts)
Data will be replicated across the cluster
Building a cluster of multiple hosts usually entails configuring cassandra.yaml to store the seeds, listen_address and rpc_address of each host.
My idea is to edit these files through Java and insert the new host addresses as required, but making sure that the data is accurate across each user's cassandra.yaml file would be challenging.
I'm wondering if someone has done something similar or has any advice on a better way to achieve this.
Yes, it is possible. Look at Netflix's Priam for a complete example of dynamic Cassandra cluster management (though it is designed to work with Amazon EC2).
For rpc_address and listen_address, you can set up a startup script that configures cassandra.yaml if it is not correct.
For seeds you can configure a custom seed provider. Look at the seed provider used by Netflix's Priam for some ideas on how to implement it.
The most difficult part will be managing the tokens assigned to each node in an efficient way. Cassandra 1.2 is around the corner and will include a feature called virtual nodes that, IMO, will work well in your case. See the Acunu presentation about it.
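For reference, a sketch of the per-node cassandra.yaml settings the answer is talking about; the cluster name and addresses below are hypothetical placeholders that a startup script or custom seed provider would fill in:

```yaml
# cassandra.yaml - per-node settings (hypothetical cluster name and addresses)
cluster_name: 'MyCluster'
listen_address: 192.168.1.21     # this host's own address
rpc_address: 192.168.1.21        # address clients connect to
seed_provider:
    # SimpleSeedProvider is the default; a Priam-style setup would swap in
    # its own implementation here
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.10,192.168.1.11"
```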
