Multiple data file directories with different sizes - cassandra

Can you configure multiple data_file_directories with different sizes, and will Cassandra manage them without issue?
For example:
data_file_directories:
- /data1/cassandra/data
- /data4/cassandra/data
where data1 is 450 GB and data4 is 1 TB.
Thank you!

Different sizes will not be an issue; however, there are some gotchas when it comes to using multiple directories for data_file_directories. I'd encourage you to read through TLP's blog post that highlights most of them.
The summary is that there are bugs that affect the use of multiple data directories in older versions of Cassandra. With more modern versions there are still issues where removing one of the data directories can render the node unrecognisable to the rest of the cluster, although this is not really a bug: you are removing the system tables that store cluster membership information.

It should be fine. Just list all the directories in the cassandra.yaml file and restart the node if needed.
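For example, a quick sanity check and restart might look like this (the config path and service name below are common defaults and may differ on your install):
grep -A 3 '^data_file_directories' /etc/cassandra/cassandra.yaml
sudo systemctl restart cassandra    # or: sudo service cassandra restart
nodetool status                     # confirm the node comes back Up/Normal (UN)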

Related

Cassandra - copying sstable snapshot from one cluster to another

I know there are several similar questions out there, but I'm still confused about this. As there is a need for this mechanism (copying data from one cluster to another), I'm looking for a little clarification.
Let's assume a very simple scenario. I want to copy a table from one Cassandra cluster (C1) to another (C2). The table I'm copying is called "item".
Let's assume the node count of each cluster is the same (4 nodes each in the source and target clusters). Not sure whether that matters or not.
I'm attempting to use snapshots and sstableloader to do the trick. I have been able to create a snapshot and copy the snapshot files from C1:N1 (cluster 1 node 1: .../myspace/item-xxxxxx/snapshot/######) to the target table directory on C2:N1 (cluster 2 node 1: .../myspace/item-xxxxxx). I used sstableloader to load the data and ran nodetool repair. Perfect. The only problem is that, as the loaded snapshot was only from one of the source nodes, I only "restored" part of the data (about 485 of the 1k rows).
So I'm thinking I'll copy the snapshot from C1:N2 to C2:N1 again and load it up. The problem is that all of the table files already exist on C2:N1. If I copy the snapshot files from C1:N2 to the table directory on C2:N1, I'll blow away the files that are already there. I didn't check all 4 target nodes, but I did check node 2 of the target, and the item table directory already existed there too with data files. I'm guessing all of the nodes on the target have data files, so I'm stuck with how to sstableload the other 3 source node snapshot files.
So long story short (if that's possible):
How am I supposed to load multiple source snapshot files (one from each host on the source cluster) into a target cluster? And to complicate matters, will it matter if the source and target clusters have a different number of nodes? (I would think that having fewer nodes on the target would potentially be a bigger problem.)
What is really needed here, in my opinion, is a way to run the sstableloader on the SOURCE cluster and have it stream the data to a target cluster. That would make life a lot easier, I would think.
Thanks in advance.
-Jim
There are two options for bulk loading, and it seems you may have them semi-merged together. You are mostly referring to the "copy the sstables" mechanism, which is pretty manual and may not be worth the trouble unless performance of the restore is the top priority. Using sstableloader is different and doesn't require that.
The sstableloader tool will connect to a node, find all the nodes in that node's cluster, and use the connection to build metadata/discovery. It will split and stream the sstables you select to the target cluster in the appropriate token ranges (you won't need the repair). You can run sstableloader from the source cluster's nodes and point it at the destination cluster; you don't need to copy the sstables over yourself (although if the clusters are in different DCs, copying first may be a bit faster).
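As a rough sketch of that flow (the keyspace, table, paths, snapshot tag and destination addresses below are made up for illustration; check sstableloader --help on your version):
# 1. On each source node, snapshot the keyspace:
nodetool snapshot -t clone_item myspace
# 2. sstableloader infers keyspace/table from the last two path components,
#    so stage the snapshot files under a <keyspace>/<table> directory:
mkdir -p /tmp/load/myspace/item
cp /var/lib/cassandra/data/myspace/item-*/snapshots/clone_item/* /tmp/load/myspace/item/
# 3. Stream straight to the destination cluster's contact points:
sstableloader -d 10.0.0.21,10.0.0.22 /tmp/load/myspace/item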
If you have OpsCenter, these steps can be automated for you through a GUI: https://docs.datastax.com/en/opscenter/5.2/opsc/online_help/services/opscBackupCloneCluster.html

copy command row size limit in cassandra

Could anyone tell me the maximum size (number of rows or file size) of a CSV file we can load efficiently into Cassandra using the COPY command? Is there a limit for it? If so, is it a good idea to break the file down into multiple smaller files and load them, or do we have a better option? Many thanks.
I've run into this issue before... At least for me, there was no clear statement in any DataStax or Apache documentation of the maximum size. Basically, it may just be limited by your PC/server/cluster resources (e.g. CPU and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for Cassandra 1.2 here it's stated that you can import a few million rows and that you should use the bulk loader for heavier loads.
All in all, I do suggest importing via multiple CSV files (just don't make them so small that you're constantly opening and closing files) so that you can keep a handle on the data being imported and find errors more easily. It can happen that after waiting an hour for a file to load, it fails and you have to start over, whereas with multiple files you don't need to redo the ones that have already been imported successfully. Not to mention duplicate-key errors.
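A hedged sketch of that split-and-load approach (keyspace, table, column and file names are made up; COPY options vary a bit between Cassandra versions):
split -l 1000000 -d items.csv items_part_    # ~1M rows per chunk
for f in items_part_*; do
    cqlsh -e "COPY myks.items (id, name, price) FROM '$f' WITH HEADER = false"
done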
Check out CASSANDRA-9303 and CASSANDRA-9302,
and check out Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader

memsql how to free space from plancachedir

I have a very small MemSQL instance which has about 200 tables and 200MB of data in total. The plancachedir keeps filling up the file system (25GB+). I tried shutting down the databases and deleting the files under plancachedir, but after restarting the database, all the files came back. "show plancache" shows 0 entries, so there are no plans to be deleted.
Would anyone let me know the best way to manage the plancachedir space consumption?
Thanks in advance.
So, if you are comfortable turning off your machine and deleting the plancache directory, try running SNAPSHOT <db_name> on each database before turning off the server and deleting the plancache.
Otherwise, queries will be recompiled for every write query and ALTER TABLE you ran during recovery.
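A minimal sketch of that step, assuming the node is reachable with a MySQL-compatible client on MemSQL's default port (the database names are placeholders):
for db in db1 db2; do
    mysql -h 127.0.0.1 -P 3306 -u root -e "SNAPSHOT $db"
done
# then stop the server and clear the plancache directory as described above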
25 Gigs is a lot though...
To be honest, MemSQL is not optimized for the case of many 1-meg tables...
Depending on your use case, it might be worth investigating our JSON datatype, or rethinking your schemas.

How does cassandra split keyspace data when multiple directories are configured?

I have configured two separate data directories in the cassandra.yaml file as given below:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
When I create a keyspace and insert data, the keyspace gets created in both directories and the data gets scattered between them. What I want to know is: how does Cassandra split the data between multiple directories? And what is the rule behind this?
You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread over the configured drives in proportion to their available space.
This also lets you take advantage of the disk_failure_policy setting. You can read about the details here:
http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2
In short, you can configure Cassandra to keep going, doing what it can, if a disk becomes full or fails completely. This has advantages over RAID0 (where you would effectively have the same capacity as JBOD) in that you do not have to restore the whole data set from backup (or run a full repair) but can just run a repair for the missing data. On the other hand, RAID0 provides higher throughput (depending on how well you know how to tune RAID arrays to match filesystem and drive geometry).
If you have the resources for a fault-tolerant, more performant RAID setup (RAID10, for example), you may want to just use a single directory for simplicity. Most deployments are leaning towards the density route, though, using JBOD rather than systems-level tolerance.
You can read about the thought process behind the development of this issue here:
https://issues.apache.org/jira/browse/CASSANDRA-4292
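For reference, a minimal sketch of the JBOD settings discussed above as they might look on a Linux install (the paths, the policy value, and the keyspace name are illustrative assumptions, not the poster's actual setup), plus a quick way to see how much data has landed in each directory:
# cassandra.yaml fragment:
#   data_file_directories:
#       - /mnt/disk1/cassandra/data
#       - /mnt/disk2/cassandra/data
#   disk_failure_policy: best_effort    # keep serving from the healthy disks
#
# Rough check of how a keyspace's SSTables are spread across the directories:
du -sh /mnt/disk1/cassandra/data/my_keyspace /mnt/disk2/cassandra/data/my_keyspace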
I am able to guess somewhat how the keyspace is split between multiple data directories: based on the maximum available space and the load on the directories, SSTables of the same column family get written to different data directories.

Multiple applications using copies of a directory on a SAN

I have an application (Endeca) that is a file-based search engine. A customer has 100 Linux servers, all attached to the same SAN (very fast, fiber-channel). Currently, each of those 100 servers uses the same set of files, but each server has its own copy of the index (approx 4 gigs, thus 400 gigs in total).
What I would like to do is have one directory and 100 virtual copies of that directory. If the application needs to make changes to any of the files in that directory, only then would it start creating a distinct copy of the original folder.
So my idea is this: all 100 servers start using the same directory (but they each think they have their own copy, and don't know any better). As changes come in, Linux/the SAN would then potentially have up to 100 copies (now slightly different) of that original.
Is something like this possible?
The reason I'm investigating this approach would be to reduce file transfer times and disk space. We would only have to copy the 4 gig index files once to the SAN and create virtual copies. If no changes came in, we'd only use 4 gigs instead of 400.
Thanks in advance!
The best solution here is to utilise the "de-dupe" functionality at the SAN level. Different vendors may call it different things, but this is what I am talking about:
https://communities.netapp.com/community/netapp-blogs/drdedupe/blog/2010/04/07/how-netapp-deduplication-works--a-primer
All 100 "virtual" copies will utilise the same physical disk blocks on the SAN. The SAN will only need to allocate new blocks if changes are made to a specific copy of a file. Then a new block will be allocated for that copy, but the remaining 99 copies will keep using the old block, thus dramatically reducing the disk space requirements.
What version of Endeca are you using? The MDEX 7 engine has clustering ability where the leader and follower nodes all read from the same set of files, so as long as the files are shared (say, over NAS) you can have multiple engines running on different machines backed by the same set of index files. Only the leader node will have the ability to change the files, keeping the changes consistent; the follower nodes will then be notified by the cluster coordinator when the changes are ready to be "picked up".
In the MDEX 6 series you could probably achieve something similar, provided that the index files are read-only. The indexing in V6 would usually happen on another machine, and the destination set of index files would usually be replaced once the new index is ready. This, though, won't help you if you need partial updates.
NetApp deduplication sounds interesting; Endeca has never tested the functionality, so I am not sure what kinds of problems you will run into.
