How do you use the Cassandra tool sstableloader? - cassandra

I'm trying to use sstableloader to load data into an existing Cassandra ring, but can't figure out how to actually get it to work. I'm trying to run it on a machine that has a running Cassandra node on it, but when I run it I get an error saying that port 7000 is already in use, which is the port the running Cassandra node is using for gossip.
So does that mean I can only use sstableloader on a machine that is in the same network as the target cassandra ring, but isn't actually running a cassandra node?
Any details would be useful, thanks.

I played around with sstableloader, read the source code, and finally figured out how to run sstableloader on the same machine that hosts a running Cassandra node. There are two key points to get this running. First, you need to create a copy of the Cassandra install folder for sstableloader. This is because sstableloader reads the yaml file to figure out what IP address to use for gossip, and the address in the existing yaml file is already in use by the running Cassandra node. Second, you'll need to create a new loopback IP address (something like 127.0.0.2) on your machine. Once this is done, change the yaml file in the copied Cassandra install folder to listen on this IP address.
I wrote a tutorial going more into detail about how to do this here: http://geekswithblogs.net/johnsPerfBlog/archive/2011/07/26/how-to-use-cassandrs-sstableloader.aspx
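The yaml change described above can be sketched in Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the yaml is assumed to contain a plain top-level `listen_address:` line, and creating the loopback address itself is an operating-system step that the code does not cover.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cassandra: rewrites listen_address
// in the text of the copied cassandra.yaml.
final class LoaderYaml {
    // Matches an uncommented, top-level "listen_address: ..." line.
    private static final Pattern LISTEN =
            Pattern.compile("(?m)^listen_address:.*$");

    /** Returns the yaml text with listen_address set to the given address. */
    static String withListenAddress(String yaml, String address) {
        Matcher m = LISTEN.matcher(yaml);
        if (!m.find()) {
            throw new IllegalArgumentException("no listen_address line found");
        }
        // replaceFirst resets the matcher, so the earlier find() is harmless.
        return m.replaceFirst(
                Matcher.quoteReplacement("listen_address: " + address));
    }

    public static void main(String[] args) {
        String yaml = "cluster_name: 'Test Cluster'\nlisten_address: localhost\n";
        // 127.0.0.2 is the example loopback address from the answer above.
        System.out.print(withListenAddress(yaml, "127.0.0.2"));
    }
}
```

In practice you would read the yaml from the copied install folder, apply the change, and write it back, leaving the original node's yaml untouched.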

The Austin Cassandra Users Group just had a presentation on this:
http://www.slideshare.net/alex_araujo/etl-with-cassandra-streaming-bulk-loading/

I have used the sstableloader utility provided in cassandra-0.8.4 to successfully load sstables into Cassandra. Based on some of the issues I faced, I have the following tips:
If you are running it on a single machine, you have to create a copy of the Cassandra installation folder and run sstableloader from this copy. In the copy's cassandra.yaml, change the listen address and the rpc address, and provide the IP address of the running Cassandra node as the seed. Check that the cluster name is the same in both cassandra.yaml files.
The sstables have to be in a directory whose name is the name of the keyspace.
It requires a directory containing a cassandra.yaml configuration file in the classpath.
Note that the schema for the column families to be loaded should be defined beforehand.
For Reference SEE: Using Cassandra SStableloader
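The cluster-name check from the tips above could be automated along these lines. This is a minimal sketch, not anything sstableloader provides: the class and method names are hypothetical, and it assumes `cluster_name` appears as a simple top-level `cluster_name: value` line instead of using a full YAML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares cluster_name across two cassandra.yaml texts.
final class ClusterNameCheck {
    // Captures the value of a top-level "cluster_name:" line.
    private static final Pattern NAME =
            Pattern.compile("(?m)^cluster_name:[ \\t]*(.*?)[ \\t]*$");

    /** Extracts the cluster name, dropping one pair of surrounding quotes. */
    static Optional<String> clusterName(String yaml) {
        Matcher m = NAME.matcher(yaml);
        if (!m.find()) {
            return Optional.empty();
        }
        String raw = m.group(1);
        if (raw.length() >= 2) {
            char first = raw.charAt(0);
            char last = raw.charAt(raw.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                raw = raw.substring(1, raw.length() - 1);
            }
        }
        return Optional.of(raw);
    }

    /** True only when both texts declare the same cluster name. */
    static boolean sameCluster(String yamlA, String yamlB) {
        Optional<String> a = clusterName(yamlA);
        return a.isPresent() && a.equals(clusterName(yamlB));
    }

    public static void main(String[] args) {
        String node = "cluster_name: 'Test Cluster'\nlisten_address: localhost\n";
        String copy = "cluster_name: 'Test Cluster'\nlisten_address: 127.0.0.2\n";
        System.out.println(sameCluster(node, copy));
    }
}
```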

For Reference SEE: Using Cassandra SStableloader for bulkloading the data into cassandra
http://ramuprograms.blogspot.com/2014/07/bulk-loading-data-into-cassandra-using.html

If you are looking to do this in Java see below utility class:
BulkWriterLoader
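import java.util.ArrayList;
import java.util.List;
import org.apache.cassandra.tools.BulkLoadException;
import org.apache.cassandra.tools.BulkLoader;
import org.apache.cassandra.tools.LoaderOptions;

// Build the same arguments the sstableloader command line accepts.
// "params" is the utility class's own holder for its settings.
List<String> argList = new ArrayList<>();
argList.add("-v");                // verbose output
argList.add("-d");                // initial hosts to contact
argList.add(params.hosts);
argList.add("-f");                // path to cassandra.yaml
argList.add(params.cassYaml);
argList.add(params.fullpath);     // directory holding the sstables
LoaderOptions options = LoaderOptions.builder()
        .parseArgs(argList.toArray(new String[0]))
        .build();
try
{
    BulkLoader.load(options);
}
catch (BulkLoadException e)
{
    e.printStackTrace();
}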
...
The code will also generate the sstable files using the CQLSSTableWriter class.

Things have improved, and the whole procedure of using sstableloader is now much easier, including an easier way to generate sstables with CQLSSTableWriter.
For all the details:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsBulkloader.html

Related

cassandra.yaml changes not working at all

I'm new to Cassandra. I'm using 3.11.4 and just installed it on an Ubuntu VM. Following the instructions, I tried to change the cluster name in the .yaml config file, but when I save the file and start Cassandra, it fails. This happens with anything I change in the .yaml file; it just doesn't work the way the documentation says it should. (I put the Cassandra files in a location where my user has all permissions.)
If I make no changes to the file and start Cassandra, it starts successfully.
I found out that I can change the cluster name, listen address, or any other parameter listed in the .yaml file successfully after connecting to the database and running a query, for example:
UPDATE system.local SET cluster_name = '<new cluster name>' WHERE key = 'local';
but that's not the point of having the .yaml conf file.
Does someone know why this happens?
I've had this issue even using other Cassandra versions, like 3.11.2
Thanks in advance.
In Cassandra you can't simply change some configuration parameters and expect it to work after a restart. cluster_name is not specific to the node; it applies to the entire cluster. Parameters like data_file_directories can be changed at the node level.
If you want to change the name of the cluster, that is a whole different process. Refer to the link below:
https://support.datastax.com/hc/en-us/articles/205289825-Change-Cluster-Name-

Elassandra Integration with Existing Cassandra Instance

I'm trying to learn Elassandra and am having an issue configuring it to my current Cassandra instance (I'm learning Cassandra as well).
I downloaded version 3.11.3 of Cassandra to my local computer. I didn't change anything except the cluster_name inside of cassandra.conf. It runs fine and I used bin/cqlsh to create a keyspace and a "user" table with a couple of rows for testing.
I followed the steps on the Elassandra integration page. I downloaded version 6.2.3.10 of Elassandra. I replaced the cassandra.yaml, cassandra-rackdc.properties and cassandra-topology.properties in the Elassandra conf with the ones from the Cassandra conf (I am assuming those last 2 are the "snitch configuration file" mentioned in the instructions but I'm not sure). I stopped my Cassandra instance and then ran the bin/cassandra -e f from my Elassandra directory.
When I run curl -X GET localhost:9200, the output seems to have my correct cluster name, etc.
However, if I run bin/cqlsh from my Elassandra directory and run describe keyspaces, the keyspace I created under Cassandra isn't there. I tried copying the data directory from Cassandra to Elassandra and that seemed to work, but I feel this can't possibly be the actual solution.
Can someone point me to what I am missing in regards to this configuration? With the steps being listed on the website, I'm sure there must be some dumb thing I'm missing.
Thanks in advance.

How to use sstableloader from a node not in Cassandra Cluster ring

We are using apache-cassandra version 1.1.9 on a production Cassandra cluster on Linux. I want to upload some data using sstableloader.
I was able to generate sstables for a small amount of data and then tried to upload these sstables into the Cassandra cluster using sstableloader from another machine (which is in the same network but not in the Cassandra cluster ring), but I get the error below:
"Could not retrieve endpoint ranges:"
I do not understand why this error occurs.
The machine where I am running sstableloader has the same Cassandra installation. I copied the cassandra.yaml from production Cassandra into my host machine's apache-cassandra/conf folder.
My sstables are in the directory structure below:
/path/to/keyspace dir/Keyspace/*.db
The sstableloader command I am running is below:
./sstableloader -d -i , /home/Data/Keyspace/
Could not retrieve endpoint ranges:
Please advise if I am doing something wrong here.
Found the solution.
The sstableloader command needs to be executed from the directory containing the Keyspace subdirectory.
For example, if /home/Data is the directory under which there are the subdirectories keyspace/ColumnFamily/, then execute the command like below from the /home/Data/ directory:
~/apache-cassandra/bin/sstableloader -d /keyspace/ColumnFamily
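To make the layout requirement concrete, here is a small Java sketch that reads the keyspace and column family names off the last two elements of the directory path, which is how the layout in this answer is organised. The class and method names are hypothetical, and treating the last two path elements as keyspace and column family is an assumption drawn from this answer, not a statement about every Cassandra version.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives keyspace and column family from a
// directory laid out as .../<keyspace>/<ColumnFamily>.
final class SstableDir {
    /** Returns {keyspace, columnFamily} from the last two path elements. */
    static String[] keyspaceAndColumnFamily(Path dir) {
        Path normalized = dir.normalize();
        int n = normalized.getNameCount();
        if (n < 2) {
            throw new IllegalArgumentException(
                    "expected .../<keyspace>/<ColumnFamily>, got: " + dir);
        }
        return new String[] {
            normalized.getName(n - 2).toString(),
            normalized.getName(n - 1).toString()
        };
    }

    public static void main(String[] args) {
        // Example path taken from the directory structure described above.
        String[] names =
                keyspaceAndColumnFamily(Paths.get("/home/Data/keyspace/ColumnFamily"));
        System.out.println("keyspace=" + names[0] + " columnFamily=" + names[1]);
    }
}
```

A check like this before invoking the loader can catch a path that points one level too high or too low.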
This is a bit old, but I ran into the "Could not retrieve endpoint ranges" error recently with a different root cause.
In our case, data was being exported from a production system, and loaded to a new development instance. The development instance had been incorrectly set up so the sstables were generated using dse 4.7, and the sstableloader being run was dse 4.6.
Note that it is possible to ingest tables from 4.6 into dse 4.7 for debugging, etc, but that it is necessary to run nodetool upgradesstables first. That isn't what was going on here.

Cassandra store Keyspace to new Disk

I just set up a fresh Windows server with a fresh DataStax installation including Cassandra 1.2 and OpsCenter 2.1.3. I've tried finding solutions to these questions on Cassandra wikis and the DataStax website, but I can only find Unix-specific information or DataStax API information.
Cassandra defaults to using the C: drive (I was never asked to select a drive for Cassandra during install).
In the same cassandra instance, can I have keyspaces on separate disks?
If not, how do I migrate the existing keyspace to the new drive? (just reconfiguring cassandra.yaml to use a new directory would lose my opscenter data and may even break opscenter).
If yes, how can I create a new keyspace on a separate drive? cassandra.yaml seems to only have configuration options for a single store location.
Should I be creating a new cluster to store my data in? If I start adding new nodes to the default cluster, that will mean the datastax opscenter data will be getting replicated - that seems like a bad idea.
If there is good documentation on this somewhere, please point me there.
Thanks,
Adam
You cannot get cassandra to split the keyspaces and store them in different directories. They are all stored under a common data directory that is specified in the cassandra.yaml file.
However, you can set this up and use NTFS to mount different drives under the data directory on your server but this will not be simple or expandable.
If you want to move where the data is stored on cassandra, then stop the cassandra daemon/service, change the cassandra.yaml file to store the data at a new location, then copy/move the entirety of the data directory to this new location. THEN start cassandra back up and it will work fine with the data in the new location. I have done this quite a few times now and cassandra comes back up without incident and no lost data (if you do not move the data, then it will lose it all and recreate the directory structure under the new location).
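The copy step of that procedure can be sketched in Java as a recursive directory copy. This is a minimal illustration with hypothetical class and method names; it assumes Cassandra is already stopped and that the target location does not yet contain any of the files. Stopping the service and editing cassandra.yaml remain manual steps.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: copies an entire data directory tree to a new location.
final class DataDirCopy {
    /** Recursively copies source into target, creating target if needed. */
    static void copyTree(Path source, Path target) throws IOException {
        // Files.walk visits a directory before its contents, so each
        // destination directory exists before files are copied into it.
        try (Stream<Path> paths = Files.walk(source)) {
            paths.forEach(p -> {
                Path dest = target.resolve(source.relativize(p));
                try {
                    if (Files.isDirectory(p)) {
                        Files.createDirectories(dest);
                    } else {
                        // Fails if dest already exists; nothing is overwritten.
                        Files.copy(p, dest);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DataDirCopy <old-data-dir> <new-data-dir>");
            return;
        }
        copyTree(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Copying and then verifying before deleting the old directory keeps a fallback in case the new location turns out to be misconfigured.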
Data getting replicated is not a bad thing - it is what cassandra was designed for. I don't know what replication factor opscenter uses, but it does not store a massive amount of data so replication is not a problem.

Cassandra migration

I have Cassandra 0.8.0 running with data on server 1, and a clean install of Cassandra 1.0.3 on server 2.
Is it possible to just copy some files from server 1 to server 2? Or do I have to write my own import/export code?
Both servers can be taken down, restarted, etc.
Why would you not upgrade server1? Upgrade details here (either way read this first):
http://svn.apache.org/viewvc/cassandra/branches/cassandra-1.0/NEWS.txt?view=markup
But if you do want to change machines, follow the procedures for 'nodetool snapshot' as detailed here:
http://wiki.apache.org/cassandra/Operations#Backing_up_data
Re-create the schema on the new node, then add the snapshots to the data directory (as described above), restart cassandra then issue a nodetool scrub.
Thanks zznate, it had to do with hardware.
Here are some links I found useful:
http://jonathanhui.com/cassandra-data-maintenance-backup-and-system-recovery
http://wiki.apache.org/cassandra/StorageConfiguration
http://www.memonic.com/user/pneff/folder/database/id/1bZvk
If it looks like nothing happened after migrating, make sure you create the column families on the new node using CassandraCli.
