Strip DataStax binary to have only Cassandra

I have downloaded the latest DataStax binary, 4.5.2. It comes loaded with Hive, Hadoop, Solr, etc., which I am not interested in. I just want to bundle Cassandra with my product. I tried removing every folder from dse-4.5.2/resources except cassandra, and then tried starting Cassandra by running the command below from dse-4.5.2/bin:
./dse cassandra
However, it failed, so it looks like it's not as simple as deleting folders.
Has anyone ever tried this?

DSE will not use Hive, Hadoop, Solr, etc. unless you explicitly ask it to.
For example, to start DSE with search, run:
dse cassandra -s
If you start it with plain dse cassandra, it will only start the Cassandra process.

I'd recommend using Apache Cassandra for this. Here's a Puppet module that you might like: https://github.com/heartysoft/puppet-cassandra

Related

How can I upgrade Apache Hive to version 3 on GCP Apache Spark Dataproc Cluster

For one reason or another, I want to upgrade Apache Hive from 2.3.4 to 3 on a Google Cloud Dataproc (1.4.3) Spark cluster. How can I upgrade Hive while maintaining compatibility with the Cloud Dataproc tooling?
Unfortunately, there's no real way to guarantee compatibility with such customizations, and there are known incompatibilities between currently released Spark versions and Hive 3.x, so you'll likely run into problems unless you've managed to cross-compile all the versions you need yourself.
In any case, if you're only trying to get a limited subset of functionality working, the easiest way to go about it is simply to drop your custom jarfiles into:
/usr/lib/hive/lib/
on all your nodes via an init action. You may need to reboot your master node afterwards to update the Hive metastore and HiveServer2, or at least run:
sudo systemctl restart hive-metastore
sudo systemctl restart hive-server2
on your master node.
For Spark issues, you may also need your own custom build of Spark, replacing the jarfiles under:
/usr/lib/spark/jars/
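A minimal sketch of what such an init action could look like. The bucket path gs://my-bucket/hive3-jars is a made-up placeholder, and the run and install_custom_jars helpers are hypothetical names introduced only for illustration. By default the script prints the commands instead of executing them, so it can be reviewed before it touches a node.

```shell
#!/usr/bin/env bash
# Sketch of a Dataproc init action that drops custom Hive jars onto a node.
# ASSUMPTION: the bucket path below is hypothetical; point it at your own jars.
JAR_SOURCE="${JAR_SOURCE:-gs://my-bucket/hive3-jars}"
HIVE_LIB_DIR="/usr/lib/hive/lib"
# Commands are only printed unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 is the node role ("Master" or "Worker").
install_custom_jars() {
  # Copy the custom jars into the Hive lib directory on every node.
  run gsutil -m cp "${JAR_SOURCE}/*.jar" "${HIVE_LIB_DIR}/"
  if [ "$1" = "Master" ]; then
    # The answer above restarts these two services on the master node.
    run systemctl restart hive-metastore
    run systemctl restart hive-server2
  fi
}
```

On a real node, the role would typically be read from instance metadata, for example install_custom_jars "$(/usr/share/google/get_metadata_value attributes/dataproc-role)" with DRY_RUN=0. Treat that call as an assumption to verify against your Dataproc image.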

Migrate Datastax Enterprise Cassandra to Apache Cassandra

We are currently using DSE 4.8 and 5.12 and want to migrate to Apache Cassandra. Since we don't use Spark or Search, we thought we could save some bucks by moving to Apache. Can this be achieved without downtime? I see sstableloader works the other way. Can anyone share the steps to follow to migrate from DSE to Apache Cassandra? Something like this, but from DSE to Apache:
https://support.datastax.com/hc/en-us/articles/204226209-Clarification-for-the-use-of-SSTABLELOADER
First, figure out which version of Apache Cassandra your DSE release is running. According to the DSE documentation, DSE 4.8.14 uses Apache Cassandra 2.1 and DSE 5.1 uses Apache Cassandra 3.11.
The simplest way to do this is to build another DC (a logical DC, in Cassandra terms) and add it to the existing cluster.
As usual, run "nodetool rebuild {from-old-DC}" on the new DC's nodes and let Cassandra take care of streaming data to the new Apache Cassandra nodes naturally.
Once data streaming has completed, switch the applications' local_dc to DC2 (the new DC), depending on the LoadBalancingPolicy they use. Once the new DC starts taking traffic, shut down the nodes in the old DC (say, DC1) one by one.
Alter the dse_system and dse_security keyspaces so that they no longer use EverywhereStrategy.
On non-seed nodes, clean up the Cassandra data directory.
Turn on replace in cassandra-env.sh.
Start the instance.
Monitor the streaming process with the command 'nodetool netstats|grep Receiving'.
Change the seed node definitions and do a rolling restart before finally migrating the previous seed nodes.
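The new-DC approach only streams data for keyspaces that replicate to the new DC, so each keyspace has to be altered before the rebuild. A rough sketch of the commands involved follows. The helper names alter_keyspace_cql and rebuild_cmd are hypothetical, and the keyspace name, DC names and replication factors are placeholders, not values from the question. The same ALTER KEYSPACE form could be used to move dse_system and dse_security onto NetworkTopologyStrategy.

```shell
#!/usr/bin/env bash
# Sketch: print the CQL and nodetool commands for adding a new DC.
# ASSUMPTION: keyspace, DC names and replication factors are placeholders.

# Usage: alter_keyspace_cql KEYSPACE OLD_DC OLD_RF NEW_DC NEW_RF
alter_keyspace_cql() {
  printf "ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', '%s': %s, '%s': %s};\n" \
    "$1" "$2" "$3" "$4" "$5"
}

# Usage: rebuild_cmd OLD_DC
# The printed command is meant to be run on each node of the new DC.
rebuild_cmd() {
  printf 'nodetool rebuild -- %s\n' "$1"
}

# Example: extend my_keyspace to DC2, then rebuild from DC1.
alter_keyspace_cql my_keyspace DC1 3 DC2 3
rebuild_cmd DC1
```

The functions only print text; nothing is sent to the cluster. The CQL line can be pasted into cqlsh and the nodetool line run by hand once the output looks right.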

YCSB for Cassandra 3.0 Benchmarking

I have a virtual Cassandra cluster on Ubuntu and need to benchmark it.
I am trying to do it with Yahoo's YCSB (without using Maven, if possible).
I use Cassandra 3.0.1 but I can't find a suitable version of YCSB.
I don't want to switch to an older version of Cassandra (the latest YCSB Cassandra binding is for Cassandra 2.x).
What should I do?
As suggested here, although Cassandra 3.x is not officially supported, you can use the cassandra-cql binding.
For instance:
./bin/ycsb load cassandra-cql -threads 4 -P workloads/workloada
I just tested it on Cassandra 3.11.0 and it works for both load and run.
That said, which benchmark software to use depends on your test plan. If you want to benchmark only Cassandra, then @gsteiner's solution might be the best. If you want to benchmark different databases with the same tool to avoid variability, then YCSB is the right one.
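Before the load step, the cassandra-cql binding needs a keyspace and table to write to. As far as I recall from the YCSB Cassandra binding README, these are a ycsb keyspace and a usertable table with a y_id key and ten varchar fields; treat those names as an assumption to check against your YCSB version. The ycsb_schema_cql helper below is a hypothetical name, and it only prints the CQL.

```shell
#!/usr/bin/env bash
# Sketch: print the schema that the YCSB cassandra-cql binding writes to.
# ASSUMPTION: keyspace "ycsb", table "usertable" and columns y_id, field0..field9
# follow the YCSB Cassandra binding README as I remember it.

# Usage: ycsb_schema_cql [REPLICATION_FACTOR]   (defaults to 1)
ycsb_schema_cql() {
  local rf="${1:-1}"
  local i=0
  printf "CREATE KEYSPACE IF NOT EXISTS ycsb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %s};\n" "$rf"
  printf 'CREATE TABLE IF NOT EXISTS ycsb.usertable (\n  y_id varchar PRIMARY KEY'
  # Append the ten value columns, field0 through field9.
  while [ "$i" -lt 10 ]; do
    printf ',\n  field%s varchar' "$i"
    i=$((i + 1))
  done
  printf '\n);\n'
}

ycsb_schema_cql 1
```

One way to use it is to save the output to a file and apply it with cqlsh -f, then pass the contact point to YCSB with -p hosts=127.0.0.1 (replace the address with one of your nodes) on the load and run commands.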
I would recommend using cassandra-stress to run a load/performance test against your Cassandra cluster. It is very customizable, to the point that you can test distributions with different data models as well as specify how hard you want to push your cluster.
Here is a link to the DataStax documentation for it, which covers how to use the tool in depth:
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCStress_t.html

Enable Spark on Same Node As Cassandra

I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through all the DataStax tutorials and they are very vague as to how you actually enable Spark. The only indication I can find is that it comes enabled automatically when you select the "Analytics" node type during install. However, I have an existing Cassandra node and I don't want to have to use a different machine for testing, as I am just evaluating everything on my laptop.
Is it possible to just enable Spark on the same node and deal with any performance implications? If so how can I enable it so that it can be tested?
I see the folders there for Spark (although I'm not positive all the files are present), but when I check whether the node is set as Spark master, it says that no Spark nodes are enabled:
dsetool sparkmaster
I am using Linux Ubuntu Mint.
I'm just looking for a quick and dirty way to get my data averaged and so forth and Spark seems like the way to go since it's a massive amount of data, but I want to avoid having to pay to host multiple machines (at least for now while testing).
Yes, Spark is also able to interact with a cluster even if it is not on all the nodes.
Package install
Edit the /etc/default/dse file, and then edit the appropriate line in this file, depending on the type of node you want:
...
Spark nodes:
SPARK_ENABLED=1
HADOOP_ENABLED=0
SOLR_ENABLED=0
Then restart the DSE service
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
Tar Install
Stop DSE on the node and then restart it using the following command
From the install directory:
...
Spark only node: $ bin/dse cassandra -k - Starts Spark trackers on a cluster of Analytics nodes.
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
Enable Spark by setting SPARK_ENABLED=1.
You can edit the file with the command: sudo nano /usr/share/dse/resources/dse/conf/dse.default
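For the package-install route, the edit can also be scripted instead of done by hand. This is a small sketch, not an official DataStax tool: enable_spark is a hypothetical helper that rewrites (or appends) the SPARK_ENABLED line in whichever defaults file you pass it, leaving the other lines untouched.

```shell
#!/usr/bin/env bash
# Sketch: set SPARK_ENABLED=1 in a DSE defaults file.
# ASSUMPTION: the file uses plain KEY=VALUE lines, as in the excerpt above.

# Usage: enable_spark FILE   (e.g. /etc/default/dse, run as root)
enable_spark() {
  local conf_file="${1:?usage: enable_spark FILE}"
  local tmp_file
  tmp_file="$(mktemp)"
  # Rewrite an existing SPARK_ENABLED line; every other line passes through.
  sed 's/^SPARK_ENABLED=.*/SPARK_ENABLED=1/' "${conf_file}" > "${tmp_file}"
  # If the file had no such line, append one.
  grep -q '^SPARK_ENABLED=' "${tmp_file}" || echo 'SPARK_ENABLED=1' >> "${tmp_file}"
  # Write back in place so the original file keeps its owner and mode.
  cat "${tmp_file}" > "${conf_file}"
  rm -f "${tmp_file}"
}
```

Trying it on a copy of the file first is a cheap way to check the result. After changing the real file, DSE still has to be restarted for the setting to take effect.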

DataStax Enterprise with HDFS and Spark without Cassandra

Is it possible to work with DSE, HDFS, and Spark, but without Cassandra?
I am trying to replace CFS (Cassandra File System) with HDFS (Hadoop in DSE), but
dse hadoop fs -help
needs Cassandra.
Cassandra takes a lot of memory; I hope that with HDFS only we'd get more free RAM on the node.
Calling DSE Hadoop actually uses the Cassandra File System instead of HDFS, so you cannot run it without Cassandra running. DataStax does support a BYOH (bring your own Hadoop) option, but that involves using a third-party Hadoop. If you don't want Cassandra, though, I would not recommend using the DSE packaging.

Resources