DataStax Enterprise with HDFS and Spark without Cassandra

Is it possible to use DSE with HDFS and Spark, but without Cassandra?
I am trying to replace CFS (Cassandra File System) with HDFS (Hadoop in DSE), but
dse hadoop fs -help
needs Cassandra.
Cassandra takes a lot of memory; I hope that with HDFS only we would get more free RAM on each node.

DSE Hadoop actually uses the Cassandra File System (CFS) instead of HDFS, so you cannot run it without Cassandra running. DataStax does support a BYOH (Bring Your Own Hadoop) option, but that involves using a third-party Hadoop distribution. If you don't want Cassandra, though, I would not recommend using the DSE packaging.

Related

Migrate Datastax Enterprise Cassandra to Apache Cassandra

We are currently using DSE 4.8 and 5.12 and want to migrate to Apache Cassandra. Since we don't use Spark or Search, we thought we could save some money by moving to Apache Cassandra. Can this be achieved without downtime? I see that sstableloader works the other way around. Can anyone share the steps to migrate from DSE to Apache Cassandra? Something like this, but from DSE to Apache:
https://support.datastax.com/hc/en-us/articles/204226209-Clarification-for-the-use-of-SSTABLELOADER
Figure out which version of Apache Cassandra your DSE release is running. Based on the DSE documentation, DSE 4.8.14 uses Apache Cassandra 2.1 and DSE 5.1 uses Apache Cassandra 3.11.
The simplest way to do this is to build another DC (a logical datacenter, as Cassandra sees it) running Apache Cassandra and add it to the existing cluster.
As usual, run "nodetool rebuild {from-old-DC}" on the new DC's nodes and let Cassandra take care of streaming the data to the new Apache Cassandra nodes.
Once data streaming has completed, switch the applications' local_dc to DC2 (the new DC), depending on the LoadBalancingPolicy they use. Once the new DC starts taking traffic, shut down the nodes in the old DC (say, DC1) one by one.
Alter the dse_system and dse_security keyspaces so that they no longer use EverywhereStrategy.
On non-seed nodes, clean up the Cassandra data directory.
Turn on node replacement (replace_address) in cassandra-env.sh.
Start the instance.
Monitor the streaming process with the command 'nodetool netstats|grep Receiving'.
Change the seed node definitions and do a rolling restart before finally migrating the previous seed nodes.
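The keyspace change in the steps above can be sketched as generated CQL. This is only a sketch under assumptions: the helper name alter_keyspace_cql, the datacenter name DC2, and the replication factor of 3 are all hypothetical, and it assumes the DSE keyspaces are moved to NetworkTopologyStrategy, which the answer itself does not specify.

```python
def alter_keyspace_cql(keyspace, dc_replication):
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy.

    dc_replication maps a datacenter name to its replication factor.
    """
    if not dc_replication:
        raise ValueError("at least one datacenter is required")
    options = {"class": "NetworkTopologyStrategy"}
    options.update(dc_replication)
    # Render a CQL map literal: strings in single quotes, integers bare.
    parts = []
    for key, value in options.items():
        rendered = str(value) if isinstance(value, int) else "'%s'" % value
        parts.append("'%s': %s" % (key, rendered))
    return "ALTER KEYSPACE %s WITH replication = {%s};" % (keyspace, ", ".join(parts))


# Hypothetical topology: a single datacenter named DC2 with replication factor 3.
statements = [
    alter_keyspace_cql(ks, {"DC2": 3})
    for ks in ("dse_system", "dse_security")
]
for stmt in statements:
    print(stmt)
```

The printed statements could then be pasted into cqlsh after checking that the datacenter names match what nodetool status reports for your cluster.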

How to set Cassandra as my Distributed Storage (File System) for my Spark Cluster

I am new to big data and Spark (PySpark).
I recently set up a Spark cluster and want to use the Cassandra File System (CFS) on it to help upload files.
Can anyone tell me how to set it up and briefly explain how to use CFS (for example, how to upload files, and from where)?
BTW, I don't even know how to use HDFS (I downloaded the pre-built spark-bin-hadoop package, but I can't find Hadoop on my system).
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most distributed file applications. It is primarily intended as a substitute for HDFS for map/reduce jobs and for small, temporary, but distributed files.
To use it, just use a cfs:// URI and make sure you launch your application with dse spark-submit.
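As a rough illustration of the answer above, the sketch below reads a text file from CFS in PySpark. The helper cfs_uri, the function count_lines, and the path /tmp/example.txt are hypothetical, and the cfs:/// form with an empty host is an assumption based on how Hadoop-style filesystem URIs are usually written. It would only resolve when the script is launched through dse spark-submit.

```python
def cfs_uri(path):
    """Return a cfs:// URI for an absolute path (empty host part)."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return "cfs://" + path


def count_lines(path):
    """Count the lines of a text file stored in CFS."""
    # Imported lazily so the URI helper also works without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    return sc.textFile(cfs_uri(path)).count()


# Usage (hypothetical path), inside a script launched with dse spark-submit:
#     print(count_lines("/tmp/example.txt"))
```

Uploading a file is a separate step done with the DSE Hadoop file system shell (the dse hadoop fs command mentioned at the top of this page); the exact subcommands should be checked against the DSE documentation for your version.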

Flink and Cassandra deployment similar to Spark?

DataStax bundles Spark directly into its DSE, and most documentation I've seen recommends co-locating Spark with each Cassandra node so that the spark-cassandra-connector works most efficiently with the data on that node.
Does Flink's Cassandra connector optimize its data access based on Cassandra partition key hashes as well? If so, does Flink recommend a similar co-located install of Flink and C* on the same nodes?

Strip Datastax binary to have only Cassandra

I have downloaded the latest DataStax binary, 4.5.2. It comes loaded with Hive, Hadoop, Solr, etc., which I am not interested in. I just want to bundle Cassandra with my product. I tried removing all the folders from dse-4.5.2/resources except cassandra and then starting Cassandra by executing the command below from dse-4.5.2/bin:
./dse cassandra
However, it failed. So it looks like it's not as simple as deleting folders.
Has anyone ever tried this?
DSE will not use Hive, Hadoop, Solr, etc. unless you explicitly ask it to.
For example, to start DSE with search, run:
dse cassandra -s
If you just start it with dse cassandra, it will only start the Cassandra process.
I'd recommend using Apache Cassandra for this. Here's a Puppet module that you might like: https://github.com/heartysoft/puppet-cassandra

Performing Analytics over Cassandra DB

I work for a small company and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, such as sum functions, on a Cassandra DB to create reports. For this, Hive and Acunu could be options.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration? Is there any other way to integrate Hive and Cassandra? If so, can I get links or documents about it? Is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read/write data from/to your Hadoop cluster.

Resources