Use dse sqoop to import data from an RDBMS into Cassandra

I followed these instructions:
https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/ana/anaSqpDemo.html
Set HADOOP_ENABLED=1 from /etc/default/dse.
sudo service dse start or dse cassandra -t
Now I am able to use dse hadoop, which means Hadoop is enabled.
But when I try to run dse sqoop import help, I get this error:
Unable to start sqoop: jobtracker not found
Then I figured out that I needed to add credentials. After I added a username and password, the help output worked:
But when I tried to use dse sqoop import, I got this error:
I think it's because dse sqoop doesn't recognize the Cassandra arguments, but this is DSE (DataStax), so how can it not recognize Cassandra arguments?
How can I make it work? Thanks

You either start dse as a service or manually, not both. These two steps are redundant:
sudo service dse start
dse cassandra -t
Pick one.
If it's a package install, you should probably use the service so as to avoid breaking file permissions.

Related

How to install Cassandra?

How do I download and install Apache Cassandra?

I assume you've already done this, but here's the Apache Cassandra download page: https://cassandra.apache.org/_/download.html
Following the links to download the current GA release should put an Apache Cassandra tarball in your ~/Downloads directory. I'd recommend moving that:
cd ~
mv ~/Downloads/apache-cassandra-4.0.6-bin.tar.gz .
Next, untar it:
tar -zxvf apache-cassandra-4.0.6-bin.tar.gz
That will create a directory containing Apache Cassandra. To start Cassandra, cd into that directory and execute the cassandra binary:
cd apache-cassandra-4.0.6
bin/cassandra -p cassandra.pid
You should see several messages, but this indicates a successful start:
StorageService.java:2785 - Node localhost/127.0.0.1:7000 state jump to NORMAL
Running Cassandra like this with the -p option will put the process ID into the cassandra.pid file and run it in the background. To stop Cassandra, simply run a kill on the contents of the file.
kill `cat cassandra.pid`
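The stop step above can also be scripted. This is a minimal sketch in Python, assuming the cassandra.pid file written by the -p option is in the current directory; the helper names read_pid and stop_cassandra are made up for illustration and are not part of Cassandra.

```python
import os
import signal
from pathlib import Path


def read_pid(pid_file):
    """Return the process ID stored in pid_file as an int."""
    return int(Path(pid_file).read_text().strip())


def stop_cassandra(pid_file="cassandra.pid"):
    """Send SIGTERM (the signal a plain `kill` sends) to the PID in pid_file."""
    pid = read_pid(pid_file)
    os.kill(pid, signal.SIGTERM)
    return pid
```

Calling stop_cassandra() from the directory that holds cassandra.pid should have the same effect as the kill command above.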
Powershell script execution unavailable. Please use 'powershell Set-ExecutionPolicy Unrestricted' on this user-account to run cassandra with fully featured functionality on this platform.
Starting with legacy startup options
JAVA_HOME environment variable must be set!
Ahh... You're trying to run Cassandra on Windows! That's an important detail to mention. Cassandra used to ship with Powershell scripts for this purpose...which were removed with version 4.0. Your options:
Run Apache Cassandra 3.11, which still has the Powershell scripts.
Run Apache Cassandra 4.0 using local virtualization/containerization.
Last year I posted a video on how to run Cassandra 4.0 locally on Windows using Minikube: Setting up Cassandra 4.0 locally on a Windows Machine
You can also use the official Cassandra container image on Docker Hub, assuming your company allows it, and you don't mind it missing important things like security.

How can I install sjk into cassandra / nodetool?

I found an interesting page in the Cassandra documentation:
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSjk.html
However, when I try it, I get
# nodetool sjk --commands
nodetool: Found unexpected parameters: [sjk, --commands]
See 'nodetool help' or 'nodetool help <command>'.
I suppose it's because sjk is not installed in my standard Cassandra 3.11.3 Debian installation. However, it seems that sjk is a free tool:
https://github.com/aragozin/jvm-tools
Is it possible to install it in a way that integrates with nodetool? How? Or is sjk already integrated and was just renamed?

I still have no idea how to integrate it with nodetool, but in case someone who knows even less than I do finds this: it can be used standalone by downloading the jar (sjk-plus-0.12.jar, for example) and running it like this:
java -jar sjk-plus-0.12.jar mxdump -s 127.0.0.1:7199 -q 'java.lang:type=GarbageCollector,name=*'

In Apache Cassandra, sjk is not available through the nodetool utility; it is available in the DataStax distribution. You need to use the sjk jar directly for the metrics and details.

Datastax CE Cassandra migrate to Apache Cassandra

I have a DataStax Community Edition 2.2.11 Cassandra cluster with 90 nodes.
I am trying to migrate it to Apache Cassandra 2.2.11.
I would like to try this in my test environment first, but I couldn't find any documentation. Is there a pattern or recommended way to do the migration?
Does anybody have experience with this?

Steps:
1. Alter the keyspaces that use "EverywhereStrategy" to "SimpleStrategy". "EverywhereStrategy" is not supported by Apache Cassandra. One or two keyspaces use it; dse_system is one of them.
2. Run nodetool drain before shutting down the existing Cassandra service.
3. Stop the Cassandra services.
4. Back up your Cassandra configuration files from the old installation.
5. Update the Java version if needed.
6. Install the binaries (via tarball, apt-get, yum, etc.) for Apache Cassandra.
7. Configure the new product. Compare, merge and update any modifications you previously made into the new configuration files for the Apache version (cassandra.yaml, cassandra-env.sh, etc.).
8. Start the Cassandra services.
9. Check the logs for warnings, errors, and exceptions:
tail -f /var/log/cassandra/system.log # or the path where you set your logs
10. Run nodetool upgradesstables:
nodetool upgradesstables
(The upgradesstables step can be run on each node after the nodes are done with the migration.)
11. Check the logs again for warnings, errors, and exceptions:
tail -f /var/log/cassandra/system.log # or the path where you set your logs
12. Check the status of the cluster:
nodetool status
13. Repeat these upgrade steps on each node in the cluster.
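With 90 nodes, reading nodetool status by eye after every node gets tedious. The sketch below, in Python, shows one way to flag nodes that are not in the UN (Up/Normal) state. The function name nodes_not_up_normal is hypothetical, and the sample text is made-up data in the usual nodetool status layout, not output from a real cluster.

```python
def nodes_not_up_normal(status_output):
    """Return (state, address) for every node row whose state is not UN."""
    problems = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Node rows start with a two-letter code: U/D (up/down),
        # then N/L/J/M (normal/leaving/joining/moving).
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            if code != "UN":
                problems.append((code, fields[1]))
    return problems


# Made-up sample in the layout printed by `nodetool status`.
SAMPLE_STATUS = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns  Host ID                               Rack
UN  10.0.0.1  103.4 KB  256     ?     11111111-1111-1111-1111-111111111111  rack1
DN  10.0.0.2  98.7 KB   256     ?     22222222-2222-2222-2222-222222222222  rack1
"""

if __name__ == "__main__":
    print(nodes_not_up_normal(SAMPLE_STATUS))
```

To use it against a live cluster, pass in the text captured from nodetool status, for example the stdout of subprocess.run(["nodetool", "status"], capture_output=True, text=True).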

"Cannot find hadoop installation : $HADOOP_HOME .. " getting this error while trying to run hive on spark.

I have followed this https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-Configurationpropertydetails
Have executed:
set spark.home=/location/to/sparkHome;
set hive.execution.engine=spark;
set spark.master=<Spark-Master-URL>;
However, on running ./hive I get the following error:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
I do not have Hadoop installed, and I want to run Hive on top of Spark running in standalone mode.
Is it mandatory to have Hadoop set up in order to run Hive over Spark?

IMHO Hive cannot run without Hadoop. There may be VMs that come with everything preinstalled. Hive runs on top of Hadoop, so you first need to install Hadoop, and then you can try Hive.
Please refer to this: https://stackoverflow.com/a/21339399/5756149
Anyone, correct me if I am wrong.
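The error message names three conditions: $HADOOP_HOME set, $HADOOP_PREFIX set, or hadoop on the path. The Python sketch below simply checks those same three conditions so you can see which one, if any, your shell satisfies. It is not Hive's own startup logic, and find_hadoop is a hypothetical helper name.

```python
import os
import shutil


def find_hadoop(env=None):
    """Return a Hadoop location derived from env, or None if none is found.

    Checks the conditions named in the error message: HADOOP_HOME,
    HADOOP_PREFIX, then a `hadoop` executable on PATH.
    """
    env = os.environ if env is None else env
    for var in ("HADOOP_HOME", "HADOOP_PREFIX"):
        if env.get(var):
            return env[var]
    # shutil.which returns the full path of the executable, or None.
    return shutil.which("hadoop", path=env.get("PATH", ""))


if __name__ == "__main__":
    location = find_hadoop()
    print(location if location else "no Hadoop installation found")
```

If this prints "no Hadoop installation found", the hive launcher would be expected to fail with the error quoted in the question.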

Cassandra RandomPartitioner on version 1.2.3

I'm installing Cassandra 1.2.3 on Debian using apt; I was previously using a 1.1.7 tarball install. After the install I'm changing the partitioner from Murmur3Partitioner to RandomPartitioner in cassandra.yaml as follows:
partitioner: org.apache.cassandra.dht.RandomPartitioner
Then on starting I'm seeing incompatible system keyspace errors as follows:
ERROR 18:22:11,465 Cannot open /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ib-1; partitioner org.apache.cassandra.dht.Murmur3Partitioner does not match system partitioner org.apache.cassandra.dht.RandomPartitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
Service exit with a return value of 1
How can I set the system keyspace to use RandomPartitioner? I have tried purging the data folder, apt-get remove, and also apt-get purge followed by re-installing, changing to RandomPartitioner and then starting Cassandra, but it still fails. I've also replicated this on my Ubuntu desktop, so I'm thinking I'm doing something wrong here.
Any help is appreciated!
Cheers
Sam

The partitioner cannot be changed once Cassandra has started for the first time. This error is showing that the data directory was initialized with Murmur3Partitioner but you're starting it using RandomPartitioner.
If you're trying to upgrade your data from your 1.1 install, Cassandra isn't reading from the right place. Adjust your data directory to use your 1.1 directory and it should start with partitioner set to RandomPartitioner.
If you're trying to start with no data, stop Cassandra, remove /var/lib/cassandra/* and start it again. Note you need to remove the commitlog directory as well as the data directory.
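Before restarting, it can help to confirm which partitioner cassandra.yaml actually configures, since a commented-out line is easy to misread. This is a minimal Python sketch that uses a regular expression rather than a YAML parser; configured_partitioner is a hypothetical helper name, and the sample text is illustrative.

```python
import re


def configured_partitioner(yaml_text):
    """Return the value of the uncommented top-level `partitioner:` line, or None."""
    match = re.search(r"^partitioner:\s*(\S+)", yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Illustrative fragment: the commented-out line must be ignored.
SAMPLE_YAML = """\
# partitioner: org.apache.cassandra.dht.Murmur3Partitioner
partitioner: org.apache.cassandra.dht.RandomPartitioner
"""

if __name__ == "__main__":
    print(configured_partitioner(SAMPLE_YAML))
```

Compare the returned class name with the partitioner named in the startup error. If they differ, either the configuration or the data directory is not the one you intended.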

I got a similar error to the one reported by Sam when I ran:
[root#fedora user]# dse cassandra
To correct the problem I did the following:
1. Opened the configuration file:
[root#fedora user]# vi /etc/dse/cassandra/cassandra.yaml
2. In the cassandra.yaml file, commented out "# partitioner: org.apache.cassandra.dht.Murmur3Partitioner" and replaced it with "partitioner: org.apache.cassandra.dht.RandomPartitioner"
3. Saved the change in cassandra.yaml
Hope this helps.
Mayukh.
