Extract from Datastax Cassandra and load into HBase using Sqoop

I have 3 physical nodes running DSE 4.5. On the same 3 nodes I also have HDP 2.2 running. Using Sqoop (either dse sqoop or the Sqoop provided by Hortonworks), how can I extract data from a Cassandra table and load it into HBase?
I have searched the net, and all the examples describe RDBMS to HBase and vice versa, or RDBMS to Cassandra and vice versa. I have not found any example for Cassandra to HBase, i.e. NoSQL to NoSQL. The README.txt in /usr/share/dse/demos/sqoop also only details import/export between MySQL and Cassandra.
Any help is much appreciated.

The version of Sqoop provided with DSE 4.5 does not support this. It only supports data transfer between an RDBMS and NoSQL, not between NoSQL and NoSQL.

Related

JDBC - can cassandra sparksql connector do joins in query tool ie Tableau/Alteryx/Sqlclient?

With the Spark SQL Cassandra connector, can a JDBC client tool (e.g. DbVisualizer, Tableau, Alteryx, etc.) join 2 Cassandra tables with Spark SQL?
All the documentation I see refers to joinWithCassandraTable, which I assume only works in Scala/Java code or spark-shell but not in a standard SQL client.
https://github.com/datastax/spark-cassandra-connector

DSE should support this if you're using the JDBC driver that is available from the DataStax Academy Downloads page. You'll need to run the Spark SQL Thrift server (via the dse spark-sql-thriftserver command). If you're just starting, DSE 6 has more improvements in this area (the so-called Always On SQL Service, AOSS).
There is an older blog post that talks about the ODBC driver with Spark SQL and joins, but the same should hold for JDBC drivers.
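To illustrate the answer above: once a Spark SQL Thrift server is running, a JDBC client sends an ordinary SQL JOIN and does not need joinWithCassandraTable. The following is a minimal sketch, not the DataStax-documented procedure. The table names (ks.users, ks.orders), the key column (user_id) and the idea of passing the JDBC URL as the first argument are all hypothetical, and the URL scheme depends on the driver you install (Hive-compatible drivers use the jdbc:hive2://host:port form).

```scala
import java.sql.{Connection, DriverManager}

object ThriftJoinSketch {
  // Builds a plain SQL join; table and column names are supplied by the caller.
  def joinQuery(left: String, right: String, key: String): String =
    s"SELECT l.*, r.* FROM $left l JOIN $right r ON l.$key = r.$key"

  // Runs the query over an already opened JDBC connection and counts the rows.
  def countRows(conn: Connection, sql: String): Int = {
    val stmt = conn.createStatement()
    try {
      val rs = stmt.executeQuery(sql)
      var n = 0
      while (rs.next()) n += 1
      n
    } finally stmt.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical keyspace, tables and join column.
    val sql = joinQuery("ks.users", "ks.orders", "user_id")
    println(sql)
    // Connect only when a JDBC URL is given; the matching driver
    // must be on the classpath and the Thrift server must be running.
    args.headOption.foreach { url =>
      val conn = DriverManager.getConnection(url)
      try println(s"rows: ${countRows(conn, sql)}") finally conn.close()
    }
  }
}
```

A tool such as Tableau or DbVisualizer does the equivalent internally: it opens the JDBC connection and submits the join as SQL text, and the server plans it with Spark SQL.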

Flink and Cassandra deployment similar to Spark?

DataStax bundles Spark directly into its DSE, and most documentation I've seen recommends co-locating Spark with each Cassandra node, so that the spark-cassandra-connector works most efficiently with the data of that node.
Does Flink's Cassandra connector optimize its data access based on Cassandra partition key hashes as well? If so, does Flink recommend a similar co-located install of Flink and C* on the same nodes?

DataStax Enterprise with HDFS and Spark without Cassandra

Is it possible to work with DSE, HDFS and Spark, but without Cassandra?
I am trying to replace CFS (Cassandra File System) with HDFS (Hadoop in DSE), but even
dse hadoop fs -help
needs Cassandra.
Cassandra takes a lot of memory. I hope that with HDFS only we will get more free RAM on each node.

Calling dse hadoop actually uses the Cassandra File System instead of HDFS, so you cannot run it without Cassandra running. DataStax does support a BYOH (bring your own Hadoop) option, but that involves using a third-party Hadoop. If you don't want Cassandra, though, I would not recommend using the DSE packaging.

Spark Sql JDBC Support

We are currently building a reporting platform, and we used Shark as its data store. Since development of Shark has stopped, we are now evaluating Spark SQL. Based on our use cases, we have a few questions.
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo). How can we get this data into Spark SQL? Is there any utility we can use? Does this utility support continuous refresh of data (syncing new adds/updates/deletes on the data store to Spark SQL)?
2) Is there a way to create multiple databases in Spark SQL?
3) For the reporting UI we use Jasper, and we would like to connect from Jasper to Spark SQL. Our initial search suggested that there is currently no support for consumers to connect to Spark SQL through JDBC, but that it is planned for future releases. When will Spark SQL have a stable release with JDBC support? Meanwhile, we took the source code from https://github.com/amplab/shark/tree/sparkSql but had some difficulty setting it up locally and evaluating it. It would be great if you could help us with setup instructions. (I can share the issue we are facing; please let me know where I can post the error logs.)
4) We also require a SQL prompt where we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and from Scala code we can fire SQL queries. As with Shark, we would like to have a SQL prompt in Spark SQL. Our search suggested that this will be added in a future release of Spark. It would be great if you could tell us which release of Spark will address this.

As for:
3) Spark 1.1 provides better support for the Spark SQL Thrift server interface, which you may want to use for JDBC interfacing. Hive JDBC clients that support v. 0.12.0 are able to connect to and interface with such a server.
4) Spark 1.1 also provides a Spark SQL CLI that can be used for entering queries, in the same fashion as the Hive CLI or Impala shell.
Please provide more details about what you are trying to achieve for 1 and 2.

I can answer (1):
Apache Sqoop was made specifically to solve this problem for relational databases. The tool was made for HDFS, HBase, and Hive -- as such it can be used to make data available to Spark, via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.

1) We have data from various sources( MySQL, Oracle, Cassandra, Mongo)
You have to use a different driver for each case. For Cassandra there is the DataStax driver (but I encountered some compatibility problems with Spark SQL). For any SQL system you can use JdbcRDD. The usage is straightforward; look at this Scala example:
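    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    test("basic functionality") {
      // sc is a var declared by the enclosing test suite
      sc = new SparkContext("local", "test")
      val rdd = new JdbcRDD(
        sc,
        // Function that opens a JDBC connection when a partition is read
        () => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
        // The two ? placeholders receive each partition's lower and upper bound
        "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
        1, 100, 3, // lowerBound, upperBound, numPartitions
        // Maps each ResultSet row to one RDD element
        (r: ResultSet) => { r.getInt(1) } ).cache()
      assert(rdd.count === 100)
      assert(rdd.reduce(_+_) === 10100)
    }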
But note that it's just an RDD, so you should work with this data through the map-reduce API, not in SQLContext.
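The three numbers passed to JdbcRDD in the example (1, 100, 3) are the lower bound, the upper bound and the number of partitions; each partition runs the query with its own values substituted for the two ? placeholders. The sketch below is plain Scala with no Spark dependency and shows one way such an inclusive range can be split. The arithmetic mirrors my understanding of what JdbcRDD does, but treat it as an assumption; the object and method names are made up for illustration.

```scala
object JdbcRangeSketch {
  // Splits the inclusive range [lower, upper] into numPartitions
  // contiguous, inclusive sub-ranges of near-equal size.
  def ranges(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
    val lo = BigInt(lower)
    // Bounds are inclusive, hence the + 1; BigInt avoids Long overflow.
    val length = BigInt(upper) - lo + 1
    (0 until numPartitions).map { i =>
      val start = lo + (length * i) / numPartitions
      val end = lo + (length * (i + 1)) / numPartitions - 1
      (start.toLong, end.toLong)
    }
  }

  def main(args: Array[String]): Unit =
    // Print the query each partition would effectively run.
    ranges(1, 100, 3).foreach { case (lo, hi) =>
      println(s"SELECT DATA FROM FOO WHERE $lo <= ID AND ID <= $hi")
    }
}
```

Under these assumptions the bounds 1, 100, 3 give the ranges 1 to 33, 34 to 66 and 67 to 100, which is why the query needs exactly two placeholders and a numeric column that is spread reasonably evenly across the range.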
Does there exist any utility which we can use?
There is the Apache Sqoop project, but it is still under active development. The current stable version doesn't even save files in Parquet format.

Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?

There are a couple of Spark - MongoDB connectors:
- the MongoDB connector for Hadoop (which doesn't actually need Hadoop at all!) https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
- the Stratio MongoDB connector https://github.com/Stratio/spark-mongodb

If your data is huge and you need to perform a lot of transformations, then Spark SQL can be used for ETL purposes; otherwise Presto could solve all your problems. Addressing your queries one by one:
As your data is in MySQL, Oracle, Cassandra and Mongo, all of these can be integrated in Presto, as it has connectors https://prestodb.github.io/docs/current/connector.html for these databases.
Once you install Presto in cluster mode, you can query all these databases together on one platform. It also lets you join a table from Cassandra with tables from Mongo, which is a rare degree of flexibility.
Presto can be connected to Apache Superset https://superset.incubator.apache.org/ which is open source and provides a full set of dashboarding features. Presto can also be connected to Tableau.
You can set up MySQL Workbench with the Presto connection details, which gives you a UI for all your databases in one place.

Performing Analytics over Cassandra DB

I work for a small company and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, such as a sum function, on a Cassandra DB to create reports. For this, Hive and Acunu are possible choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such an integration? Is there any other way to integrate Hive and Cassandra? If so, can I get links or documents about it? Is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra DB?
Thanks in advance.

I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.

Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster co-located with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read data from and write data to Cassandra in your Hadoop jobs.
