How to handle errors during data migration to Cassandra

While trying to migrate data from Oracle to Cassandra, I have the following questions:
How to handle errors during data migration to Cassandra using Spark SQL?
How to design the retry mechanism if anything fails?
Is there any document/sample/GitHub repo covering this?
~Sha

You can look at these GitHub repos for reference:
https://github.com/snazy/Oracle_to_Cassandra
https://github.com/AlexGruPerm/oratocass
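Neither repo is a drop-in solution, so here is a minimal sketch of one common pattern: read from Oracle over JDBC with Spark SQL, write with the DataStax spark-cassandra-connector, and wrap the write in a retry. All hostnames, credentials, keyspace, and table names below are placeholders, and the fixed-delay retry is an assumption you would tune for your own failure modes.

```scala
import org.apache.spark.sql.SparkSession
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object OracleToCassandra {

  // Retry a block a fixed number of times with a fixed delay.
  // Replace with exponential backoff / dead-letter handling as needed.
  @tailrec
  def retry[T](attempts: Int, delayMs: Long)(block: => T): T =
    Try(block) match {
      case Success(value) => value
      case Failure(_) if attempts > 1 =>
        Thread.sleep(delayMs)
        retry(attempts - 1, delayMs)(block)
      case Failure(e) => throw e
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-to-cassandra")
      .config("spark.cassandra.connection.host", "cassandra-host") // placeholder
      .getOrCreate()

    // Read the source table from Oracle over JDBC (placeholder connection details).
    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
      .option("dbtable", "MYSCHEMA.SOURCE_TABLE")
      .option("user", "migration_user")
      .option("password", sys.env.getOrElse("ORACLE_PASSWORD", ""))
      .load()

    // Write to Cassandra; retry the whole batch on transient failures.
    retry(attempts = 3, delayMs = 5000) {
      source.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "migrated_ks", "table" -> "source_table"))
        .mode("append")
        .save()
    }

    spark.stop()
  }
}
```

Since Cassandra writes are idempotent for a given primary key, re-running a failed batch is generally safe; for finer-grained handling you could process partitions individually and route rows that keep failing to a dead-letter table.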

Related

Spark JobServer can use Cassandra as SharedDb

I have been researching how to configure the Spark JobServer backend (SharedDb) with Cassandra.
In the SJS documentation I saw Cassandra cited as one of the shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example? Or can help me to configure it?
Edited:
I want to use Cassandra to store metadata and jobs from Spark JobServer, so that I can hit any of the servers through a proxy sitting in front of them.
Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
But Cassandra DAO was recently deprecated and removed from the project, because it was not really used and maintained by the community.
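For anyone still on one of those older versions, a sketch of what the relevant excerpt of the Jobserver config might look like; the key names should be verified against the 0.8.0 application.conf linked above, and the host list and credentials are placeholders:

```hocon
spark.jobserver {
  # switch the DAO from the default to the Cassandra-backed implementation
  jobdao = spark.jobserver.io.JobCassandraDAO

  cassandra {
    hosts = ["cassandra-host:9042"]  # placeholder contact points
    user = ""
    password = ""
  }
}
```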

Cassandra Presto synchronization issue

We have been using Cassandra in our current live project for almost a year. We are on Cassandra 2.1.14, and we sometimes see a synchronization problem between Cassandra and Presto: when data is updated through Cassandra and I then fire a query from Presto, it doesn't return the data even though it exists in the database.
The second issue is that delete and update statements sometimes don't take effect. No error is shown, but the change is not applied.
Presto's Cassandra connector caches metadata, and the cache doesn't update immediately, which means you might not see some changes. I suggest you set cassandra.schema-cache-ttl to 0s; we're going to remove that caching altogether soon.
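That property goes in the catalog file for Presto's Cassandra connector, conventionally etc/catalog/cassandra.properties; a minimal sketch with placeholder contact points:

```properties
connector.name=cassandra
cassandra.contact-points=cassandra-host1,cassandra-host2
# disable metadata caching so schema changes are visible immediately
cassandra.schema-cache-ttl=0s
```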

Migrate Datastax Enterprise Cassandra to Apache Cassandra or Datastax Community?

I have a large but simple Cassandra database on a DataStax Enterprise 4.6 cluster. The license renewal is prohibitive for this very simple use case, and I am trying to migrate to either straight Apache Cassandra or the DataStax Community version. First, is it possible to do an in-place upgrade?
I have altered all the keyspaces to remove the "EverywhereStrategy" replication strategy, but I still get an error that the DSC version of Cassandra I'm trying to join to the cluster doesn't support it. I'm using matching Cassandra versions (2.0.16), and most other things seem to be close.
java.lang.RuntimeException: org.apache.cassandra.exceptions.ConfigurationException: Unable to find replication strategy class 'org.apache.cassandra.locator.EverywhereStrategy'
If it's not possible to do an in-place upgrade, what would be the best strategy for migrating a decent-size (30-node, 150 TB) cluster?
To make this work you have to remove any DSE-specific features you may have on any of your tables.
This meant I had to change the replication strategy on the dse_system keyspace from EverywhereStrategy to SimpleStrategy with RF=3 (or almost anything, since you can drop this keyspace after the conversion). The error message was:
java.lang.RuntimeException: org.apache.cassandra.exceptions.ConfigurationException: Unable to find replication strategy class 'org.apache.cassandra.locator.EverywhereStrategy'
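For reference, that strategy change is a single CQL statement (RF=3 as described above; adjust the replication factor to your cluster):

```sql
ALTER KEYSPACE dse_system
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```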
I also had to drop the unused CFS keyspaces. We never used the Hadoop/CFS integration, so we had nothing in these keyspaces anyway. I didn't capture the error for this.
We did have a Solr index on a table we were testing on this cluster about a year ago, so I had to drop that column family. The error message was:
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex
There may be other incompatibilities if you use other features of Datastax Enterprise that you would have to remove, but this was enough for me to get the migration working.
dse-core.jar contains the EverywhereStrategy class.
We solved this problem by doing the following:
Replace everything except that JAR so the nodes can come up fine. Once all nodes are migrated to OSS Cassandra, drop the dse_system keyspace (which uses this replication strategy), delete the JAR, and restart the nodes one by one.
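In CQL terms, the final cleanup step is just:

```sql
-- run only once all nodes are running OSS Cassandra
DROP KEYSPACE dse_system;
```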

Change Capture from DB2 to Cassandra

I am trying to get all inserts, updates, and deletes to a normalized DB2 database (hosted on an IBM mainframe) synced to a Cassandra database. I also need to denormalize these changes before writing them to Cassandra so that the data structure matches my Cassandra model.
I searched on Google, but the tools I found lack either the processing support or the streaming CDC support.
Is there any tool out there that can help me achieve the above?
It's likely that no stock tool exists. What's the format of the CDC stream coming out? What queries do you need to run? Like any other Cassandra data modeling question, start with the queries you need to run and work backwards to the table structure(s).
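To illustrate that query-first approach with a made-up example: if one query you must serve is "all orders for a customer, newest first", the denormalized Cassandra table falls out of the query directly (the table and columns here are hypothetical):

```sql
-- hypothetical table derived from the query "orders for a customer, newest first"
CREATE TABLE orders_by_customer (
    customer_id text,
    order_ts    timestamp,
    order_id    text,
    total       decimal,
    PRIMARY KEY ((customer_id), order_ts, order_id)
) WITH CLUSTERING ORDER BY (order_ts DESC, order_id ASC);
```

Each CDC event from DB2 would then be transformed into one or more writes against tables shaped like this, one table per query pattern.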

Possibilities of Hadoop with MSSQL Reporting

I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this an engineering exercise the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. As test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds, while the same query took over a minute both in the Hive query console and over the ODBC connection. Spark was faster (30 seconds), but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but that sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure it is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and at using Tez. I would create a cluster that has Tez enabled by default and then store your data in ORC format; your queries should be much more performant then.
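A minimal sketch of both suggestions in HiveQL (the table names are placeholders; on a cluster created with Tez enabled, the engine setting is usually already the default):

```sql
-- use Tez instead of classic MapReduce for this session
SET hive.execution.engine=tez;

-- rewrite the reporting data into ORC, a columnar format Hive can scan much faster
CREATE TABLE report_data_orc STORED AS ORC
AS SELECT * FROM report_data;
```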
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.
