Migration of Cassandra DB cluster from AWS to Azure

We have a requirement to migrate a Cassandra DB cluster running on CentOS 6.5 from AWS to Azure. It is a 4-node cluster with approximately 3 TB of data.
How can we meet this requirement?
We explored the methods listed below:
Cassandra replication: We created a VPN tunnel between AWS and Azure and tried Cassandra replication. This failed.
ASR: We tried Azure Site Recovery (ASR) for the migration and got the error: "The data size for physical or virtual machine should be less than or equal to 1023 GB" (ErrorID: 539).
AMI to VHD conversion: We were not able to find a way to convert an AWS Linux AMI to an Azure-supported VHD.
No matter which method we end up choosing, we are looking for a feasible solution.
We look forward to your reply.

I would ensure you are using NetworkTopologyStrategy for replication and GossipingPropertyFileSnitch as the snitch, make sure you have SSL and the cluster secured, and then just create the new nodes in Azure as their own DC and change the replication to include the new DC.
You can do this without a VPN, assuming you have the cluster secured for it. It's what you use when you replicate between regions in AWS anyway.
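As a rough sketch of that last step, assuming the Azure nodes are already up as their own DC (keyspace name, DC names and replication factors below are placeholders):

# add the new Azure DC to each application keyspace
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'aws-dc': 3, 'azure-dc': 3};"
# then, on every new Azure node, stream the existing data over from the AWS DC
nodetool rebuild -- aws-dc

Once the Azure DC is fully built and repaired, clients can be pointed at it and the AWS DC can be removed from the replication settings.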

Related

How does an HDInsight cluster map to Azure Storage as HDFS?

I have a fair idea of how Hadoop works, as I have studied the on-premise model since that's how everyone learns, and in that sense the top-level idea is fairly straightforward. We have a set of machines (nodes), we run certain processes on each of them, and then we configure those processes in such a way that the entire thing starts behaving as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster.
But when we start thinking of the same cluster in the cloud, this becomes a little confusing. Taking the case of an HDInsight Hadoop cluster, let's say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. Now the whole idea behind Hadoop is processing closest to where the data exists. In this case, when we create the Hadoop cluster, a bunch of Azure Virtual Machines start behind the scenes with their own underlying storage (though in the same region). But while creating the cluster we specify a default storage account and a few other storage accounts to be attached, where the data to be processed lies. So ideally the data to be processed needs to exist on the disks of the virtual machines. How does this work in Azure? I guess the virtual machines create disks that are actually pointers to Azure Storage accounts (default + attached)?
This part is not really explained well and is really cloudy, so a lot of people, including myself, are always in the dark when they learn the classic on-premise Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help the understanding. I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that works as a bridge, mapping Azure Storage as HDFS for the other services running in HDInsight.
You can read more about this driver's functionality on the official page below.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), then a different driver is used, documented on the following official page. It offers some advanced ADLS Gen2 capabilities to boost your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
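For example (the storage account and container/filesystem names below are placeholders), blob-backed storage is addressed with the wasb/wasbs scheme and ADLS Gen2 storage with the abfs/abfss scheme:

hdfs dfs -ls wasbs://<container>@<account>.blob.core.windows.net/
hdfs dfs -ls abfss://<filesystem>@<account>.dfs.core.windows.net/

These commands assume the account is attached to (or otherwise configured for) the HDInsight cluster.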
Finally, just like an on-prem Hadoop installation, HDInsight also has a local HDFS deployed across the hard drives of the cluster VMs. You can access this local HDFS using a URI like the one below.
hdfs://mycluster/
For example, you can issue the following to view the root-level content of your local HDFS.
hdfs dfs -ls hdfs://mycluster/

How to configure Hazelcast clusters in a local environment for testing

I'm working on creating a Hazelcast backup cluster alongside the primary cluster.
I want to set up the primary cluster on one machine and the backup cluster on another machine so that it holds the backup maps.
How do I do that?
If you want to synchronize an entire cluster from a remote cluster, you'll need Hazelcast's WAN Replication feature, which is available in the Enterprise version. Please see the documentation at https://hazelcast.com/product-features/wan-replication.
If you actually only want to maintain backups within the same cluster, by virtue of having multiple nodes within the same datacenter, then this is available out of the box in the open-source edition. For example, a Map has a backup count of 1 by default, so if you have 2 machines clustered, you will already have a backup of every entry.

What is the best way to move a Cassandra cluster from AWS to Google Cloud?

We have a Cassandra cluster consisting of two datacenters, one on AWS and the other on our on-premises servers.
The cluster is running DSE 5.0 and we need to move the AWS DC to Google Cloud and upgrade the cluster to DSE 5.1.
Can I create a new DC in Google with DSE 5.1, join it to the current cluster which is running DSE 5.0, and then shut down the AWS DC after the data has been transferred to the new DC?
Or should I create a new cluster on Google, transfer the data manually from AWS to Google, then wipe the on-premises DC and join it to the new cluster on Google?
Or are there other solutions?
Thanks for the help.
I would suggest creating a new DC in Google Cloud (on DSE 5.0; don't mix different versions during a transition, schema change, or topology change), joining it to the existing cluster, running repair, and then shutting down the AWS DC. You need to make sure you have connectivity between all DCs during this transition. This approach keeps the existing DCs available and allows a phased transition of applications from the AWS DC to the Google DC.
Avoid upgrading to DSE 5.1 until the transition is complete.
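As a rough sketch of retiring the AWS DC once the Google DC has been built and caught up (keyspace and DC names below are placeholders):

# make sure the data is consistent across DCs first
nodetool repair -pr        # run on every node, one node at a time
# drop the AWS DC from the replication settings of each application keyspace
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'onprem-dc': 3, 'google-dc': 3};"
# then take each AWS node out of the ring cleanly
nodetool decommission      # run on every AWS node

This mirrors the usual procedure for decommissioning a datacenter; adjust keyspace names, DC names and replication factors to match your cluster.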

Maintaining a replicated database in Kubernetes

I have a replicated Cassandra database and would like to know the best way to maintain its data.
Currently I'm using a Kubernetes emptyDir volume for the Cassandra container.
Can I use Google's persistent disks for the replicated Cassandra DB pods?
If I have 3 Cassandra nodes and one of them fails or is destroyed, what happens to the data on Google's persistent disks?
If all 3 nodes fail, will I still be able to populate the DB data from Google's persistent disks into the new pods that spin up?
How do I back up the DB data that is on Google's persistent disks?
I will answer your questions in the same order:
1: You can use Google's persistent disks for the master Cassandra node, and all the other Cassandra replicas will just use their local emptyDir.
2: When deploying to the cloud, the expectation is that instances are ephemeral and might die at any time. Cassandra is built to replicate data across the cluster for redundancy, so that if an instance dies, the data stored on it does not, and the cluster can react by re-replicating the data to other running nodes. You can use a DaemonSet to place a single pod on each node in the Kubernetes cluster, which will give you data redundancy.
3: Is it possible to provide more information here? How will the new pods spin up?
4: Take a snapshot of the disk, or use emptyDir with a sidecar container that periodically snapshots the directory and uploads it to Google Cloud Storage.
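For the persistent-disk route, a minimal backup sketch (disk name and zone below are placeholders) is to snapshot the disk with gcloud, which can be run on a schedule:

# snapshot the persistent disk backing one of the Cassandra pods
gcloud compute disks snapshot cassandra-data-0 --zone=us-central1-a --snapshot-names=cassandra-data-0-$(date +%Y%m%d)

Taking a nodetool snapshot inside the pod beforehand also helps ensure the on-disk SSTables are consistent at the moment the disk snapshot is cut.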

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or
to S3 and then use EMR or your own clusters on Amazon EC2.
If they (data.seattle.gov) have relevant query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. That will only work if you are doing very serious data reduction in these queries; otherwise network bandwidth will limit performance.
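As a minimal sketch of the download approach (the dataset identifier below is a placeholder; the portal exposes a Socrata-style REST endpoint):

# pull a dataset export from the portal's REST API, then load it into HDFS
curl -o seattle_dataset.csv 'https://data.seattle.gov/resource/<dataset-id>.csv'
hdfs dfs -mkdir -p /data/seattle
hdfs dfs -put seattle_dataset.csv /data/seattle/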
In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from your Hadoop Cluster":
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get your data from the Azure Marketplace, e.g. government data sets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx
