Maintaining a replicated database in Kubernetes - Cassandra

I have a replicated Cassandra database and would like to know the best way to maintain its data.
Currently I'm using a Kubernetes emptyDir volume for the Cassandra container.
Can I use Google's persistent disks for the replicated Cassandra DB pods?
If I have 3 Cassandra nodes and one of them fails or is destroyed, what happens to the data on Google's persistent disks?
If all 3 nodes fail, will I still be able to populate the DB data from Google's persistent disks onto the new pods that spin up?
How do I back up the DB data that is on Google's persistent disks?

I will answer your questions in the same order:
1: You can use a Google persistent disk for one Cassandra node and let all the other Cassandra replicas use their local emptyDir volumes; Cassandra's own replication will keep their data populated.
2: When deploying to the cloud, the expectation is that instances are ephemeral and might die at any time. Cassandra is built to replicate data across the cluster for redundancy, so that when an instance dies, the data stored on it does not die with it, and the cluster can react by re-replicating the data to other running nodes. You can use a DaemonSet to place a single Cassandra pod on each node in the Kubernetes cluster, which gives you that data redundancy.
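A minimal sketch of that DaemonSet, assuming the stock cassandra image and an emptyDir data volume (the image tag, labels and mount path here are illustrative, not taken from your setup):

```yaml
# Sketch: one Cassandra pod per Kubernetes node via a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cassandra
spec:
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:3.11          # assumed image/tag
          ports:
            - containerPort: 9042        # CQL
            - containerPort: 7000        # intra-node communication
          volumeMounts:
            - name: data
              mountPath: /var/lib/cassandra
      volumes:
        - name: data
          emptyDir: {}                   # replace with a persistent disk if data must outlive the node
```

With emptyDir, a pod that lands on a new node starts empty and relies on Cassandra to stream its data back from the other replicas; swap the volume for a persistent disk if you want the data to survive independently of the cluster.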
Is it possible to provide more information here? How will the new pods spin up?
Take a snapshot of the disk, or use emptyDir with a sidecar container that periodically snapshots the data directory and uploads it to Google Cloud Storage.
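A hedged sketch of that sidecar approach, assuming a google/cloud-sdk image for gsutil and a made-up GCS bucket name; credentials and a proper nodetool snapshot step are left out for brevity:

```yaml
# Sketch: Cassandra container plus a backup sidecar sharing one emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-with-backup
spec:
  containers:
    - name: cassandra
      image: cassandra:3.11              # assumed image/tag
      volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
    - name: backup
      image: google/cloud-sdk:slim       # provides gsutil; needs GCS credentials (omitted here)
      command: ["/bin/sh", "-c"]
      args:
        - |
          # For a consistent backup you would run "nodetool snapshot" first and
          # copy the resulting snapshot directories rather than the live data dir.
          while true; do
            gsutil -m rsync -r /var/lib/cassandra gs://my-cassandra-backups/$(hostname)
            sleep 3600
          done
      volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
          readOnly: true
  volumes:
    - name: data
      emptyDir: {}
```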

Related

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the points below. It would be helpful if anyone can share their knowledge on this.
Can a table created in Cluster A be accessed from Cluster B if Cluster A is down?
What is the connection between a cluster and the data in the tables?
You need to have a running process (a cluster) to be able to access the metastore and read data, because the data is stored in the customer's location and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data will be available in another cluster under the following conditions:
both clusters use the same metastore
the user has the correct permissions (which can be enforced via Table ACLs)
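On the permissions point, a minimal illustration of a Table ACL grant; the schema, table and user below are made up, and table access control has to be enabled on the cluster for this to be enforced:

```sql
-- Hypothetical table and principal; run by an owner/admin on a cluster
-- with table access control enabled.
GRANT SELECT ON TABLE sales.orders TO `analyst@example.com`;
```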

Issue with persistent storage in Azure Kubernetes Service using Azure Disk

Not able to set up a persistent volume using Azure Disk.
We are trying to deploy an application on AKS, and the application needs to use a persistent volume. If we use an Azure disk, we have noticed that when the node hosting the pod that runs the application container is stopped or fails, another pod is spun up on another node, but it no longer has access to the persistent volume.
As per the documentation, an Azure disk is mapped to a particular node, while a file share is shared across nodes. What is the way to ensure that an application running on AKS using a persistent volume does not lose access to its data if a pod or node stops working?
We are looking for a persistent storage solution so that an application with 3 pods as a replica set can use an Azure disk persistent volume in AKS.
For an Azure disk to work as a persistent storage volume in AKS, it has to be attached to a single node, so it cannot share files between multiple pods. If you want to share and persist files between pods, whichever nodes those pods land on, an Azure file share is the better option.
In short, if you have multiple nodes and the deployment has 3 replicas, the best way to share and persist data between the pods is to use an Azure file share or NFS.
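A hedged sketch of that setup, assuming the built-in azurefile storage class in AKS (the claim name, size and container image are placeholders):

```yaml
# PersistentVolumeClaim backed by Azure Files; ReadWriteMany lets pods on
# different nodes mount the same share. Verify the class name with: kubectl get sc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
---
# Mounting the claim from a 3-replica Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx:1.25              # placeholder image
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: app-shared-data
```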

Is it possible to take a snapshot of an existing HDInsight cluster in Azure

We currently have an HDInsight cluster which we might have to shut down or delete for a few days. We need the cluster to be in the same state as we left it. What are the ways we can preserve the current state of this cluster and restore it after a few days?
It depends on how you created the HDInsight cluster. When you created the cluster, did you specify external metastores, so that your Hive metastore is running on your own Azure SQL database and not the one that HDInsight created?
Check this documentation.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#use-hiveoozie-metastore
If you haven't used external metastores when you created the cluster, unfortunately, you will lose that state. Your data, however, will be persisted in Azure Blob storage or Azure Data Lake Store.
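For context, an external Hive metastore comes down to standard Hive JDBC settings pointing at a database you own; the fragment below is illustrative only (server, database and credentials are placeholders), and on HDInsight you would supply these through the portal or an ARM template at provisioning time rather than editing hive-site.xml by hand:

```xml
<!-- Illustrative hive-site.xml fragment: metastore JDBC settings pointing at
     your own Azure SQL database. All values are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://myserver.database.windows.net:1433;database=hivemetastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>********</value>
</property>
```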

Migration of Cassandra DB cluster from AWS to Azure

We have a requirement to migrate a Cassandra DB cluster running on CentOS 6.5 from AWS to Azure. It is a 4-node cluster with approximately 3 TB of data.
How can we meet this requirement?
We explored several methods, mentioned below:
Cassandra replication: we created a VPN tunnel between AWS and Azure and tried Cassandra replication. This failed.
ASR: we tried ASR for the migration and got the error: "The data size for physical or virtual machine should be less than or equal to 1023 GB". ErrorID: 539.
AMI to VHD conversion: we are not able to find a way to convert an AWS Linux AMI to an Azure-supported VHD.
No matter which method we choose, we are looking for a feasible solution.
We look forward to your reply.
I would ensure you are using NetworkTopologyStrategy for replication and the GossipingPropertyFileSnitch, make sure you have SSL enabled and the cluster secured, and then just create the new nodes in Azure as their own DC and change the keyspace replication to include the new DC.
You can do this without a VPN, assuming you have the cluster secured for it. This is what you use when you replicate between regions in AWS anyway.
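Roughly, once the Azure nodes have joined as their own DC, the steps look like this; the keyspace and DC names are placeholders and must match what you set in each node's cassandra-rackdc.properties:

```sh
# 1. Include the new Azure DC in the keyspace replication (run from any node):
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy', 'aws_dc': 3, 'azure_dc': 3 };"

# 2. On each new Azure node, stream the existing data over from the AWS DC:
nodetool rebuild -- aws_dc

# 3. Once the Azure DC is validated, remove aws_dc from the replication map and
#    decommission the AWS nodes one at a time with: nodetool decommission
```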

HDInsight cluster size when using Azure Blob storage

When using HDInsight and choosing Azure Blob storage to store the data that needs to be computed, you still have to choose the number of data nodes when provisioning a new cluster. If your data is stored in Azure Blob storage, what impact does the number of data nodes have? Is the data from the blob actually replicated onto the data nodes?
If you put data on the Azure Blob Store, it stays there, and is read directly from Azure Storage.
The data nodes in the HDInsight cluster have two purposes. Firstly, they run the actual compute jobs, which read from Azure Storage directly. This is not as crazy as it might sound to an HDFS user because of Azure's consistent underlying fabric, which keeps the storage nice and close to the compute.
Secondly, the data nodes are running an HDFS filesystem on their local disk. This is generally only used for intermediate and tmp files in HDInsight, since it is transitory (only lasts as long as the cluster).
So, choosing the number of data nodes is essentially choosing how many job-running nodes (YARN application containers, or JobTracker slots, depending on version) you want to be able to handle, and to a lesser extent, choosing how much temp space your jobs need.
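To make that concrete, jobs address blob data directly through a wasb:// URI; the storage account and container below are placeholders, and the same path keeps working even after the cluster is resized or recreated:

```sh
# List data that lives in the cluster's attached storage account;
# nothing here is copied onto the worker nodes' local HDFS.
hdfs dfs -ls wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data/
```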
