How to Configure Hazelcast Clusters in a Local Environment for Testing - hazelcast

I'm working on creating a Hazelcast backup cluster alongside the primary cluster.
I want to set up the primary cluster on one machine and a backup cluster on another machine, so that the backup cluster holds backup copies of the maps.
How do I do that?

If you want to synchronize an entire cluster from a remote cluster, you'll need Hazelcast's WAN Replication feature, which is available in the Enterprise edition. Please see the documentation at https://hazelcast.com/product-features/wan-replication.
If you actually only want to maintain backups within the same cluster, by virtue of having multiple nodes in the same datacenter, then this is available out of the box in the open-source edition. For example, a Map has a backup count of 1 by default, so if you have two machines clustered, you already have a backup of every entry.
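If you go the open-source route, here is a minimal sketch of setting the backup count programmatically (this assumes Hazelcast 3.x; the map name and values used here are just illustrative):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class BackupCountExample {
    public static void main(String[] args) {
        Config config = new Config();
        // One synchronous backup per entry; this is already the default for maps.
        config.getMapConfig("default").setBackupCount(1);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, String> orders = hz.getMap("orders");
        orders.put("order-1", "pending");
        // Once a second member joins the cluster, every entry also has a backup copy on it.
    }
}

Start the same program on a second machine on the same network and the two members form one cluster, with each entry backed up on the other member.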

Related

How does an HDInsight cluster map to Azure Storage as HDFS?

I have a fair idea of how Hadoop works, as I have studied the on-premise model, since that's how everyone learns. In that sense the top-level idea is fairly straightforward: we have a set of machines (nodes), we run certain processes on each of them, and we configure those processes so that the whole thing behaves as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster.

But when we start thinking about the same cluster in the cloud, things become a little confusing. Taking the case of an HDInsight Hadoop cluster: let's say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. Now the whole idea behind Hadoop is processing closest to where the data exists. In this case, when we create the Hadoop cluster, a bunch of Azure virtual machines start behind the scenes with their own underlying storage (though in the same region). But while creating the cluster we also specify a default storage account, and a few other storage accounts to be attached, where the data to be processed lies. So ideally the data to be processed would need to exist on the disks of the virtual machines. How does this work in Azure? I guess the virtual machines create disks that are actually pointers to the Azure Storage accounts (default + attached)?

This part is not really explained well, and a lot of people, including myself, are left in the dark when they learn the classic on-premise Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help the understanding. I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that works as a bridge, mapping Azure Storage as HDFS for the other services running in HDInsight.
You can read more about this driver's functionality on the official page below:
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), then a different driver is used, documented on the following official page. It offers some advanced ADLS Gen2 capabilities that can improve your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
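To see what this mapping looks like from application code, here is a minimal sketch using the standard Hadoop FileSystem Java API; the storage account, container, and path below are placeholders, and on an HDInsight node the account credentials are assumed to already be configured in core-site.xml:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAzureStorage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // wasb:// addresses classic Blob storage; for ADLS Gen2 use
        // abfs://mycontainer@myaccount.dfs.core.windows.net/ instead.
        URI container = new URI("wasb://mycontainer@myaccount.blob.core.windows.net/");
        FileSystem fs = FileSystem.get(container, conf);
        for (FileStatus status : fs.listStatus(new Path("/example/data"))) {
            System.out.println(status.getPath());
        }
    }
}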
Finally, just like an on-prem Hadoop installation, HDInsight also has a local HDFS that is deployed across the hard drives of the cluster VMs. You can access this local HDFS using the URI below.
hdfs://mycluster/
For example, you can issue the following to view the root-level content of your local HDFS.
hdfs dfs -ls hdfs://mycluster/

Migration of Cassandra DB cluster from AWS to Azure

We have a requirement to migrate a Cassandra DB cluster running on CentOS 6.5 from AWS to Azure. It is a 4-node cluster with approximately 3 TB of data.
How can we meet this requirement?
We have explored the methods below:
Cassandra replication: we created a VPN tunnel between AWS and Azure and tried Cassandra replication. // Failed.
ASR: we tried ASR (Azure Site Recovery) for the migration and got the error: "The data size for physical or virtual machine should be less than or equal to 1023 GB". ErrorID: 539.
AMI-to-VHD conversion: we could not find a way to convert an AWS Linux AMI to an Azure-supported VHD.
Whichever method we choose, we are looking for a feasible solution.
Looking forward to your reply.
I would ensure you are using NetworkTopologyStrategy for replication and GossipingPropertyFileSnitch as the snitch, make sure SSL is enabled and the cluster is secured, and then just create the new nodes in Azure as their own DC and change the replication settings to include the new DC.
You can do this without a VPN, assuming you have the cluster secured for it. It's what you use when you replicate between regions within AWS anyway.
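As a rough sketch of that last step (assuming the DataStax Java driver 3.x; the contact point, keyspace name, data-center names, and replication factors are placeholders), once the Azure nodes have joined as their own data center you would alter the keyspace replication and then rebuild the new nodes:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ExtendReplicationToAzure {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.10")   // any reachable node in the existing AWS DC
                .build();
             Session session = cluster.connect()) {
            // Include the new Azure DC in the keyspace's NetworkTopologyStrategy settings.
            session.execute(
                "ALTER KEYSPACE my_keyspace WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'aws_dc': 3, 'azure_dc': 3}");
            // Then run `nodetool rebuild -- aws_dc` on each new Azure node to stream the existing data.
        }
    }
}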

Maintaining replicated database in kubernetes

I have a replicated Cassandra database and would like to know the best way to maintain its data.
Currently I'm using a Kubernetes emptyDir volume for the Cassandra container.
Can I use Google's persistent disks for the replicated Cassandra DB pods?
If I have 3 Cassandra nodes and one of them fails or is destroyed, what happens to the data on Google's persistent disks?
If all 3 nodes fail, will I still be able to populate the DB data from Google's persistent disks into the new pods that spin up?
How do I back up the DB data that is on Google's persistent disks?
I will answer your questions in the same order:
1: You can use Google's persistent disks for the master Cassandra node, and all the other Cassandra replicas will just use their local emptyDir.
2: When deploying to the cloud, the expectation is that instances are ephemeral and might die at any time. Cassandra is built to replicate data across the cluster to provide data redundancy, so that if an instance dies, the data stored on it does not, and the cluster can react by re-replicating the data to other running nodes. You can use a DaemonSet to place a single pod on each node in the Kubernetes cluster, which will give you data redundancy.
Is it possible to provide more information here? How will the new pods spin up?
Take a snapshot of the disk, or use emptyDir with a sidecar container that periodically snapshots the directory and uploads it to Google Cloud Storage.
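For the sidecar idea, here is a minimal sketch of just the upload step using the Google Cloud Storage Java client (the bucket name and snapshot path are placeholders, and producing the snapshot archive itself, e.g. from nodetool snapshot output, is left out):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class SnapshotUploader {
    public static void main(String[] args) throws Exception {
        // Uses the default credentials available to the pod (e.g. the node's service account).
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Path snapshot = Paths.get("/var/backups/cassandra-snapshot.tar.gz");
        BlobId blobId = BlobId.of("my-cassandra-backups", snapshot.getFileName().toString());
        storage.create(BlobInfo.newBuilder(blobId).build(), Files.readAllBytes(snapshot));
        System.out.println("Uploaded " + snapshot + " to gs://my-cassandra-backups/");
    }
}

Running something like this on a schedule in a container that shares the emptyDir volume with the Cassandra container gives you off-cluster backups without persistent disks.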

How to run a query over more than one Azure HDInsight (HBase) cluster installed in different regions?

I am new to Azure and HBase.
Say I have 2 HDInsight (HBase) clusters, one installed in Asia and one in Europe, to get better read/write performance for users accessing them from different countries.
But how do I run a query over all the data in these clusters? Do I need to run the query separately on each cluster and then combine the results? Or is there some built-in feature like Distributed Queries for SQL Server?
There is no distributed query across clusters in HBase. In your scenario the best solution would probably be setting up replication between the two HBase clusters and then querying one of them. The data in both clusters will be complete, with the data from the other cluster a few minutes stale, since replication is asynchronous. You can also set up more complex replication topologies and have a separate central cluster that holds the superset of data while the two others hold their local subsets.
The HDInsight team is working on documentation for replication setup in Azure. For now you would need to work out the configuration yourself: provision the clusters in VNets, connect the VNets, ensure name resolution is set up correctly, and then use the HBase replication setup steps to configure replication itself: http://hbase.apache.org/book.html#_cluster_replication
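Once the VNets are connected and name resolution works, the replication peer from the linked guide can also be added programmatically; here is a minimal sketch assuming an HBase 2.x client (the ZooKeeper quorum, peer id, and table name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddReplicationPeer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Cluster key of the remote cluster: its ZooKeeper quorum, client port, and znode parent.
            ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk1.europe.internal,zk2.europe.internal:2181:/hbase")
                    .build();
            admin.addReplicationPeer("europe", peer);
            // Turn on replication for the table's column families (sets REPLICATION_SCOPE to 1).
            admin.enableTableReplication(TableName.valueOf("my_table"));
        }
    }
}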
Without a replication solution you would need to query both clusters separately.

How to create a new HDInsight cluster on an existing storage container

I am using HDInsight on Azure to research the scalability of ranking machine learning methods (learning to rank, for the insiders) on Hadoop. I managed to test-run my implementation of a learning-to-rank algorithm on an HDInsight cluster and clocked the time it took to complete.
Now I want to run the same code over and over again with different numbers of cores to see how the running time scales as a function of the number of cores. From other questions on this forum I understood that HDInsight does not allow changing the number of cores of a cluster. Would it instead be possible in some way to delete the current cluster and then create a new cluster that makes use of the exact same container on my Azure Storage? I tried to do this by simply giving the new cluster the same name as the previous one (since the container that is created for a new cluster is automatically named after the cluster at creation time), but that doesn't work, because the new container created for the new cluster gets "-1" appended to the cluster name. The data file that I am trying to process is around 15 GB in size, so it would be a real pain if I had to upload this file to the cluster container for every cluster that I create.
Any help on how I can run my algorithms on HDInsight with varying numbers of cores without having to re-upload my input data for each point of measurement would be very much appreciated!
Kind Regards,
Niek Tax
You should be able to link your existing storage container to an HDInsight cluster. According to http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/#benefits:
Using the custom create, you have one of the following options for the default storage account:
Use existing storage
Create new storage
Use storage from another subscription.
You also have the option to create your own Blob container or use an existing one.
The link shows how you can do that through the Windows Azure Portal.
