Setting up a multi-datacenter cluster using OpsCenter - Cassandra

Can we create a multi-datacenter cluster using OpsCenter alone? I am able to create one ring, but it is not clear how I can specify the data center settings for the nodes in the second ring.

Currently it's not possible to create a multi-datacenter cluster with OpsCenter. It can manage such a cluster if you create it yourself, but it cannot create one.
Here are the relevant docs for doing a multi-DC install:
Cassandra: Initializing a multiple node cluster (multiple data centers)
DSE: Multiple data center deployment per workload type
FYI regarding @phact's comment above: OpsCenter does automatically create separate logical datacenters when using DataStax Enterprise, separating the workload into Cassandra, Solr, and Analytics DCs. However, it does not support creating multiple Cassandra-only datacenters, for example.
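For reference, a minimal sketch of the manual steps those docs walk through, assuming GossipingPropertyFileSnitch; the datacenter and rack names, the config path, and the keyspace below are placeholders:

# On each node of the second ring, declare its datacenter and rack in
# cassandra-rackdc.properties (values are examples):
cat > /etc/cassandra/conf/cassandra-rackdc.properties <<'EOF'
dc=DC2
rack=RAC1
EOF

# In cassandra.yaml on every node, keep the same cluster_name, set
# endpoint_snitch: GossipingPropertyFileSnitch, and list seeds from both DCs.

# Once both rings are up, keyspaces that span datacenters use NetworkTopologyStrategy:
cqlsh -e "CREATE KEYSPACE app WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"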

Related

Is there a way I can share data stored in Hazelcast server members (A, B, C) in AKS cluster 1 with Hazelcast server members (D, E, F) in AKS cluster 2?

I have 3 nodes (members) of Hazelcast server and 3 nodes (members) of Hazelcast client running in a Kubernetes cluster in the east region.
I have 3 nodes (members) of Hazelcast server and 3 nodes (members) of Hazelcast client running in a Kubernetes cluster in the west region.
My use case is to store data in both the east and west region Kubernetes clusters so that if either region goes down, we can still get the data from the other.
I am using Azure Kubernetes Service, and the namespace name is the same in both regions' clusters.
Any help will be appreciated. Thanks.
This is the use case that the WAN replication feature was designed for; it provides disaster recovery capabilities by replicating the content of a cluster (either completely, or just selected IMaps) to a remote cluster. WAN replication is an Enterprise feature, so it isn't available in the Open Source distribution.
If you're trying to do something similar in Open Source, you could write MapListeners that observe all changes made on one cluster and then send the changes to the remote cluster. The advantage of WAN replication (other than not having to write it yourself) is that it has optimizations to batch writes together, configuration options to help trade off performance vs. consistency, and very efficient delta sync logic to help efficiently resynchronize clusters after a network or other outage.

MarkLogic Cluster - Configure Forest with all documents

We are working on MarkLogic 9.0.8.2.
We are setting up a MarkLogic cluster (3 VMs) on Azure and, as per the failover design, want to have 3 forests (one per node) in Azure Blob storage.
I am done with the setup, and when I started ingestion I found that documents are distributed across the 3 forests rather than all being stored in each forest.
For example, I ingested 30000 records and each forest contains 10000 records.
What I need is for every forest to have all 30000 records.
Is there any configuration (at the DB or forest level) I need to achieve this?
MarkLogic failover does not work the same way as in some other NoSQL document databases, which may keep a copy of every document on each host.
The clustered nature of MarkLogic distributes the documents across the hosts to provide a balance of availability and resource consumption. For failover protection, you must create additional forests on each host and attach them to your existing forests as replicas. This ensures availability should any one of the three hosts fail.
Here is a sample forest layout:
Host 1: primary_forest_01 replica_forest_03
Host 2: primary_forest_02 replica_forest_01
Host 3: primary_forest_03 replica_forest_02
The replica forest must be on a different host than the primary forest, and if there are multiple forests per host, they should be striped across hosts to best balance out resource consumption when failed over.
It's also important to note that for HA, you need replicas configured for the system databases as well.
So there is no database setting that puts all the documents on every host, because that is not the way MarkLogic is designed to work. The Scalability, Availability and Failover Guide is very informative, and in this case the High Availability of Data Nodes with Failover section is particularly relevant. I also highly recommend checking out the free training that MarkLogic offers.
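If you prefer to script the replica forests rather than click through the Admin UI, here is a rough sketch against the MarkLogic Management API; the host names, forest names, and credentials are hypothetical and follow the layout above:

# Create replica_forest_01 on host 2 via the Management API (port 8002):
curl --anyauth -u admin:password -X POST \
  -H "Content-Type: application/json" \
  -d '{"forest-name": "replica_forest_01", "host": "host2.example.com"}' \
  http://host1.example.com:8002/manage/v2/forests

# Repeat for replica_forest_02 on host 3 and replica_forest_03 on host 1, then
# attach each replica to its primary forest (on the forest's configuration page
# in the Admin UI, or via the forest properties endpoint) so failover can take over.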

How to design an Azure HDInsight cluster

I have a query on Azure HDInsight. How do I design an Azure HDInsight cluster to match my on-premises infrastructure?
What are the major parameters I need to consider before designing the cluster?
For example, if I have 100 servers running on-premises, how many nodes do I need to select in my cloud cluster? In AWS we have an EMR sizing calculator and a cluster planner/advisor. Is there any similar planning mechanism in Azure apart from the Pricing Calculator? Please clarify and provide your inputs; an example would be really great. Thanks.
Before deploying an HDInsight cluster, plan for the desired cluster capacity by determining the needed performance and scale. This planning helps optimize both usability and costs. Some cluster capacity decisions cannot be changed after deployment. If the performance parameters change, a cluster can be dismantled and re-created without losing stored data.
The key questions to ask for capacity planning are:
In which geographic region should you deploy your cluster?
How much storage do you need?
What cluster type should you deploy?
What size and type of virtual machine (VM) should your cluster nodes use?
How many worker nodes should your cluster have?
Each of these questions is addressed in "Capacity planning for HDInsight clusters".
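As a rough, non-authoritative sketch of what that looks like once those questions are answered, the Azure CLI lets you pin the cluster type, node counts, and VM sizes at creation time; every name, size, and count below is a placeholder, so verify the flags with az hdinsight create --help:

# Hypothetical Spark cluster with an explicit worker count and VM sizes:
az hdinsight create \
  --name my-hdi-cluster \
  --resource-group my-rg \
  --location eastus \
  --type spark \
  --workernode-count 8 \
  --workernode-size Standard_D13_V2 \
  --headnode-size Standard_D12_V2 \
  --storage-account mystorageaccount \
  --http-user admin \
  --http-password '<strong-password>'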

On what nodes should Kafka Connect distributed be deployed on Azure Kafka for HDInsight?

We are running a lot of connectors on-premises and we need to move to Azure. These on-premises machines run the Kafka Connect API on 4 nodes. We deploy it by executing the following on all of those machines:
export CLASSPATH=/path/to/connectors-jars
/usr/hdp/current/kafka-broker/bin/connect-distributed.sh distributed.properties
We have Kafka deployed on Azure Kafka for HDInsight. We need at least 2 nodes running the distributed Connect API, and we don't know where to deploy them:
On head nodes (we still don't know what these are for)
On worker nodes (where kafka brokers live)
On edge nodes
We also have Azure AKS running containers. Should we deploy the distributed Connect API on AKS?
where kafka brokers live
Ideally, no. Connect uses lots of memory when batching lots of records. That memory is better left to the page cache for the broker.
On edge nodes
Probably not. That is where your users interact with your cluster. You wouldn't want them poking at your configurations or accidentally messing up the processes in other ways. For example, we had someone fill up an edge node's local disk because they were copying large volumes of data in and out of the "edge".
On head nodes
Maybe? But then again, those are only for cluster admin services, and probably have little memory.
Better solution: run dedicated instances in Azure, outside of HDInsight, that only run Kafka Connect. Perhaps run them as containers in Kubernetes, because they are completely stateless services and only need access to your sources, sinks, and Kafka brokers for transferring data. This way, they can be upgraded and configured separately from what Hortonworks and HDInsight provide.
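To make that concrete, here is a sketch of a dedicated Connect worker running outside the HDInsight cluster (a small VM or a container); the broker addresses, topic names, and paths are placeholders:

# Worker config pointing at the HDInsight Kafka brokers (the worker nodes):
cat > distributed.properties <<'EOF'
bootstrap.servers=wn0-kafka.example.net:9092,wn1-kafka.example.net:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
plugin.path=/opt/connectors
EOF

# plugin.path replaces the CLASSPATH export used on the old nodes; launch the
# worker with the same entry point as before:
/opt/kafka/bin/connect-distributed.sh distributed.properties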

Using Apache Falcon to set up data replication across clusters

We have been PoC-ing Falcon for our data ingestion workflow. We have a requirement to use Falcon to set up replication between two clusters (feed replication, not mirroring). The problem I have is that the user ID on cluster A is different from the ID on cluster B. Has anyone used Falcon with this setup? I can't seem to find a way to get this to work.
1) I am setting up a replication from Cluster A => Cluster B
2) I am defining the falcon job on cluster A
At the time of job setup it looks like I can only define one user ID that owns the job. How do I set up a job where the ID on cluster A is different from the ID on cluster B? Any help would be awesome!
Apache Falcon uses the 'ACL owner', which should have write access on the target cluster where the data is to be copied.
The source cluster should have WebHDFS enabled, which is how the data will be accessed.
So do not schedule the feed on the source cluster if the user does not have write access there, since write access is required for retention.
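For illustration, here is a rough sketch of a replication feed with a source and a target cluster and an explicit ACL owner (the single identity the job runs as); the feed name, cluster names, paths, dates, and owner/group values are placeholders:

# Feed entity replicating data from clusterA (source) to clusterB (target):
cat > replication-feed.xml <<'EOF'
<feed name="raw-events-replication" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="clusterA" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
    <cluster name="clusterB" type="target">
      <validity start="2016-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/raw-events/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl_user" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
EOF

# Submit and schedule as the ACL owner:
falcon entity -type feed -submit -file replication-feed.xml
falcon entity -type feed -schedule -name raw-events-replication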
Hope this helps.

Resources