Using Apache Falcon to set up data replication across clusters

We have been PoC-ing Falcon for our data ingestion workflow. We have a requirement to use Falcon to set up replication between two clusters (feed replication, not mirroring). The problem I have is that the user ID on cluster A is different from the ID on cluster B. Has anyone used Falcon with this setup? I can't seem to find a way to get this to work.
1) I am setting up a replication from Cluster A => Cluster B
2) I am defining the falcon job on cluster A
At the time of the job setup it looks like I can only define one user ID that owns the job. How do I set up a job where the ID on cluster A is different from the ID on cluster B? Any help would be awesome!

Apache Falcon uses an 'ACL owner', which should have write access on the target cluster where the data is to be copied.
The source cluster should have WebHDFS enabled, through which the data will be accessed.
So don't schedule the feed on the source cluster if the user there does not have the write access required for retention.
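For concreteness, a minimal sketch of what such a feed definition looks like (cluster names, paths, dates, and the owner are placeholders). The feed entity carries a single ACL element, which is why only one owning user can be declared; that owner must be able to write on the target cluster:

    <feed name="events-feed" description="replicated feed" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <timezone>UTC</timezone>
        <clusters>
            <!-- Cluster A: where the data is produced; needs WebHDFS enabled -->
            <cluster name="clusterA" type="source">
                <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
                <retention limit="days(7)" action="delete"/>
            </cluster>
            <!-- Cluster B: where the data is copied; the ACL owner must have write access here -->
            <cluster name="clusterB" type="target">
                <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
                <retention limit="days(30)" action="delete"/>
            </cluster>
        </clusters>
        <locations>
            <location type="data" path="/data/events/${YEAR}-${MONTH}-${DAY}"/>
        </locations>
        <ACL owner="repl-user" group="hadoop" permission="0755"/>
        <schema location="/none" provider="none"/>
    </feed>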
Hope this helps.

Related

Propagating Google service account keys in a large on-prem Hadoop cluster for distCp -> HCFS -> GCS

As the title describes, I am trying to glean some details about how service accounts (multiple service accounts across a large cluster) can be handled securely in a large on-premises Hadoop cluster that runs workflows for many different teams.
What's an effective way to manage propagation of these keys so they're available across the data nodes without also making them available to other users? I know there are certain parameters one must set in core-site.xml, but that applies at the cluster level; how can this be done securely just for my project?
I will have workflows scheduled via Oozie jobs that push a subset of data to GCS (I already know this is possible manually, using gsutil).
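For context, these are the kind of core-site.xml entries the question refers to, assuming the standard GCS connector properties (the keyfile path is a placeholder). Note that Hadoop generally also allows such properties to be supplied per job (e.g. via -D on the command line or in the job's own configuration) rather than cluster-wide, which is one angle on scoping a key to a single project:

    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>google.cloud.auth.service.account.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>google.cloud.auth.service.account.json.keyfile</name>
      <value>/path/to/service-account.json</value>
    </property>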

How can I check if someone is using a cluster with databricks-connect?

When someone is connected to a Databricks cluster, I can see in the cluster details that the cluster is active and has notebooks attached.
But when I'm using the cluster with databricks-connect, the cluster doesn't show as being in use.
Is there a way to check if someone is connected to the cluster with databricks-connect?
You can see that in the cluster's Spark UI: in the Jobs tab, the description of each executed job will show a message like DB Connect execution from <user-name>.

Is there a way to share data stored in Hazelcast server members (A, B, C) in AKS cluster 1 with Hazelcast server members (D, E, F) in AKS cluster 2?

I have 3 nodes (members) of Hazelcast server and 3 nodes (members) of Hazelcast client running in a Kubernetes cluster in the east region.
I have 3 nodes (members) of Hazelcast server and 3 nodes (members) of Hazelcast client running in a Kubernetes cluster in the west region.
My use case is to store data in both the east and west region Kubernetes clusters, so that if either region goes down we can get the data from the other.
I am using Azure Kubernetes Service, and the namespace name is the same in both regions' clusters.
Any help will be appreciated. Thanks
This is the use case that the WAN replication feature was designed for; it provides disaster recovery capabilities by replicating the content of a cluster (either completely, or just selected IMaps) to a remote cluster. WAN replication is an Enterprise feature, so it isn't available in the Open Source distribution.
If you're trying to do something similar in Open Source, you could write MapListeners that observe all changes made on one cluster and then send the changes to the remote cluster. The advantage of WAN replication (other than not having to write it yourself) is that it has optimizations to batch writes together, configuration options to help trade off performance vs. consistency, and very efficient delta sync logic to help efficiently resynchronize clusters after a network or other outage.
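A minimal sketch of that MapListener approach, assuming Hazelcast 4.x/5.x package names; the remote address and map name are placeholders, and a real implementation would still need the batching, retry, and loop-avoidance logic that WAN replication provides out of the box:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.core.EntryEvent;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;
    import com.hazelcast.map.listener.EntryAddedListener;
    import com.hazelcast.map.listener.EntryRemovedListener;
    import com.hazelcast.map.listener.EntryUpdatedListener;

    public class CrossRegionForwarder implements
            EntryAddedListener<String, String>,
            EntryUpdatedListener<String, String>,
            EntryRemovedListener<String, String> {

        private final IMap<String, String> remoteMap;

        CrossRegionForwarder(IMap<String, String> remoteMap) {
            this.remoteMap = remoteMap;
        }

        // Forward every local change to the same map on the remote cluster.
        @Override
        public void entryAdded(EntryEvent<String, String> e) {
            remoteMap.set(e.getKey(), e.getValue());
        }

        @Override
        public void entryUpdated(EntryEvent<String, String> e) {
            remoteMap.set(e.getKey(), e.getValue());
        }

        @Override
        public void entryRemoved(EntryEvent<String, String> e) {
            remoteMap.delete(e.getKey());
        }

        public static void main(String[] args) {
            // Member of the local (east) cluster.
            HazelcastInstance local = Hazelcast.newHazelcastInstance();

            // Client connected to the remote (west) cluster; the address is a
            // placeholder for however that cluster is exposed (e.g. a load balancer).
            ClientConfig cfg = new ClientConfig();
            cfg.getNetworkConfig().addAddress("west-cluster:5701");
            HazelcastInstance remote = HazelcastClient.newHazelcastClient(cfg);

            IMap<String, String> localMap = local.getMap("data");
            IMap<String, String> remoteMap = remote.getMap("data");

            // 'true' means events carry the value, so it can be forwarded.
            localMap.addEntryListener(new CrossRegionForwarder(remoteMap), true);
        }
    }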

How to design an Azure HDInsight cluster

I have a question about Azure HDInsight. How should I design an Azure HDInsight cluster based on my on-premises infrastructure?
What are the major parameters I need to consider before designing the cluster?
For example, if I have 100 servers running on-premises, how many nodes do I need to select for my cloud cluster? In AWS we have the EMR sizing calculator and Cluster Planner/Advisor. Is there a similar planning mechanism in Azure apart from the Pricing Calculator? Please clarify and provide your input; an example would be really great. Thanks.
Before deploying an HDInsight cluster, plan for the desired cluster capacity by determining the needed performance and scale. This planning helps optimize both usability and costs. Some cluster capacity decisions cannot be changed after deployment. If the performance parameters change, a cluster can be dismantled and re-created without losing stored data.
The key questions to ask for capacity planning are:
In which geographic region should you deploy your cluster?
How much storage do you need?
What cluster type should you deploy?
What size and type of virtual machine (VM) should your cluster nodes use?
How many worker nodes should your cluster have?
Each of these questions is addressed in "Capacity planning for HDInsight clusters".
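To show how those decisions map to concrete settings, here is a hedged sketch of creating a cluster with the Azure CLI; the names, region, sizes, and counts are placeholders, and exact flags can vary by CLI version:

    az hdinsight create \
      --name my-spark-cluster \
      --resource-group my-rg \
      --location eastus2 \
      --type spark \
      --workernode-count 4 \
      --workernode-size Standard_D13_V2 \
      --headnode-size Standard_D12_V2 \
      --storage-account mystorageaccount \
      --http-user admin \
      --http-password '<password>'

Region, cluster type, VM sizes, and worker-node count correspond directly to the planning questions above; only the worker-node count is easily changed after deployment.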

Setting up a multi-datacenter cluster using OpsCenter

Can we create a multi-datacenter cluster using OpsCenter alone? I am able to create one ring, but it is not clear how I can specify the datacenter settings for the nodes in the second ring.
Currently it's not possible to create a multi-datacenter cluster with OpsCenter. It can manage such a cluster if you create it yourself, but it cannot create one.
Here are the relevant docs for doing a multi-DC install:
Cassandra: Initializing a multiple node cluster (multiple data centers)
DSE: Multiple data center deployment per workload type
FYI regarding @phact's comment above: OpsCenter does automatically create separate logical datacenters when using DataStax Enterprise, separating the workload into Cassandra, Solr, and Analytics DCs. However, it does not support creating multiple Cassandra-only datacenters, for example.
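As a taste of what those docs walk through, the usual multi-DC building block is the GossipingPropertyFileSnitch plus a per-node cassandra-rackdc.properties; a minimal sketch (DC and rack names are placeholders):

    # cassandra.yaml, on every node
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties, on nodes in the first datacenter
    dc=DC1
    rack=RACK1

    # cassandra-rackdc.properties, on nodes in the second datacenter
    dc=DC2
    rack=RACK1

Each node advertises its own DC and rack via gossip, which is how the second ring's nodes get their datacenter settings, independently of OpsCenter.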
