Deploying Apache Kafka on Azure

I have read about Event Hubs, HDInsight, and deploying Kafka on IaaS in an Availability Set.
I need to know the requirements for implementing Kafka on AKS.
How can I work out how many nodes are needed? I also want to know how to calculate billing.
Finally, I want to compare the three options I mentioned against Kafka on AKS.

The number of nodes depends on your requirements in terms of load, throughput, and so on. If you are going to choose Apache Kafka on AKS, I would suggest using the Strimzi project (https://strimzi.io/) to deploy it fairly easily. I also wrote a simple demo for a session I gave about this (https://github.com/ppatierno/strimzi-aks).
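To make that concrete, here is a minimal sketch of the Strimzi quick-start on AKS. The resource group, cluster name, node count, and VM size are placeholders you would size from your load; they also drive the billing, since you pay for the underlying VM nodes, disks, and egress. Check strimzi.io for the current install URLs.

# Create the AKS cluster; node count and VM size determine both capacity and cost
az aks create --resource-group my-kafka-rg --name my-kafka-aks \
  --node-count 3 --node-vm-size Standard_DS3_v2
az aks get-credentials --resource-group my-kafka-rg --name my-kafka-aks

# Install the Strimzi operator and deploy a Kafka cluster from its example manifest
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
kubectl apply -f https://strimzi.io/examples/latest/kafka/kafka-persistent-single.yaml -n kafka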

Related

Apache Spark with Kafka: do I need multiple servers for a customer-hosted solution?

I'm new to both Spark and Kafka.
I need to create an event listener and then process the data to feed a Power BI report. The solution would be hosted in the customer's Azure environment, so I'm wondering whether a local configuration would do or whether I need a cluster. I would like to keep things as simple as possible.
How many servers would I need as a minimum? The amount of data to process would be around 200 GB per year.
Also, is there a recommended way to deploy the solution to the customer's tenant? Its only purpose would be to run in the backend, receive notifications, and process the data with Spark.

How to design an Azure HDInsight cluster

I have a query on Azure HDInsight: how should I design an Azure HDInsight cluster based on my on-premises infrastructure?
What are the major parameters I need to consider before designing the cluster?
For example, if I have 100 servers running on-premises, how many nodes should I select in my cloud cluster? In AWS we have an EMR sizing calculator and a Cluster Planner/Advisor. Do we have any similar planning mechanism in Azure apart from the Pricing Calculator? Please clarify and provide your input; an example would be really great. Thanks.
Before deploying an HDInsight cluster, plan for the desired cluster capacity by determining the needed performance and scale. This planning helps optimize both usability and costs. Some cluster capacity decisions cannot be changed after deployment. If the performance parameters change, a cluster can be dismantled and re-created without losing stored data.
The key questions to ask for capacity planning are:
In which geographic region should you deploy your cluster?
How much storage do you need?
What cluster type should you deploy?
What size and type of virtual machine (VM) should your cluster nodes use?
How many worker nodes should your cluster have?
Each of these questions is addressed in "Capacity planning for HDInsight clusters".
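As a hedged illustration of how those answers become concrete settings, the command below uses placeholder values rather than recommendations (flag names as in the current az hdinsight CLI; run az hdinsight create --help to confirm them):

# Region, cluster type, worker-node count, VM size, and storage are all fixed at creation time
az hdinsight create \
  --resource-group my-rg --name my-hdi-kafka \
  --location westeurope \
  --type kafka \
  --workernode-count 4 \
  --workernode-size Standard_D3_v2 \
  --workernode-data-disks-per-node 2 \
  --storage-account mystorageacct \
  --http-user admin --http-password '<cluster-password>' \
  --ssh-user sshuser --ssh-password '<ssh-password>'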

On which nodes should distributed Kafka Connect be deployed for Kafka on Azure HDInsight?

We are running a lot of connectors on-premises and we need to move to Azure. These on-premises machines run the Kafka Connect API on 4 nodes. We deploy this API by executing the following on all of these machines:
export CLASSPATH=/path/to/connectors-jars
/usr/hdp/current/kafka-broker/bin/connect-distributed.sh distributed.properties
We have Kafka deployed on Azure HDInsight. We need at least 2 nodes running the distributed Connect API, and we don't know where to deploy them:
On head nodes (we still don't know what these are for)
On worker nodes (where the Kafka brokers live)
On edge nodes
We also have Azure AKS running containers. Should we deploy the distributed Connect API on AKS?
where the Kafka brokers live
Ideally, no. Connect uses lots of memory when batching lots of records. That memory is better left to the page cache for the broker.
On edge nodes
Probably not. That is where your users interact with your cluster. You wouldn't want them poking at your configurations or accidentally messing up the processes in other ways. For example, we had someone fill up an edge node's local disk because they were copying large volumes of data in and out of the "edge".
On head nodes
Maybe? But then again, those are only for cluster admin services, and probably have little memory.
A better solution: run dedicated instances outside of HDInsight in Azure that only run Kafka Connect, perhaps as containers in Kubernetes, since they are completely stateless services and only need access to your sources, sinks, and Kafka brokers for transferring data. That way, they can be upgraded and configured separately from what Hortonworks and HDInsight provide.
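A minimal sketch of that last suggestion, assuming the Confluent cp-kafka-connect image (the image tag, replica count, broker addresses, and topic names are illustrative, and the image also expects converter and REST-advertised-host settings that are omitted here):

# Run two stateless distributed Connect workers as a Kubernetes Deployment
kubectl create deployment kafka-connect \
  --image=confluentinc/cp-kafka-connect:7.6.0 --replicas=2

# Point the workers at the brokers; Connect keeps its own state (connector
# configs, offsets, status) in Kafka topics, so the pods stay stateless
kubectl set env deployment/kafka-connect \
  CONNECT_BOOTSTRAP_SERVERS='broker-0:9092,broker-1:9092' \
  CONNECT_GROUP_ID=connect-cluster \
  CONNECT_CONFIG_STORAGE_TOPIC=connect-configs \
  CONNECT_OFFSET_STORAGE_TOPIC=connect-offsets \
  CONNECT_STATUS_STORAGE_TOPIC=connect-status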

Is Kubernetes + Docker + AWS = Azure + Service Fabric?

I see advantages of Kubernetes, which include rolling deployments, automatic health-check monitoring, and swinging a new server into action when an existing one fails. I also understand that Kubernetes is not just for Docker.
So that brings up a couple of questions!
When Azure and Service Fabric can provide all that I described (and more), why would I need Kubernetes?
Would it make sense for one to use Kubernetes along with Service Fabric for large scale deployments on Azure?
Let's look first at the similarities between Kubernetes and Service Fabric.
They are both cloud-agnostic clustering, orchestration, and scheduling software.
They can both be deployed manually, by you, to any set of VMs, anywhere.
There are "managed" offerings for both, meaning a cloud provider like Azure or Google Cloud will host a cluster for you, but generally you still own the VMs.
They both deploy and manage containers.
They both have rich management operations, such as rolling upgrades, health checks, and self-healing capabilities.
That's a fairly high-level view but should give you an idea of what and where you can run with each.
Now let's look where they're different. There are a ton of small differences, but I want to focus on two of the really big conceptual differences:
Application model:
Service Fabric allows you to orchestrate any arbitrary container or EXE (whether that's a small node.js app or a giant legacy application), and in that sense it is similar to Kubernetes. But overall it is more focused on application development specifically, with programming models that are integrated with the platform. In this respect, it is more closely comparable to Cloud Foundry than Kubernetes.
Kubernetes is focused more on orchestrating infrastructure for an application. It doesn't really focus on how you write your application. That's up to you to figure out; Kubernetes just wants a container to run, doesn't matter what's in it.
State management
Kubernetes allows you to deploy stateful software by providing persistent disk storage volumes to containers and assigning unique identifiers to pods. This lets you deploy things like ZooKeeper or MySQL (see the sketch at the end of this answer).
Service Fabric is itself stateful software: it is designed as a stateful, data-aware platform. It provides HA state and scale-out primitives. So while Kubernetes allows you to deploy stateful things, Service Fabric allows you to build stateful things. This is one of the key differences that is often overlooked. For example:
On Kubernetes, you can deploy ZooKeeper.
On Service Fabric, you can actually build ZooKeeper yourself using Service Fabric's replication and leader election primitives.
Kubernetes uses etcd for distributed, reliable storage about the state of the cluster.
Service Fabric doesn't need etcd, because Service Fabric itself is a distributed, reliable storage platform. The system services in Service Fabric make use of this to reliably store the state of the cluster. This makes Service Fabric entirely self-contained.
The fact that Service Fabric is a stateful platform is key to understanding it and how it differs from other major orchestrators. Everything it does - scheduling, health checking, rolling upgrades, application versioning, failover, self-healing, and so on - is designed around the fact that it is managing replicated and distributed data that needs to be consistent and highly available at all times.
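As a sketch of what "deploying stateful things" on Kubernetes looks like in practice (the names, image tag, and storage size below are illustrative, and the matching headless Service is omitted): a StatefulSet gives each pod a stable identity (zk-0, zk-1, ...) and its own persistent volume, which is what makes running something like ZooKeeper feasible.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless      # stable per-pod DNS names such as zk-0.zk-headless
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.8
        ports:
        - containerPort: 2181
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:         # one persistent disk per pod, reattached on rescheduling
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
EOF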
Please find below a good comparison article about the differences between ACS and Azure Service Fabric:
https://blogs.msdn.microsoft.com/maheshkshirsagar/2016/11/21/choosing-between-azure-container-service-azure-service-fabric-and-azure-functions/
Could you please clarify what you are referring to when you mention "AWS"?
At the developer level, the solution can be stateful in both cases, but there is a major difference from an infrastructure point of view:
Docker + Kubernetes is an IaaS-oriented solution.
Azure Service Fabric (if you are using the Azure service) is a PaaS solution.
IaaS is, in general, more costly and has a more significant maintenance cost.
From a support point of view:
Azure Service Fabric is supported by Microsoft
Docker and Kubernetes are more open-source oriented.
Hope this helps.
Best regards

Multi regional Azure Container Service DC/OS clusters

I'm experimenting a little with ACS using the DC/OS orchestrator, and while spinning up a cluster within a single region seems simple enough, I'm not quite sure what the best practice would be for doing deployments across multiple regions.
Azure itself does not seem to support deploying to more than one region right now. With that assumption, I guess my only other option is to create multiple, identical clusters in all the regions I wish to be available, and then use Azure Traffic Manager to route incoming traffic to the nearest available cluster.
While this solution works, it also causes a few issues I'm not 100% sure on how I should work around.
Our deployment pipelines must make sure to deploy to all regions when deploying a new version of a service. If we have an East US and a North Europe region, during deployments from our CI tool I have to connect to the Marathon API in both regions to trigger the new deployments. If the deployment fails in one region and succeeds in the other, I suddenly have a disparity between the two regions.
If I have a service using local persistent volumes deployed, say PostgreSQL or Elasticsearch, it needs to have instances in both regions, since service discovery will only find services local to the region. That raises the problem of replication between regions to keep all state in all regions; this seems to require some, or a lot of, manual configuration to get working.
Has anyone ever used a setup somewhat like this using Azure Container Service (or really Amazon Container Service, as I assume the same challenges can be found there) and have some pointers on how to approach this?
You have multiple options for spinning up across regions. I would use a custom installation together with terraform for each of them. This here is a great starting point: https://github.com/bernadinm/terraform-dcos
Distributing agents across different regions should be no problem, ensuring that your services will keep running despite failures.
Distributing masters (giving you control over the services during failures) is a little more difficult, as it involves distributing a ZooKeeper quorum across high-latency links, so you should be careful in choosing the "distance" between regions.
Have a look at the documentation for more details.
You are correct, ACS does not currently support multi-region deployments.
Your first issue is specific to Marathon in DC/OS, I'll ping some of the engineering folks over there to see if they have any input on best practice.
Your second point is something we (I'm the ACS PM) are looking at. There are some solutions you can use in certain scenarios (e.g. ArangoDB is in the DC/OS universe and will provide replication). The DC/OS team may have something to say here too. In ACS we are evaluating the best approaches to providing solutions for this use case but I'm afraid I can't give any indication of timeline.
An alternative solution is to have your database in a SaaS offering. This takes away all the complexity of managing redundancy and replication.
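For the Traffic Manager part of the questioner's setup, the wiring itself is straightforward. A hedged sketch, where the profile name, DNS label, regions, and endpoint targets are placeholders for whatever each regional cluster exposes:

# One Traffic Manager profile that routes clients to the closest healthy cluster
az network traffic-manager profile create \
  --resource-group my-rg --name dcos-global \
  --routing-method Performance --unique-dns-name my-dcos-global

# One endpoint per regional cluster (public DNS name of each cluster's agent load balancer)
az network traffic-manager endpoint create \
  --resource-group my-rg --profile-name dcos-global \
  --name eastus --type externalEndpoints \
  --target my-cluster-eus.eastus.cloudapp.azure.com --endpoint-location eastus
az network traffic-manager endpoint create \
  --resource-group my-rg --profile-name dcos-global \
  --name northeurope --type externalEndpoints \
  --target my-cluster-neu.northeurope.cloudapp.azure.com --endpoint-location northeurope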
