Google Cloud DataFlow for NRT data application

I'm evaluating Kafka/Spark/HDFS for developing an NRT (sub-second) Java application that receives data from an external gateway and publishes it to desktop/mobile clients (consumers) for various topics. At the same time, the data will be fed through streaming and batch (persistent) pipelines for analytics and ML.
For example, the flow would be:
A standalone TCP client reads streaming data from an external TCP server
The client publishes data to different topics based on the packets (Kafka) and passes it to the streaming pipeline for analytics (Spark); a minimal producer sketch follows this list
A desktop/mobile consumer app subscribes to various topics and receives NRT data events (Kafka)
The consumer also receives analytics from the streaming/batch pipelines (Spark)
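For the Kafka leg of step 2, a minimal producer sketch in Java could look like the following; the broker address, topic name, and payload are hypothetical placeholders, and the real client would derive the topic from each decoded packet:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GatewayPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1"); // lower latency than acks=all; trade-off vs. durability

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // In the real client, topic and payload would be derived from
            // each packet read off the TCP stream.
            String topic = "quotes";            // hypothetical per-packet topic
            String payload = "{\"price\": 42}"; // hypothetical decoded payload
            producer.send(new ProducerRecord<>(topic, payload));
        }
    }
}
```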
Kafka clusters have to be managed, configured and monitored for optimal performance and scalability. This may require additional personnel and tooling to manage the operation.
Kafka, Spark and HDFS can optionally be deployed over Amazon EC2 (or Google Cloud using connectors).
I was reading about Google Cloud Dataflow, Cloud Storage, BigQuery and Pub/Sub. Dataflow provides autoscaling and tools to monitor data pipelines in real time, which is extremely useful. But the setup has a few restrictions, e.g. Pub/Sub push requires the client to expose an HTTPS endpoint, and the app deployment needs a web server, e.g. an App Engine webapp or a web server on GCE.
This may not be as efficient (I'm concerned about the latency of HTTP) as deploying a bidirectional TCP/IP app that can leverage the Pub/Sub and Dataflow pipelines for streaming data.
Ideally, the preferred setup on Google Cloud would be to run the TCP client connecting to the external gateway on GCE, pushing data via Pub/Sub to the desktop consumer app. In addition, it would leverage the Dataflow pipeline for analytics, and Cloud Storage with Spark for ML (the Prediction API is a bit restrictive), using the Cloudera Spark connector for Dataflow.
One could deploy Kafka/Spark/HDFS etc. on Google Cloud, but that kinda defeats the purpose of leveraging Google Cloud's own technology.
I'd appreciate any thoughts on whether the above setup is possible using Google Cloud, or whether I should stay with EC2/Kafka/Spark etc.

Speaking about the Cloud Pub/Sub side, there are a couple of things to keep in mind:
If you don't want to have to run a web server in your subscribers, you could consider using the pull-based subscriber instead of the push-based one. To minimize latency, you want to have at least a few outstanding pull requests at any time (see the sketch after this list).
Having your desktop consumer app act as a subscriber to Pub/Sub directly will only work if you have no more than 10,000 clients; there is a limit of 10,000 subscriptions. If you need to scale beyond that, you should consider Google Cloud Messaging or Firebase.
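As a rough illustration of the pull-based approach, here is a sketch using the current google-cloud-pubsub Java client (which post-dates this answer); `setParallelPullCount` keeps several pulls outstanding, per the latency advice above. The project and subscription names are placeholders:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class DesktopConsumer {
    public static void main(String[] args) {
        // Hypothetical project and subscription names
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "quotes-sub");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            System.out.println("Received: " + message.getData().toStringUtf8());
            consumer.ack();
        };

        // setParallelPullCount keeps several streaming pulls outstanding,
        // which helps minimize delivery latency.
        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                .setParallelPullCount(4)
                .build();
        subscriber.startAsync().awaitRunning();
    }
}
```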

From the Dataflow side of things, this sounds like a good fit, particularly as you'll be mixing streaming and batch style analytics. If you haven't yet, check out our Mobile Gaming walkthrough.
I'm not quite sure what you mean about using Cloudera's Dataflow/Spark runner for ML. That runner runs Dataflow code on Spark, but not the other way around.
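To give a flavor of the streaming-analytics side, here is a minimal pipeline of the kind described, sketched with the Apache Beam Java SDK (the successor to the Dataflow SDK); the Pub/Sub topic is a placeholder and the output sink is left open:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class StreamingCounts {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read the event stream, window it, and count occurrences per element.
        p.apply(PubsubIO.readStrings()
                .fromTopic("projects/my-project/topics/quotes")) // placeholder topic
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
         .apply(Count.perElement());
        // ...add a sink here, e.g. BigQuery or Cloud Storage

        p.run();
    }
}
```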

Related

Azure Event Hub vs Kafka as a Service Broker

I'm evaluating the use of Azure Event Hub vs Kafka as a service broker. I was hoping to create two local apps side by side, one consuming messages using Kafka and the other using Azure Event Hub. I've got a Docker container set up running a Kafka instance, and I'm in the process of setting up Azure Event Hub using my Azure account (as far as I know there's no other way to create a local/development instance for Azure Event Hub).
Does anyone have any information regarding the two that might be useful when comparing their features?
I can't add a comment directly, but the currently top-rated answer has the line
Kafka can have multiple topics each Azure Event Hub is a single topic.
This is misleading as it makes it sound like you can't have multiple topics, which you can.
As per https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview#kafka-and-event-hub-conceptual-mapping an "Event Hub" is a topic while an "Event Hub Namespace" is the Kafka cluster.
This decision is usually driven by a broader architectural choice: if you are choosing Azure as your IaaS and PaaS solution, then Event Hubs provides great integration within the Azure ecosystem, but if you don't want vendor lock-in, Kafka is the better option.
Operationally, if you want a fully managed service, Event Hubs offers that out of the box, while with Kafka you can also get this with the Confluent platform.
Maturity-wise, Kafka is older, and with its large community you have broader support.
Feature-wise, the Azure ecosystem covers what the Kafka ecosystem provides, but Event Hubs on its own lacks a few features compared to Kafka.
I think this link can help you extend your understanding: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
While Apache Kafka is software you typically need to install and operate, Event Hubs is a fully managed, cloud-native service. There are no servers, disks, or networks to manage and monitor and no brokers to consider or configure, ever. You create a namespace, which is an endpoint with a fully qualified domain name, and then you create Event Hubs (topics) within that namespace. For more information about Event Hubs and namespaces, see Event Hubs features.

As a cloud service, Event Hubs uses a single stable virtual IP address as the endpoint, so clients don't need to know about the brokers or machines within a cluster. Even though Event Hubs implements the same protocol, this difference means that all Kafka traffic for all partitions is predictably routed through this one endpoint rather than requiring firewall access for all brokers of a cluster.

Scale in Event Hubs is controlled by how many throughput units you purchase, with each throughput unit entitling you to 1 megabyte per second, or 1000 events per second, of ingress and twice that volume in egress. Event Hubs can automatically scale up throughput units when you reach the throughput limit if you use the Auto-Inflate feature; this feature also works with the Apache Kafka protocol support.
You can find more on feature comparison here - https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
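Since Event Hubs exposes a Kafka-compatible endpoint on port 9093, an existing Kafka client can be pointed at a namespace through configuration alone. A hedged Java sketch, where the namespace and connection string are placeholders for your own:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventHubsKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // NAMESPACE and the connection string below are placeholders.
        props.put("bootstrap.servers", "NAMESPACE.servicebus.windows.net:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"$ConnectionString\" "
                + "password=\"<your-event-hubs-connection-string>\";");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "my-event-hub" is the Event Hub (i.e. topic) inside the namespace.
            producer.send(new ProducerRecord<>("my-event-hub", "hello"));
        }
    }
}
```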
Kafka can have multiple topics each Azure Event Hub is a single topic. Kafka running inside a container means you have to manage it. Azure Event Hub is a PaaS, which means they manage the platform side. If you don't know how to make Kafka redundant, reliable, and scalable, you may want to go with Azure Event Hubs or any PaaS that offers a similar pub/sub model. The Event Hubs platform is already scalable, reliable, and redundant.
You should compare:
the administration capabilities / effort (as previously said)
the functional capabilities, such as competing consumers and pub/sub patterns
the performance: you should consider Kafka if you plan to exceed the Event Hubs quotas

How to apply load test to Azure Event Hubs triggered consumer function with JMeter to find out consumer's processing speed to egress speed ratio?

I have an Event Hubs application that I haven't published to Azure yet, so it is running locally with the Azure emulator.
My consumer is an Azure Event Hubs triggered function, also running locally. So there are no multiple function instances; just one function instance backed by a single EventProcessorHost instance, and there are 2 partitions in the Event Hub.
The consumer does not do any map/filter operation; it only saves the incoming message to Azure Blob Storage.
I want to do load/stress testing, probably with JMeter, to find out and compare the consumer's processing time with the egress (the flow of data to the consumer from the Event Hubs broker). This way I hope to see whether my consumer is falling behind, so the ratio of the consumer's processing speed to the egress speed is the reason for the load test.
I haven't done load testing with JMeter or any other tool before, but I have learned that you can use it to test things from simple APIs to JMS message brokers. API or web page load testing makes sense and looks straightforward, but I'm struggling to fully grasp how to test pub/sub programs. There are some tutorials like Apache Kafka - How to Load Test with JMeter that I have read, but for the .NET side and Event Hubs there are no descriptive tutorials that make sense.
Here are my questions:
Is it right to run the load test locally, with the consumer function running in the Azure emulator and saving to cloud Blob Storage, or should I publish the function to Azure and run the test in the cloud? Does it affect the results?
I couldn't find any JMeter tutorials specific to Event Hubs, so how can I do this with JMeter? Do I have to write a wrapper program?
If JMeter isn't suitable for this task, how else can I do the load testing?
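On question 3: one alternative to JMeter is a small purpose-built load generator. A rough Java sketch with the azure-messaging-eventhubs client follows (the .NET SDK has the same shape); the connection string, hub name, and batch counts are placeholders. Comparing its measured send rate against the function's processing rate gives the ratio you're after.

```java
import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

public class EventHubLoadGenerator {
    public static void main(String[] args) {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<connection-string>", "<event-hub-name>")
                .buildProducerClient();

        long start = System.nanoTime();
        int sent = 0;
        for (int b = 0; b < 100; b++) {            // 100 batches as a placeholder
            EventDataBatch batch = producer.createBatch();
            for (int i = 0; i < 1000; i++) {
                EventData event = new EventData("{\"seq\":" + sent + "}");
                if (!batch.tryAdd(event)) break;   // batch is full
                sent++;
            }
            producer.send(batch);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Sent %d events in %.1fs (~%.0f events/s)%n",
                sent, seconds, sent / seconds);
        producer.close();
    }
}
```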

Choosing real time azure services hdinsight Kafka or service bus?

I am evaluating message streaming services on Azure. I want the most reliable real-time message processing service, where messages carry a high degree of importance and data must not be lost. Basically, I want to make real-time data transmitted from a third-party cloud available to the API I have hosted on Azure (I have exposed the API to the third party so that they can send data).
Here are the options I have considered.
Event Hubs and IoT Hub are used mostly for telemetry data/events, so I am excluding those; in my use case each message carries great value.
That leaves Service Bus or Kafka on HDInsight.
Now, Service Bus offers more features compared to Kafka and also provides very good documentation on how to use it.
But in the documentation I couldn't find anywhere that Service Bus is used for real-time processing, whereas documentation is available stating to use Kafka for real-time processing:
https://learn.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/real-time-ingestion
Which would be the best service among the above for my use case? Is there any other, better option that I have not thought of?
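For the "data must not be lost" requirement, the relevant Service Bus feature is peek-lock settlement: a message is only removed from the queue once the consumer explicitly completes it, so a crash before completion causes redelivery. A hedged Java sketch with the azure-messaging-servicebus client, where the connection string and queue name are placeholders:

```java
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public class ReliableConsumer {
    public static void main(String[] args) throws InterruptedException {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
                .connectionString("<connection-string>")
                .processor()
                .queueName("orders")               // placeholder queue name
                .disableAutoComplete()             // settle messages explicitly
                .processMessage(context -> {
                    // Persist the message first, then complete it; if this
                    // crashes before complete(), the broker redelivers.
                    System.out.println(context.getMessage().getBody());
                    context.complete();
                })
                .processError(context ->
                        System.err.println("Error: " + context.getException()))
                .buildProcessorClient();

        processor.start();
        Thread.sleep(60_000);   // run for a minute in this sketch
        processor.close();
    }
}
```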

Azure Functions vs Azure Stream Analytics

I noticed that both Azure Functions and Azure Stream Analytics can take an input, modify or transform that input, and put it into an output.
When would I use one versus the other? Are there any general rules I can use to decide?
I tried looking at the pricing of each to guide me, but I'm having trouble discerning how my logic would affect the compute time cost of Functions versus the App service plan cost of Functions versus the streaming unit cost of Stream Analytics.
Azure Stream Analytics is a real-time analytics service which can "run massively parallel real-time analytics on multiple IoT or non-IoT streams of data", whereas Azure Functions is a (serverless) service to host functions (little pieces of code) that can be used for, e.g., event-driven applications.
A general rule is always difficult since everything depends on your requirements, but I would say: if you have to analyze a data stream, take a look at Azure Stream Analytics, and if you want to implement something like a serverless event-driven or timer-based application, check out Azure Functions or Logic Apps.
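To make the Functions side concrete, here is a minimal Java Azure Function triggered by an Event Hub, i.e. the event-driven case; the hub name and the connection app setting are placeholders:

```java
import java.util.List;

import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.Cardinality;
import com.microsoft.azure.functions.annotation.EventHubTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;

public class TransformFunction {
    // "telemetry" and "EventHubConnection" are placeholder names.
    @FunctionName("TransformTelemetry")
    public void run(
            @EventHubTrigger(name = "events", eventHubName = "telemetry",
                    connection = "EventHubConnection", cardinality = Cardinality.MANY)
            List<String> events,
            final ExecutionContext context) {
        for (String e : events) {
            // Per-event transform logic goes here.
            context.getLogger().info("Transformed: " + e.toUpperCase());
        }
    }
}
```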

Biztalk vs Azure Service Bus

I am looking for a solution for real-time data integration between a few on-premise databases. There is not much transformation of data involved.
I am evaluating the various ESBs available, and I am thinking that data integration using Azure Service Bus would be quick to develop. Is it advisable to use Azure Service Bus for integration of all the on-premise databases?
Unless there is extraordinary complexity in the integration, BizTalk is probably not the right tool for the job here. On the other hand, sending data out to the cloud just to transform it back to another database (on the same LAN?) is also not the right approach - this will introduce latency and traffic cost.
(Near) Real Time integration of databases sounds like a job for something like:
SQL Server Integration Services (SSIS)
If the DBs are SQL Server and the schemas are similar, SQL Server Replication
Similar technologies exist for other RDBMS platforms, e.g. Oracle Streams
If you really want to build a service bus, either build a local AMQP-based bus as Sam suggests (e.g. Windows Service Bus or RabbitMQ), or buy an existing product (NServiceBus etc.).
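For the local-AMQP-bus option, a minimal RabbitMQ publisher sketch in Java (the queue name and payload are placeholders); the broker stays on the LAN, avoiding the round trip to the cloud:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class LocalBusPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");   // the broker stays on the local network

        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // Durable queue so messages survive a broker restart.
            channel.queueDeclare("db-sync", true, false, false, null);
            channel.basicPublish("", "db-sync", null,
                    "{\"table\":\"orders\",\"op\":\"update\"}".getBytes("UTF-8"));
        }
    }
}
```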
If you have all your applications on-prem, you introduce an extra risk by moving your integration layer to the cloud (suddenly your internet connection could bring down your integration layer).
The good news is that you can use Service Bus for Windows Server, which you run locally (even with the Windows Azure Pack!).
It offers the same programming model and similar messaging features, so that might be a good option.
Comparing with BizTalk... Service Bus is lightweight, messaging only. BizTalk provides much richer features (transformations, pipelines, BAM, business rules, adapters).
Good luck
If you are only looking to integrate between a few on-premise databases, then you might consider using SQL Server's Service Broker (http://msdn.microsoft.com/en-gb/library/bb522893.aspx).
It provides a reliable, asynchronous way of passing data between databases in a real-time way. It can manage message order and can have numerous conversations running at the same time on the same queue, each processed by its own instance of the receiver.
There's a good overview here...
http://technet.microsoft.com/en-us/library/ms166104(v=sql.105).aspx
