I noticed that both Azure Functions and Azure Stream Analytics can take an input, modify or transform that input, and put it into an output.
When would I use one versus the other? Are there any general rules I can use to decide?
I tried looking at the pricing of each to guide me, but I'm having trouble working out how my logic would affect the compute-time cost of Functions (or the App Service plan cost, if hosted that way) versus the streaming unit cost of Stream Analytics.
Azure Stream Analytics is a real-time analytics service that can "run massively parallel real-time analytics on multiple IoT or non-IoT streams of data", whereas Azure Functions is a (serverless) service for hosting functions (small pieces of code) that can be used for, e.g., event-driven applications.
A general rule is always difficult since everything depends on your requirements, but I would say: if you need to analyze a data stream, take a look at Azure Stream Analytics; if you want to implement something like a serverless event-driven or timer-based application, look at Azure Functions or Logic Apps.
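To make the distinction concrete, here is a rough sketch of the event-driven Azure Functions side in TypeScript. It assumes the Node.js v3 programming model with an Event Hub trigger declared in function.json; the names are illustrative, not from the question.

    import { AzureFunction, Context } from "@azure/functions";

    // Minimal event-driven function: function.json (not shown) binds
    // "eventHubMessages" to an Event Hub trigger.
    const eventHubTrigger: AzureFunction = async function (
      context: Context,
      eventHubMessages: unknown[]
    ): Promise<void> {
      for (const message of eventHubMessages) {
        // Per-event logic goes here, e.g. filtering or enrichment.
        context.log(`Processed: ${JSON.stringify(message)}`);
      }
    };

    export default eventHubTrigger;

Stream Analytics, by contrast, is configured with inputs, outputs and a SQL-like query rather than code you host yourself.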
Currently we are using a Blob-triggered Azure Function to move JSON data into Cosmos DB. We are planning to replace the Azure Function with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I'm not sure: would an ADF pipeline be the better option or not?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
Cost: it is too expensive; ADF costs considerably more than Azure Functions.
Custom logic: ADF is not built for cleansing logic or arbitrary custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
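For reference, the Blob-triggered setup described in the question stays very small in code. A rough sketch in TypeScript, assuming the Node.js v3 programming model with the Blob trigger and Cosmos DB output binding declared in function.json (names are illustrative):

    import { AzureFunction, Context } from "@azure/functions";

    // function.json (not shown) declares a blobTrigger input named "inputBlob"
    // and a cosmosDB output binding named "outputDocument".
    const blobToCosmos: AzureFunction = async function (
      context: Context,
      inputBlob: Buffer
    ): Promise<void> {
      // Parse the incoming JSON blob and hand it to the Cosmos DB output binding.
      context.bindings.outputDocument = JSON.parse(inputBlob.toString("utf8"));
    };

    export default blobToCosmos;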
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure a Cosmos DB output and an Azure Blob Storage input.
The advantage over an Azure Function is that you don't need to write any custom code unless data cleansing is involved, so Azure Data Factory is the recommended option; and even if you want an Azure Function for other purposes, you can invoke it from within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Functions as a Service) and are best used for short-lived executions; Functions that run for many seconds become far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for large data volumes will be lower than Azure Functions. You can also integrate Spark processing pipelines into ADF for more advanced ingestion.
Moreover, it depends on your situation. Azure Functions are serverless, lightweight processes meant for quick responses to individual events, rather than the volumetric workloads that batch processes handle.
So, if your requirement is to respond quickly to an event carrying a small amount of data, stay with Azure Functions; if you need batch processing, switch to ADF.
Cost
The numbers below are taken from here.
Let's calculate the cost:
If your file is large, the copy takes about 43:51 hours ≈ 43.867 h:
4 (DIU) × 43.867 (h) × 0.25 ($/DIU-hour) = $43.867
$43.867 / 7.514 GB ≈ 5.838 $/GB
If your file is small (2.497 MB), the copy takes about 45 seconds (rounded up to one minute, i.e. 1/60 h):
4 (DIU) × 1/60 (h) × 0.25 ($/DIU-hour) ≈ $0.0167
2.497 MB / 1024 = 0.00244 GB
$0.0167 / 0.00244 ≈ 6.844 $/GB
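Generalizing the two cases above, the effective price of a single copy activity per gigabyte is roughly

    \text{price per GB} = \frac{\mathrm{DIU} \times \text{copy duration (h)} \times \text{price per DIU-hour}}{\text{data size (GB)}}

which is why the small file works out to a higher per-GB price above.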
Scale
The maximum number of instances an Azure Function app can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, only 1,500 copy activities were actually running in parallel. (This test wasted a lot of money.)
In the context of Azure IoT hub, when would one use Stream Analytics over Time Series Insights?
The product pages and documentation for both indicate they are heavily geared toward IoT/data applications. However, I'm not clear on the differences.
My use case involves both real-time monitoring and ETL analysis. Could (or even should) the two be used together?
One immediate difference I can see is that Time Series Insights stores the data whereas Stream Analytics (I think) would need the developer to integrate storage.
In short, Stream Analytics is about transforming, filtering and aggregating data, whereas Time Series Insights is about visualizing (stored) data.
Data passed through Stream Analytics is typically forwarded to resources like Power BI (for real-time monitoring) or to storage, such as a database, for later analysis or processing.
"One immediate difference I can see is that Time Series Insights stores the data whereas Stream Analytics (I think) would need the developer to integrate storage."
This is a correct statement. TSI is a data store, but its purpose is to create an environment to (visually) analyze that data. ASA cannot be used to analyze data on its own.
You could use ASA to transform the data and have it sent to an Event Hub. That same Event Hub can then be used as a data source for TSI.
I'm working on an IoT telemetry project that receives humidity and weather pollution data from different sites in the field. I will then apply machine learning to the collected data. I'm using Event Hubs and Stream Analytics. Is there a way to pull the data into Azure Machine Learning without the hassle of writing an application to get it from Stream Analytics and push it to an AML web service?
Stream Analytics has a feature called "Functions". You can call any web service you've published using AML from within your Stream Analytics query. Check this link for a tutorial.
An example workflow in your case would be the following:
Telemetry arrives and reaches Stream Analytics.
Stream Analytics (SA) calls the Machine Learning function to apply it to the data.
SA routes the result to the output accordingly; here you can use Power BI to create a predictions dashboard.
Another way would be to use R; here's a good tutorial showing that: https://blogs.technet.microsoft.com/machinelearning/2015/12/10/azure-ml-now-available-as-a-function-in-azure-stream-analytics/ .
It is more work, of course, but it can give you more control since you control the code.
Yes,
This is actually quite easy as it is well supported by ASA.
You can call a custom Azure ML function from your ASA query once you have created that function in the portal.
See the following tutorial on how to achieve something like this.
Does Stream Analytics support input sources other than products in the Azure family?
For example, can I set up a REST endpoint and send events that way? Are there client libraries for Node.js?
Documentation is somewhat scant in this regard; I wanted to check here before assuming no on both fronts.
I believe the answer is no; Azure Stream Analytics does not currently support non-Azure sources.
One recommended approach is to write to an Azure Event Hub and then let Azure Stream Analytics read from there.
You can write to an Event Hub in Node.js:
http://hypernephelist.com/2014/09/16/sending-data-to-azure-event-hubs-from-nodejs.html
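The linked post predates the current SDK; with today's @azure/event-hubs package, a minimal producer might look like the sketch below (the connection string and hub name are placeholders):

    import { EventHubProducerClient } from "@azure/event-hubs";

    // Placeholders: substitute your Event Hubs namespace connection string and hub name.
    const connectionString = "<event-hubs-namespace-connection-string>";
    const eventHubName = "<event-hub-name>";

    async function main(): Promise<void> {
      const producer = new EventHubProducerClient(connectionString, eventHubName);

      // Batch a few telemetry events and send them in one call.
      const batch = await producer.createBatch();
      batch.tryAdd({ body: { deviceId: "sensor-1", temperature: 21.7 } });
      batch.tryAdd({ body: { deviceId: "sensor-2", temperature: 19.3 } });
      await producer.sendBatch(batch);

      await producer.close();
    }

    main().catch(console.error);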
A revision of my old answer.
As @PanagiotisKanavos said, Azure Stream Analytics (ASA) is just the processing engine, not the ingestion endpoint; ingestion is the job of Event Hubs, so ASA itself does not need non-Azure input sources, and the real question is how to feed data into it.
Event Hubs can be used by ASA and has a variety of client libraries that work on many different machines, form factors, operating systems and frameworks. Worst case, plain HTTP works as well; AMQP is not mandatory, but it is definitely ideal in terms of performance.
The correct route is PRODUCER -> Event Hub -> ASA or PRODUCER -> STORAGE -> ASA. So if there is a library on the device that supports writing to storage, that can work as well, but Event Hub is obviously the better choice.
Thanks a lot to @PanagiotisKanavos for the help.
Some circumstantial evidence below suggests that Azure does not support non-Azure services as inputs for Stream Analytics.
From the Stream Analytics Create Input REST API (https://msdn.microsoft.com/en-us/library/azure/dn835010.aspx), there are only three data sources: Event Hub, Blob Storage and IoT Hub.
Screenshots from the old and new Azure portals for adding an input:
Fig 1. The input options on Azure old portal (Step 1)
Fig 2. The options for Data stream (Step 2)
Fig 3. The option for Reference data (Step 2)
Fig 4. The input options on Azure new portal
I'm evaluating Kafka/Spark/HDFS for developing an NRT (sub-second) Java application that receives data from an external gateway and publishes it to desktop/mobile clients (consumers) for various topics. At the same time the data will be fed through streaming and batching (persistent) pipelines for analytics and ML.
For example the flow would be...
A standalone TCP client reads streaming data from external TCP server
The client publishes data for different topics based on the packets (Kafka) and passes it to the streaming pipeline for analytics (Spark); see the sketch after this list.
A desktop/mobile consumer app subscribes to various topics and receives NRT data events (Kafka)
The consumer also receives analytics from the streaming/batch pipelines (Spark)
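To illustrate step 2, here is a rough sketch of how the standalone client might fan packets out to Kafka topics, shown with the kafkajs Node.js client in TypeScript for brevity (broker addresses, topic names and the packet shape are placeholders; the same structure applies to a Java producer):

    import { Kafka } from "kafkajs";

    // Placeholder broker list; the real cluster would be configured here.
    const kafka = new Kafka({ clientId: "gateway-bridge", brokers: ["broker-1:9092"] });
    const producer = kafka.producer();

    // Route an incoming gateway packet to a topic based on its type.
    async function publishPacket(packet: { type: string; siteId: string; payload: unknown }): Promise<void> {
      await producer.send({
        topic: packet.type, // e.g. "humidity" or "pollution"
        messages: [{ key: packet.siteId, value: JSON.stringify(packet.payload) }],
      });
    }

    async function main(): Promise<void> {
      await producer.connect();
      await publishPacket({ type: "humidity", siteId: "site-1", payload: { value: 53.2 } });
      await producer.disconnect();
    }

    main().catch(console.error);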
Kafka clusters have to be managed, configured and monitored for optimal performance and scalability. This may require additional personnel and tooling to manage the operation.
Kafka, Spark and HDFS can optionally be deployed over Amazon EC2 (or Google Cloud using connectors).
I was reading about Google Cloud Dataflow, Cloud Storage, BigQuery and Pub/Sub. Dataflow provides autoscaling and tools to monitor data pipelines in real time, which is extremely useful. But the setup has a few restrictions, e.g. Pub/Sub push requires the client to expose an HTTPS endpoint, and the app deployment needs a web server, e.g. an App Engine webapp or a web server on GCE.
This may not be as efficient (I'm concerned about latency when using HTTP) as deploying a bidirectional TCP/IP app that can leverage the Pub/Sub and Dataflow pipelines for streaming data.
Ideally, the preferable setup on Google Cloud would be to run the TCP client (connecting to the external gateway) on GCE and push data via Pub/Sub to the desktop consumer app. In addition, it would leverage the Dataflow pipeline for analytics, and Cloud Storage with Spark for ML (the Prediction API is a bit restrictive), using the Cloudera Spark connector for Dataflow.
One could deploy Kafka/Spark/HDFS etc. on Google Cloud, but that somewhat defeats the purpose of leveraging the Google Cloud technology.
I'd appreciate any thoughts on whether the above setup is possible using Google Cloud, or whether we should stay with EC2/Kafka/Spark etc.
Speaking about the Cloud Pub/Sub side, there are a couple of things to keep in mind:
If you don't want to have a web server running in your subscribers, you could consider using the pull-based subscriber instead of the push-based one. To minimize latency, you want to have at least a few outstanding pull requests at any time; a sketch of the pull-based option follows below.
Having your desktop consumer app act as a subscriber to Pub/Sub directly will only work if you have no more than 10,000 clients; there is a limit of 10,000 subscriptions. If you need to scale beyond that, you should consider Google Cloud Messaging or Firebase.
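As a rough sketch of the pull-based option, using the @google-cloud/pubsub client in TypeScript (the subscription name is a placeholder, and credentials are assumed to come from the environment or the GCE service account):

    import { PubSub, Message } from "@google-cloud/pubsub";

    // Placeholder subscription; streaming pull keeps several requests outstanding for low latency.
    const pubsub = new PubSub();
    const subscription = pubsub.subscription("my-telemetry-subscription", {
      flowControl: { maxMessages: 100 }, // allow up to 100 unacked messages in flight
    });

    subscription.on("message", (message: Message) => {
      // Handle the telemetry event, then acknowledge it.
      console.log(`Received: ${message.data.toString()}`);
      message.ack();
    });

    subscription.on("error", (err: Error) => {
      console.error("Subscription error:", err);
    });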
From the Dataflow side of things, this sounds like a good fit, particularly as you'll be mixing streaming and batch style analytics. If you haven't yet, check out our Mobile Gaming walkthrough.
I'm not quite sure what you mean about using Cloudera's Dataflow/Spark runner for ML. That runner runs Dataflow code on Spark, but not the other way around.