How to Handle or Architecture, incremental data ingestion in Azure data lake Store? - azure

I've Two Custom code dll, for Image related to IP Cams.
dll-One : Extract image from IP cams and can be stored it to Azure data lake Store.
Like :
/adls/clinic1/patientimages
/adls/clinic2/patientimages
dll-two : Use those image and extract information from it and load data into RDBMS tables.
So for instance in RDBMS ,say there are entities dimpatient, dimclinic and factpatientVisit.
For start, a one time data can be exported to defined location in Azure data lake store.
Like:
/adls/dimpatient
/adls/dimclinic
/adls/factpatientVisit
Question :
How to push incremental data in same file or how we can handle this incremental load in Azure data Analytics?
This like implementing Warehouse in Azure Data Analytics.
Note: Azure SQL db or any other storage offered by Azure is not want to.
I mean why to spend in other Azure Services if one type of storage has capabilities to hold all types of data.
adls is name of my ADLS storage.

I am not sure I completely understand your question, but you can organize your data files in Azure Data Lake Store or your rows in partitioned U-SQL tables along a time dimension, so you can add new partitions/files for each increment. In general, we recommend that such increments are of substantial sizes though to preserve the ability to scale.

Related

How good is Azure Data Lake for storing an SQL database used for Power BI visualizations?

We have an Azure SQL database where we collect a large amount of sensor data and we regularly extract the data from it and transform it a bit with a python script. The end result is a pandas DataFrame file. We would like to store the transformed data in an Azure database and use it as a source of a power BI dashboard.
On the one hand, we want to show the "almost" real-time data on a dashboard (the latency due to the transformation etc. is acceptable, but the dashboard needs to refresh very frequently, let's say once a minute), but we also want to store the transformed data and query it later e.g. to visualize the data only for a given day.
Is it possible to convert the pandas DataFrame into SQL and store it on Data Lake and stream the data from there? I read that it is possible to store structured data on Data Lake and even query it, but I am unsure if this would be the best solution.
(My current task is to choose the best database for storing the transformed data to enable both streaming and querying it later. I am very new in Azure products and I don't have a sandbox account yet to even try around and identify possible pitfalls. I've just figured out that PowerBI does not support DirectQuery for DataLake and I feel like this can be an issue - meaning we would have to query the data on DataLake at first and store it somewhere if we wanted to visualize a subset, is that correct?)
Azure Datalake is not a database, just a store for the data both structured and unstructured, so as mentioned you can't direct query it unless you have some compute capacity (Databricks, Azure Synapse, Azure DataLake Analytics, Power BI Premium with enhanced compute)
Depending on your approach, it may be best to move from Azure SQL Database and Pandas, to Azure Databricks, that can ingest the streaming data, transform, and provide an outputted table that is stored in the data lake. You will then connect Power BI to the Databricks instance and query that. The data will only be available while the cluster is running.
Moving to Databricks, will involve rewriting your Panda code to Koalas, or preferably Pyspark.
You do have the option of using Databricks to write the items back to a Azure SQL Database table. Depending on what transformations you are doing you could keep it all in Azure SQL, or if it is sensor data streaming, take the data through Azure Event Hubs, to Azure Streaming Analytics (does transformations), to Azure SQL Database (store Realtime and historical).

Need solution to integrate Grafana with Azure data lake

I want to integrate Azure data lake storage with Grafana for visualization of time series data. I need to know what all the tools I can use to make it possible.
I used ADF to extract data from csv files stored in data lake and move to a table in Azure data explorer. After that, I used Azure data explorer plugin in grafana to visualize the same. It worked fine. But I need to know is there any other approach which may be better or cost-effective.
Integrating Grafana with Azure Data Lake is the best option when compared to others because the other options include data movements using ADF and additional cost for Azure SQL Datawarehouse along with the cost of PowerBI.
Reason:
Grafana is a leading open source software designed for visualizing time series analytics. It is an analytics and metrics platform that enables you to query and visualize data and create and share dashboards based on those visualizations. Combining Grafana’s beautiful visualizations with Azure Data Explorer’s snappy ad hoc queries over massive amounts of data, creates impressive usage potential.
The Grafana and Azure Data Explorer teams have created a dedicated plugin which enables you to connect to and visualize data from Azure Data Explorer using its intuitive and powerful Kusto Query Language. In just a few minutes, you can unlock the potential of your data and create your first Grafana dashboard with Azure Data Explorer.
For more details on visualizing data from Azure Data Explorer in Grafana please visit our documentation, “Visualize data from Azure Data Explorer in Grafana”.
Other options:
For Azure Data Lake Gen1:
You can use a mix of services to create visual representations of data stored in Data Lake Storage Gen1.
You can start by using Azure Data Factory to move data from Data Lake Storage Gen1 to Azure SQL Data Warehouse.
After that, you can integrate Power BI with Azure SQL Data Warehouse to create visual representation of the data.
For Azure Data Lake Gen2:
You can use a mix of services to create visual representations of data stored in Data Lake Storage Gen2.
You can start by using Azure Data Factory to move data from Data Lake Storage Gen2 to Azure SQL Data Warehouse.
After that, you can integrate Power BI with Azure SQL Data Warehouse to create visual representation of the data.
Hope this helps.
They just released a new guide. This is for Grafana 5.3
https://learn.microsoft.com/en-us/azure/data-explorer/grafana
you are able to test this by running Grafana in a Docker container (or for real, if you want). I followed the guide, and it is working almost exactly as expected. The only issue I am having is Grafana is concatenating the column name and the data in the column, making reading and formatting tricky.

Azure Data lake VS Azure HDInsight

I was going through the Microsoft documents:
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
I'm new to Azure Data lake and HDInsight. There is a statement in the URL which tells that
"Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs."
As per my initial understanding, Data lake store is a store in which any kind of data can be stored. I think, HDInsight also kind of does the same thing.
My question is what is the difference between Azure Data lake and Azure HDInsight? If HDInsight can be used for file storage or any kind of storage then Why to use Data Lake?It would be great if some one could clarify this in details. Thanks.
The easiest way to think of Data Lake is to think of this large container that has like a real lake with rivers coming into the river you never know where the rivers are coming from (or what "type" of river). Azure Data Lake was introduced to make big data easy for developers, data scientists, and analysts to store data of any size. It removes the complexities of ingesting and storing all your data while making it faster to get up and running with big data. Data Lake is able to stored the mass different types of data (Structured data, unstructured data, log files, real-time, images, etc. ) and to blend that together, to correlate many different data types. The key thing here is as we are moving from traditional way to the modern tools (like Hadoop, Cassandra, NoSQL DB, etc). Azure Data Lake includes three services:
Azure Data Lake Store, a no limits data lake that powers big data
analytics
Azure Data Lake Analytics, a massively parallel on-demand
job service
Azure HDInsight, a full managed Cloud Hadoop and Spark
offering
Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size. We can run services on top of the data that's in that store. So you could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake analytic service, which is a complement to the Azure Data Lake Store. And what that service will let you do is to run jobs that effectively query the data you have stored in the Azure Data Lake store and generate output results.
In nutshell,
Hdinsight is a managed hadoop service (to provide compute support)
Azure Data lake(ADL) is a managed storage service (to provide large amount of storage support)
(Instead of ADL, you can alternatively choose to use Blobs in HDinsight, but Blobs have some limitations (like file streaming to storage via hdinsight cluster is not supported)
Here is the definition from Azure documentation (below):
Azure uses "decomposed hardware method"
You can relate or assume HDinsight as a Hadoop Cluster, Azure Data lake (ADL) as HDFS. But they are detached.
If you want to relate with AWS, HDInsight is equivalent to EMR and ADL is equivalent to EMRFS or S3
If you terminate the cluster, ADL storage stays with the files stored in it. You can access the storage directly using another service or tool (like Azure Data bricks) or you can create one another hdinsight cluster on top of the data.
Hdinsight access the ADL using adl:// , and hdinsight never
store the file blocks in the nodes (like Hadoop does), rather it has
mappings to storage service.
Azure Data Lake Store, is just that a data store. HDInsight can also do that in the cluster that you spin up. However, when you stop that cluster, the data also goes away.
It is common that customers use either Azure Data Lake Store, or Azure storage to provide permanent storage separate from the cluster (compute) used to process the data.
Guy
HDInsight is the analytics service whereas the Azure Data Lake Storage is the storage service. You most likely need both to have functional analytics cluster.
HDInsight provides the cluster, fully manages the open-source packages for analytics (Hadoop, Spark ...etc), and you set up your cluster to use Azure Data Lake Storage which support HDFS API ( Hadoop FileSystem ) on top of Cloud Storage.
Azure Data Lake Storage Gen2 is what you are supposed to start looking at which merges the benefits of both Azure Storage and ADLS in one service.
ADLS Gen 2 documentation - https://learn.microsoft.com/en-us/azure/storage/data-lake-storage/introduction
Azure Data Lake Analytics provides server less compute while using Azure Data Lake Store for data storage, whereas in HDInsight,we need to specify and design for Compute Virtual Machine nodes as per processing requirements. It may be advantageous for developers to work with server less compute in Azure Data Lake Analytics, as scaling needs of Analytics Job are taken care out of box.

Where are Azure Data Lake Analytics databases stored?

I created a database with some tables through a U-SQL script run through the Azure Data Lake Tools for Visual Studio (see screenshot below). Is that database stored in the Data Lake Store?
The file structure as shown in the Azure portal
In addition to Amit's answer:
The data that is stored in the store is stored in the \catalog folder of your default ADLS account. It will be charged at the same rate as the remaining data.
The cost of the data that is stored in the internal metadata service is internalized into the ADLA COGS calculations.
Some of the artifacts related to databases are stored in the Azure Data Lake Store. However not all of the artifacts related to databases are stored in the associated ADLS account. More specifically some of the metadata associated with the databases are stored in a ADL service-managed internal location that is not directly accessible to you. What you will see in the ADLS account is the data associated with the tables and databases in an internal format. Hope this information is useful.
Thanks,
Amit

using Azure Data Lake for Analytics

Currently as part of our requirements we are working with the below Azure components
Azure Event Hub
Azure Stream Analytics
Azure Table Storage
Azure Sql DB
Basically with first 3 components, we will be building an Analytics and Reports platform.
Currently as we just started we analyze the data from Azure Table Storage and display it in the analytics dashboard.
Recently we came across a new Azure product Azure Data Lake . Doing some research on microsoft website , we could see we can easily migrate data from Azure Table Storage (with help of Azure Data Factory) to Azure Lake Store. Creating big data pipelines using Azure Data Lake and Azure Data Factory
As we go through the above link, it's mentioned that we need to create an Azure Data Lake Analytics pipeline to process the data.
So what am unclear is the where will be analytics output data will be saved. Do we need to save the analytics output to some DB ? or can we real-time analytics through a Http request ?
We have huge number rows of records in Azure Table Storage that will be moved to Azure Data Lake. For this scenario is it a good option or Can we go an analytics-based solution from Azure Table Storage itself.
Please share your thoughts
You can store your analytics output data on Azure Data Lake Store (a data repository that enables you to store all kinds of data in their raw format without defining schemas.) after processing it through Azure Data lake Analytics (An analytics service that enables you to run jobs on data sets without having to think about clusters.)
As you said "We have huge number rows of records in Azure Table Storage that will be moved to Azure Data Lake.", I think performing analytics on data placed on Azure data lake store is much more efficient because it offers unlimited storage with immediate read/write access to it and scaling the throughput you need for your workloads. It's also offers small writes at low latency for big data sets. So I believe it is better choice then Azure Table storage.

Resources